
BERT2DAb: a pre-trained model for antibody representation based on amino acid sequences and 2D-structure

Article: 2285904 | Received 25 Jun 2023, Accepted 16 Nov 2023, Published online: 27 Nov 2023

ABSTRACT

Prior research has generated a vast amount of antibody sequences, which has allowed the pre-training of language models on amino acid sequences to improve the efficiency of antibody screening and optimization. However, compared to those for proteins, there are fewer pre-trained language models available for antibody sequences. Additionally, existing pre-trained models solely rely on embedding representations using amino acids or k-mers, which do not explicitly take into account the role of secondary structure features. Here, we present a new pre-trained model called BERT2DAb. This model incorporates secondary structure information based on self-attention to learn representations of antibody sequences. Our model achieves state-of-the-art performance on three downstream tasks, including two antigen-antibody binding classification tasks (precision: 85.15%/94.86%; recall: 87.41%/86.15%) and one antigen-antibody complex mutation binding free energy prediction task (Pearson correlation coefficient: 0.77). Moreover, we propose a novel method to analyze the relationship between attention weights and contact states of pairs of subsequences in tertiary structures. This enhances the interpretability of BERT2DAb. Overall, our model demonstrates strong potential for improving antibody screening and design through downstream applications.

Introduction

The high specificity and strong neutralizing ability of antibodies have made them highly valued in the prevention and treatment of diseases such as tumors and viral infections.Citation1–3 Compared to standard antibody discovery approaches, computer-based screening methods offer a promising alternative in terms of efficiency and cost-effectiveness.Citation4–6 Computer-aided screening of neutralizing antibodies typically involves several steps, including structural modeling, affinity prediction, and developability assessment, and this process usually requires multiple iterations.Citation7,Citation8 Compared to using structures as inputs, the use of sequences can significantly reduce computational requirements and is generally associated with lower data acquisition hurdles for accomplishing the aforementioned tasks. Leveraging an extensive collection of antibody sequences obtained through sequencing efforts performed by other researchers, we can pre-train an antibody language model (ALM) to subsequently enable downstream tasks related to antibody screening based on sequence information.Citation9–11

Within the realm of protein sequences, several protein language models (PLMs) have been developed that leverage general protein datasets (e.g., UniProtKB and BFD) to learn common sequence features and transfer these features to specific tasks such as biophysical property prediction.Citation12–14 ESM2, ProtTrans, and UniRep are examples of PLMs.Citation13,Citation15,Citation16 Among these models, the first two are based on self-attention and include multiple models with varying parameter sizes, while the last one is a single model based on LSTM. However, the evolution of antibodies relies on gene rearrangement and somatic hypermutation induced by antigens, which is fundamentally distinct from the evolution of ordinary proteins (e.g., enzymes, structural proteins, transport proteins).Citation17 Consequently, antibody sequences may possess unique features, and using PLMs to capture characteristics of antibody sequences may be inappropriate. AbLang, PARA, and AntiBERTy are all ALMs trained on antibody sequence datasets using self-attention.Citation18–20

While these language models have achieved promising performance in tasks such as structure modeling, there is still room for further optimization.Citation21 First, with the exception of AbLang, all other language models use a single model to embed both light and heavy chains. However, significant differences exist between the two chains in terms of sequencing data availability (which affects the amount of available training data), physicochemical properties, and gene expression characteristics (supplementary Fig.S1). As a result, using the same model to learn and represent sequences of both chains may dilute features specific to the light chain. Second, all of the aforementioned language models use amino acids as the fundamental embedding unit for pre-training. This approach may not effectively consider the collective impact of secondary structures on the spatial structure and functionality of antibodies. Additionally, previous studies have revealed the presence of linear motifs of varying lengths within protein sequences, which essentially represent conserved fragments in the sequence.Citation22 By referring to frequency-based vocabulary construction methods used in natural language processing, Asgari and colleagues developed a vocabulary of protein conserved sequence fragments that has been applied to tasks such as motif recognition, toxicity prediction, subcellular localization prediction, and protease prediction, with promising performance outcomes reported.Citation23

The secondary structure of antibodies is closely related to antibody-specific antigen recognition and stability.Citation24,Citation25 Firstly, the complementarity-determining regions (CDRs), especially CDR H3, are the most variable regions in antibodies. The secondary structure of these regions is related to the positioning and overall conformation of the antibody binding site. Their variations affect the orientation and exposure of the CDRs, determining whether optimal spatial matching can occur with the antigen, thereby achieving highly specific binding between the antigen and antibody. Secondly, the secondary structure of the framework region is relatively conserved, forming a stable scaffold that ensures the overall stability of the antibody. Past research has shown that mutations can occur in the framework regions of antibodies, resulting in changes in secondary structure and subsequently affecting the stability and functionality of the antibodies.Citation26 In summary, the close association between the secondary structure of antibodies and their stability and functionality suggests that we need to consider the impact of the secondary structure as a whole when designing models.

Here, we propose a pair of antibody language models specifically designed for variable (V) regions that utilize self-attention and incorporate secondary structure information. These models undergo separate pre-training on the light and heavy chain sequence data and are collectively referred to as Bidirectional Encoder Representation from Transformers for Antibody Sequences based on Secondary Structure (BERT2DAb), which includes BERT2DAb_H and BERT2DAb_L (). We demonstrated the performance of BERT2DAb across several tasks, including classification of binding of mutant trastuzumab to human epidermal growth factor receptor 2 (HER2), classification of binding of multiple antibodies to multiple coronavirus antigens, and prediction of the change in binding free energy (ΔΔG) after antibody mutation (highlighted by the red box in ). While deep neural networks can be challenging to interpret due to opaque network parameters, explainability is essential in biomedical research.Citation27 To address this, Vig and colleagues explored the relationship between attention weights and tertiary structure, inspiring our novel approach to explore the interpretability of BERT2DAb.Citation28 We examine the association between the attention weights of pre-trained models and contact states of pairs of subsequences (highlighted by the yellow box in ).

Figure 1. Workflow. Steps ‘a’ through ‘e’ correspond to data cleaning (a), sequence splitting (b), subsequence vocabulary construction (c), pre-training (d), and pre-training model evaluation (e), respectively. In step ‘b’, the ‘H’ refers to 3₁₀-helix, alpha-helix, and pi-helix, the ‘e’ refers to beta-bridge and beta-strand, and the ‘C’ refers to high curvature loop, beta-turn, and coil. In step ‘c’, the vocabulary training examples are highlighted using a light blue background. The symbol ‘##’ indicates that the subsequence is located within a secondary structure.

Results

BERT2DAb

We trained BERT2DAb_H and BERT2DAb_L separately for 12,150.8K and 976.4K steps, respectively. The training process took approximately three weeks. The final losses were 1.09 and 0.69 for BERT2DAb_H and BERT2DAb_L, respectively. Each model consisted of approximately 110 million trainable parameters. Pre-training was run in parallel on 8 A100 GPUs.

We applied t-SNE to perform dimensionality reduction on the antibody sequence embeddings obtained from BERT2DAb and visualized the results. This showcases the model’s ability to extract diverse B-cell features from the sequences (supplementary Fig.S2).Citation29

Results of three downstream tasks

The classifier based on BERT2DAb can distinguish antibodies that bind to HER2 (multiple antibodies and an antigen)

Here, we aimed to evaluate BERT2DAb by using the classifier based on it to screen for trastuzumab mutants that can bind to HER2. Firstly, we trained a classifier using the Tra_dataset and compared it with the CNN classifier used in the original study through testing. We plotted the Receiver Operating Characteristic Curve (ROC) and Precision-Recall Curve (PR Curve) based on the output results (P) of the testing set. As shown in , the area under the ROC curve for the classifier based on BERT2DAb was 0.90 and the average precision was 91.0%, while the area under the ROC curve for the classifier used in the original study was 0.91 and the average precision was 83.0%.

Figure 2. ROC and PR curve. The AUC of the ROC and average precision are robust measures of model accuracy and precision. The blue dashed line on the ROC curve represents the ROC for a random classifier.

To compare with other language models, we trained and tested five language models using the balanced_Tra_dataset. As shown in , the classifier based on BERT2DAb achieved the best performance in terms of F1-score (86.26%), Recall (87.41%), and Accuracy (85.46%), especially Recall, which improved by nearly 1.5% compared to the AbLang model that was also pre-trained using antibody sequences. The classifier based on BERT2DAb achieved the second best performance on precision (85.15%). Moreover, we observed that the results of classifiers based on ALMs (BERT2DAb, AbLang, and AntiBERTy) were superior to those of classifiers based on PLMs (esm2_t30_150M_UR50D and ProtBert).Citation15

Table 1. Benchmark results in classifying the binding of trastuzumab mutants to HER2 (%).

Due to the difficulty in obtaining labeled data, we further evaluated the performance of our model using subsets of balanced_Tra_dataset at 100%, 60%, and 20% (). The results demonstrated that the classifier based on BERT2DAb achieved high classification accuracy even when trained with limited data, indicating its robustness and efficiency.

Table 2. Performance of BERT2DAb-based classifier with different training data volumes in classifying the binding of trastuzumab mutants to HER2 (%).

The classifier based on BERT2DAb can screen antibodies that bind to coronaviruses (multiple antibodies and multiple antigens)

To compare with other language models, we also trained and tested five language models using CoV-Ab-Bind (to evaluate the classification performance for the binding of multiple antibodies to multiple coronaviruses). As shown in , the classifier based on BERT2DAb achieved the best performance in terms of F1-score (90.30%), Recall (86.15%), and Accuracy (83.54%). Based on F1-score, the performance of the classifiers based on ALMs ((88.81%+89.33%+90.30%)/3 = 89.48%) is superior to that of the classifiers based on PLMs ((89.78%+87.11%)/2 = 88.45%).

Table 3. Benchmark results in classifying the binding of antibodies to coronaviruses (%).

The predictor based on BERT2DAb can predict ΔΔG of antigen-antibody complexes

ΔΔG is an important parameter for evaluating the reversibility of chemical reactions; it reflects the energy difference between molecules in different states, such as the free energy change before and after mutation in antigen-antibody complexes.Citation30,Citation31 As a quantitative indicator, ΔΔG is often used to describe changes in binding free energy in biological macromolecular systems, such as receptor-ligand, enzyme-substrate, and protein complexes. Specifically, when the ΔΔG value is negative, the reaction is more likely to occur; conversely, when the value is positive, progression of the reaction is difficult and requires external energy input. In the case of the antigen-antibody complex before and after mutation, a negative ΔΔG value indicates that the complex is more stable after mutation; conversely, a positive ΔΔG value indicates that the complex is less stable than the original.

ΔΔG is a more direct indicator of the change in affinity before and after mutation of an antigen-antibody complex. Here, we evaluated the performance of the predictor based on BERT2DAb for predicting ΔΔG and compared it with other language models. As shown in , the ΔΔG prediction based on BERT2DAb demonstrated excellent performance on both datasets (AB-Bind-allMut and AB-Bind-single). For single point mutations, the TopNetTree model,Citation32 which is based on tertiary structure information, achieved an Rp of 0.65 (the previous state of the art), while our model achieved an Rp of 0.77. Moreover, our predictor can handle multiple point mutations (currently few predictors can accommodate this situation), with an Rp of 0.73. Compared with other language-model-based predictors that use only sequences as inputs, the BERT2DAb-based predictor demonstrated significant advantages.

Table 4. Benchmark results in predicting ΔΔG.

The attention weights of BERT2DAb reflect the 3D structure of the antibody

To investigate the relationship between the attention weights of token pairs in BERT2DAb and the contact status of antibody subsequences, we fitted a binary logistic regression (using a linear model to analyze the information carried by the attention weights) and further visualized the connection using the contact correlation coefficient.

Firstly, as shown in , the model achieved the highest accuracy of 77.71% (l7) and 73.59% (l3) for heavy chains and light chains, respectively. The performance was better for heavy chains, possibly due to the larger pre-training dataset compared to light chains. Additionally, higher hidden layers exhibited higher sensitivity, while lower hidden layers demonstrated higher specificity. This could be attributed to the ability of higher hidden layers to learn short-range relationships (contacts), whereas the lower hidden layers learned long-range relationships (non-contacts), resulting in different spatial information expression capabilities for each hidden layer. Moreover, we identified the top three attention heads in the logistic regression for each hidden layer, sorted by the size of the regression coefficient (supplementary Table S1).

Table 5. Results of logistic regression (%).

Secondly, heatmaps of contact correlation coefficients for different attention heads and hidden layers were generated and displayed in . Overall, the contact correlation coefficients of each attention head gradually increased as the threshold increased, with some attention heads exceeding 90% when the threshold was greater than 0.2. This result indicated that, in these attention heads, pairs of subsequences were more likely to be in contact in the tertiary structure if the attention weight of the token pair was large. To further demonstrate the relationship between attention weights and subsequence pair contacts, we used PyMOL (2.6) to visualize the tertiary structures of three randomly selected antibodies () and annotated the spatial distance and attention weights of pairs of subsequences. The visualization showed that, in the same attention head, the attention weight of a contact subsequence pair was greater than that of a non-contact subsequence pair. We believe that the pre-trained model extracts spatial features from pairs of subsequences. In another study, under a threshold of 0.3, the highest “contact correlation coefficient” in TapeBert, ProtAlbert, ProtBert, ProtBert-BFD, and ProtXLNet was 44.7%, 55.7%, 58.5%, 63.2%, and 44.5%, respectively,Citation28 while our study achieved a maximum of 99.99%. We regarded the basic structural unit as a whole, rather than using embedded representations of single amino acids, which may better preserve protein topology information or motif features. Consequently, the structural features learned by the pre-training model are more in line with the essence of the tertiary structure of antibody molecules.

Figure 3. Contact correlation coefficient. Darker colors correspond to higher proportions in the heatmap, with the horizontal axis representing the attention head and the vertical axis representing the hidden layer. Blank cells indicate no token pair with weight greater than the threshold in the attention head.

Figure 4. Relationship between attention weights and contact states of pairs of subsequences. The subfigures display the structures of the heavy chain (upper row) and light chain (lower row) for each antibody. We analyzed BERT2DAb_H through l2_h8 and l11_h6, as well as BERT2DAb_L through l2_h12 and l4_h9. Minimum distances between control, contact, and non-contact subsequences are indicated by red dotted lines, with a distance less than 8 Å considered as contact. In the same attention head, the attention weight (orange box) of a contact subsequence pair is greater than that of a non-contact subsequence pair. The loop highlighted in red corresponds to CDRH3 or CDRL3.

Discussion

We utilized secondary structure information to construct the vocabulary for antibody subsequences and trained the Transformer-based pre-training model BERT2DAb, which comprises the light chain model (BERT2DAb_L) and heavy chain model (BERT2DAb_H). The attention weights were statistically connected with the tertiary structure of antibodies, thus demonstrating that our pre-trained model had learned spatial information from sequence data. This demonstrates a degree of interpretability for BERT2DAb: the information learned by the model from the sequences and preserved in the form of model parameters is reflected in the structure of antibodies. This indicates that the model indeed learns relevant features, which are similar to the information reflected by structure modeling tools based on self-attention mechanisms, such as AlphaFold2 and IgFold.Citation21,Citation33 In the future, BERT2DAb could potentially be used for fragment-based structure modeling and fragment-based paratope recognition tasks. Moreover, BERT2DAb accomplished critical tasks in developing neutralizing antibodies, such as antibody-antigen-specific binding classification and ΔΔG prediction.

We found that classifiers or predictors based on ALMs outperformed those based on PLMs in three downstream tasks. We believe that the main reason for this is the significant differences between protein and antibody sequences in terms of variability (antibodies have highly variable CDR sequences for recognizing different antigens, whereas protein sequences are more conserved and stable) and secondary structure composition (antibodies typically use beta sheet and loop structures to form their variable regions for antigen binding, while proteins generally have more regular structures such as alpha helices).Citation34 The incorporation of secondary structure information may enable BERT2DAb to extract additional features (such as the composition of secondary structures) from the sequence, leading to improved performance of the model in downstream tasks.

Moreover, in two binary classification tasks, we found that classifiers based on BERT2DAb achieved higher recall (87.41% and 86.15%) compared to classifiers based on other language models. We attribute this mainly to the incorporation of secondary structure information. The secondary structure of an antibody can affect the conformation of its CDRs and ultimately its binding affinity. Certain types of secondary structures, such as beta sheets and turns, are often present in CDRs and can contribute to antigen recognition and binding. Furthermore, the orientation and flexibility of these secondary structures can affect the geometry of the binding interface and the stability of the antibody-antigen complex.Citation6,Citation35 By embedding the secondary structure information into the representation, we can highlight its role as a whole.

The superior performance of BERT2DAb in the aforementioned antibody function prediction can be attributed to the incorporation of secondary structure information. To further illustrate this point, we conducted validation experiments on BERT2DAb for paratope prediction (supplementary Table S2) and secondary structure prediction (supplementary Table S3). For paratope prediction, BERT2DAb achieved the highest F1-score (69.3%) and, compared to AntiBERTy, Parapred, and ProtBERT, demonstrated a relatively high recall (72.8%) while maintaining a satisfactory precision (66.1%).Citation15,Citation36,Citation37 The higher recall can be attributed to the fragment embedding representation based on secondary structure, which enables the model to capture spatial features of the fragments and facilitates the identification of segments that can form optimal spatial matches with antigens. These segments serve as the building blocks for the complete antibody binding sites (conformational epitopes). Regarding secondary structure prediction, BERT2DAb outperformed AntiBERTy and AbLang, achieving the highest accuracy (VH: 99.5%, VL: 99.4%).Citation18,Citation19 These results demonstrate the explicit incorporation of secondary structure information by BERT2DAb, enabling downstream models to effectively capture features related to variations in secondary structure types. In conclusion, the incorporation of secondary structure information through fragment embedding representations makes BERT2DAb more capable of predicting whether an antibody can tightly bind to an antigen, thereby facilitating antibody function (e.g., binding/non-binding) prediction.

There are some limitations in our study that need to be further investigated in future work. Firstly, although we validated BERT2DAb in various related tasks, it has yet to be applied to tasks such as structure modeling and developability prediction. Therefore, we plan to expand the model’s application scenarios and develop a versatile antibody-specific model. Our model embeds subsequences, making it possible to attempt modular antibody design and three-dimensional structure modeling, or to add constraints to complete antibody generation or design. Secondly, we used a simple neural network to accomplish the downstream tasks, and further tuning is required to find an output module more suitable for each specific task. Thirdly, the attention mechanisms and relationships that drive productive and functional heavy and light chain pairing are paramount for structural stability as well as specificity, yet they have been less explored by the language models reported so far. This could limit the generalizability of the pre-trained models when transferring to diverse, real-world antibody tasks. We therefore aim to further pre-train BERT2DAb to elucidate the underlying patterns governing the pairing of light and heavy chains and to learn the inter-chain spatial relationships between them. Fourthly, due to time constraints and limited computational resources, the current model has a parameter count of approximately 110 M. In future work, we intend to train even larger models to further enhance model performance.

In conclusion, our pre-training model provides insights into the relationship between amino acid sequences and protein structures, offering a promising approach for antibody optimization and drug development. Our findings suggest that integrating secondary structural information can effectively enhance the performance of pre-training models for antibody-related tasks.

Materials and methods

Data

Antibody sequences for pre-training

We conducted pre-training using human V regions sequences from Observed Antibody Space (OAS), which currently holds the largest collection of antibody sequence data sourced from various studies.Citation38 Following data cleaning, the light (VL, the V region of light chain) and heavy (VH, the V region of heavy chain) chain datasets contained 1.2 billion and 33 million sequences, respectively ().

Datasets for benchmarks

To evaluate the performance of BERT2DAb, we fine-tuned the language model on three benchmark datasets related to antibody screening. Firstly, the trastuzumab mutation dataset, which we named Tra_dataset, included 38,839 sequences of heavy chain CDR 3 (CDRH3) mutations of trastuzumab (we supplemented the sequence of the heavy chain V region except CDRH3 and the sequence of the light chain V region) and their binding outcomes to HER2 in wet lab experiments (‘1’ for binding and ‘0’ for non-binding), with 11,300 (29.09%) of them being binding sequences.Citation8 To facilitate baseline comparison with other language models, we randomly sampled a subset of 21,600 sequences (with approximately 50% binding samples) from the Tra_dataset and named it balanced_Tra_dataset. Secondly, the coronavirus antibody dataset was derived from CoV-AbDab and curated as CoV-Ab-Bind, containing 15,912 antibody VH and VL sequences along with their binding outcomes to various strains and subtypes of coronavirus (‘1’ for binding and ‘0’ for non-binding), including SARS-CoV1, SARS-CoV2_Alpha, SARS-CoV2_Omicron-BA1, SARS-CoV2_WT, SARS-CoV2_Omicron-BA2.11, and MERS.Citation39 We represented the antigenic epitope (S protein RBD or NTD) sequences as a feature vector of length 147 using Pyprotein,Citation40 as the antigenic feature representation. Binding samples comprised 88% of CoV-Ab-Bind. Thirdly, the AB-Bind dataset is commonly used as a benchmark for ΔΔG (supplementary Formulas 1–2) prediction, consisting of 1,101 data points of measured ΔΔG for antigen-antibody complex mutations.Citation41 We extracted two subsets, namely AB-Bind-allMut (681) and AB-Bind-single (536), where the former contains samples with multi-point mutations while the latter contains only single-point mutations. We extracted the antibody and antigen sequences from the wild-type complex (PDB) and obtained the mutated antibody and antigen sequences based on the mutation sites. Similarly, we represented the antigen sequences as feature vectors using Pyprotein.Citation40 In AB-Bind-allMut and AB-Bind-single, the antibody sequence refers to the VH and VL sequences, while the antigen sequence is the full sequence extracted from the PDB (we excluded complexes with multiple peptide chains).
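The 147-dimensional antigen feature vector matches the length of the CTD (composition/transition/distribution) protein descriptor set; the following is a minimal sketch of how such a vector could be computed, assuming PyBioMed's PyProtein/CTD module is the descriptor source (the exact descriptor set used in the study may differ).

```python
# Sketch: encoding an antigen sequence as a fixed-length descriptor vector.
# Assumption: the 147-dimensional "Pyprotein" feature vector corresponds to
# PyBioMed's CTD descriptors, which also have length 147.
import numpy as np
from PyBioMed.PyProtein import CTD

def antigen_features(sequence: str) -> np.ndarray:
    """Return a fixed-length descriptor vector for one antigen sequence."""
    descriptors = CTD.CalculateCTD(sequence)   # dict: descriptor name -> value
    names = sorted(descriptors)                # fixed feature order across antigens
    return np.array([descriptors[n] for n in names], dtype=np.float32)

example_seq = "NITNLCPFGEVFNATRFASVYAWNRKRISN"  # illustrative antigen fragment
vec = antigen_features(example_seq)
print(vec.shape)  # (147,)
```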

Dataset for exploring the relationship between attention weights and subsequence pair contacts

In tertiary structure, the closer the pairs of amino acids are, the more likely they are to make contact.Citation42 According to the definition used in the Critical Assessment of protein Structure Prediction (CASP) competition, two amino acids are considered to be in contact if their Cβ atoms are less than 8 Å apart.Citation43 Based on this definition, we assumed that two subsequences were in contact if they contained contacting amino acid pairs (<8 Å).

The Thera-SabDab dataset, which includes 828 antibody sequences, was used to perform an interpretability analysis of BERT2DAb.Citation44 Firstly, the light and heavy chains of the antibodies in Thera-SabDab were modeled using IgFold.Citation21 The contact relationship between each subsequence and other subsequences was determined based on the predicted atomic coordinates and the previously defined subsequence contact concept. Secondly, we fed the heavy chain and light chain sequences separately into BERT2DAb to obtain attention weights of all token pairs in each attention head of each hidden layer.
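A minimal sketch of this subsequence contact criterion, assuming the modeled Cβ coordinates are available as a NumPy array and subsequences are given as residue index ranges (all names below are illustrative):

```python
import numpy as np

def in_contact(cb_coords: np.ndarray, seg_a: range, seg_b: range,
               cutoff: float = 8.0) -> bool:
    """Two subsequences are 'in contact' if any pair of their residues has
    C-beta atoms closer than the cutoff (8 angstroms, CASP convention)."""
    a = cb_coords[list(seg_a)]                               # (len_a, 3)
    b = cb_coords[list(seg_b)]                               # (len_b, 3)
    dists = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return bool((dists < cutoff).any())

# Illustrative usage on hypothetical coordinates for a 120-residue chain,
# checking two subsequences spanning residues 10-16 and 95-101.
cb = np.random.rand(120, 3) * 40.0
print(in_contact(cb, range(10, 17), range(95, 102)))
```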

Pre-training of BERT2DAb

Introduction of secondary structure information

Secondary structures are represented as continuous segments on the primary sequence, and we split each secondary structure in the antibody sequence by inserting spaces. The following example explains how we introduced the secondary structure information before pre-training (). ‘VSGSPGQSVTISCTGTSSDVGAY’ represents a fragment of an antibody sequence. ProteinUnet can predict the secondary structure of each amino acid in the sequence.Citation45 Based on the prediction results, consecutive amino acids belonging to the same secondary structure type were treated as a segment and separated by spaces. The sequence fragment was split into VSGSPGQ, SVTISCT, GTSSDVG, and AY, which are equivalent to ‘words’ in English, so we refer to them as ‘sequence words’. We split the secondary structures for all sequences in the light and heavy chain datasets used for pre-training.
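A minimal sketch of this splitting step, assuming per-residue secondary structure labels (e.g., H/E/C) have already been predicted; the labels shown for the example fragment are illustrative, not the actual ProteinUnet output.

```python
from itertools import groupby

def split_by_secondary_structure(sequence: str, ss_labels: str) -> str:
    """Group consecutive residues that share a secondary-structure label
    into one 'sequence word', separated by spaces."""
    assert len(sequence) == len(ss_labels)
    words, start = [], 0
    for _, group in groupby(ss_labels):          # runs of identical labels
        length = len(list(group))
        words.append(sequence[start:start + length])
        start += length
    return " ".join(words)

# Hypothetical per-residue labels chosen to reproduce the split in the text.
seq = "VSGSPGQSVTISCTGTSSDVGAY"
ss  = "CCCCCCCEEEEEEEHHHHHHHCC"
print(split_by_secondary_structure(seq, ss))
# -> "VSGSPGQ SVTISCT GTSSDVG AY"
```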

Construction of vocabulary

To make each secondary structure a cohesive token in the embedded representation, and also to control the length of the vocabulary, we used the WordPiece algorithm to train separate vocabularies for light and heavy chain subsequences on the segmented secondary structure sequences.Citation46,Citation47 The main steps of WordPiece in this study are described below ().

Initializing the subsequence vocabulary

The initial subsequence vocabulary comprises 20 standard amino acids and identifiers required for pre-training (which are not involved in the iteration process).

Finding the subsequence pair with the highest score

We calculate the score of every candidate pair of subsequences using formula (1) and add the subsequence pair with the highest score to the vocabulary.

Iterating the second step

We continue iterating the second step until we reach the maximum vocabulary size set in advance. In this study, the subsequence vocabulary size was set to 40,000.

(1) Score = freq_of_pair / (freq_of_first_element × freq_of_second_element)
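A minimal sketch of a single scoring iteration over a toy corpus of space-separated ‘sequence words’, assuming WordPiece-style ‘##’ continuation markers; a production run would repeat this merge step (for example with the HuggingFace tokenizers library) until the 40,000-token vocabulary is reached.

```python
from collections import Counter

def best_wordpiece_pair(corpus_words):
    """One WordPiece iteration: score every adjacent token pair by
    freq(pair) / (freq(first) * freq(second)) and return the best-scoring pair."""
    word_freq = Counter(corpus_words)
    # Each sequence word starts as single amino acids; non-initial tokens get '##'.
    splits = {w: [w[0]] + ["##" + aa for aa in w[1:]] for w in word_freq}

    elem_freq, pair_freq = Counter(), Counter()
    for word, tokens in splits.items():
        f = word_freq[word]
        for t in tokens:
            elem_freq[t] += f
        for a, b in zip(tokens, tokens[1:]):
            pair_freq[(a, b)] += f

    def score(pair):
        first, second = pair
        return pair_freq[pair] / (elem_freq[first] * elem_freq[second])

    return max(pair_freq, key=score)

# Toy corpus of secondary-structure 'sequence words' (illustrative only).
toy_corpus = ["VSGSPGQ", "SVTISCT", "GTSSDVG", "SVTISCT", "GTSSDVG", "AY"]
print(best_wordpiece_pair(toy_corpus))
```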

Pre-training

We trained BERT2DAb_H and BERT2DAb_L using PyTorch 3.8 with a masked language model (MLM, as shown in ) as the self-supervised pre-training task. Both models were based on Bidirectional Encoder Representations from Transformers and had 12 hidden layers, each with 12 attention heads. The hidden layer embedding size was 768 and the maximum sequence length was 128 (we only encoded the variable region sequence, which typically comprises no more than 128 amino acids, and we used subsequences as tokens for embedding, so a maximum length of 128 is sufficient). The pre-training hyperparameters are listed in supplementary Table S4.
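A minimal sketch of the MLM masking step, assuming the standard BERT recipe in which 15% of non-special tokens are selected and, of those, 80% become [MASK], 10% a random token, and 10% are left unchanged; the 80/10/10 split is standard BERT practice and is assumed here rather than stated in the text.

```python
import torch

def mask_tokens(input_ids: torch.Tensor, mask_token_id: int, vocab_size: int,
                special_ids: set, mlm_prob: float = 0.15):
    """BERT-style masking for the MLM objective (assumed 80/10/10 split)."""
    labels = input_ids.clone()
    probs = torch.full(labels.shape, mlm_prob)
    for sid in special_ids:                      # never mask [CLS], [SEP], [PAD]
        probs[input_ids == sid] = 0.0
    selected = torch.bernoulli(probs).bool()
    labels[~selected] = -100                     # positions ignored by the loss

    # 80% of selected tokens -> [MASK]
    replace = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & selected
    input_ids[replace] = mask_token_id

    # half of the remainder (10% overall) -> random token; rest left unchanged
    randomize = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & selected & ~replace
    input_ids[randomize] = torch.randint(vocab_size, labels.shape)[randomize]
    return input_ids, labels
```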

Figure 5. Masked language model. Each sequence segmented based on the secondary structure was further divided into subsequences (tokens) based on the subsequence vocabulary using the principle of maximum length matching. Additionally, 15% of tokens in each sequence were randomly masked. After passing through the embedding layer, an input vector of shape (128,768) was obtained. Sequence representations (128,768) were generated through 12 transformer layers.

Benchmarks for BERT2DAb

We compared BERT2DAb with four other self-attention-based language models: esm2_t30_150M_UR50D, ProtBert, AbLang, and AntiBERTy, the former two being PLMs and the latter two being ALMs. Their parameters are 150 M, 430 M, 85 M, and 26 M, respectively. The main aim of this study is to explore a new method of incorporating secondary structure information for embedding representations. To control the impact of parameter count on model performance, we selected models for baseline testing that have a parameter count comparable to BERT2DAb. The esm2_t30_150M_UR50D, ProtBert, and AntiBERTy use the same model to represent light chain and heavy chain, so their parameter count will double during fine-tuning. We compared the performance of five language models across three downstream tasks, and in fine-tuning, all language model parameters were set to be trainable.

In the three downstream tasks, the input to the downstream classification or prediction network is as follows. Task 1 (classification of binding of mutant trastuzumab to HER2): the concatenated VH and VL embedding representation vectors. Task 2 (classification of binding of multiple antibodies to multiple coronavirus antigens): the concatenated VH embedding representation vector, VL embedding representation vector, and antigen sequence representation vector (length 147). Task 3 (prediction of ΔΔG after antibody mutation): the concatenated VH embedding, VL embedding, and antigen representation vectors of the wild-type antigen-antibody complex, concatenated with the corresponding VH, VL, and antigen vectors of the mutated complex. Depending on the pre-trained language model used, the lengths of the VH and VL representation vectors differ: 128 (BERT2DAb), 640 (esm2_t30_150M_UR50D), 1024 (ProtBert), 768 (AbLang), and 512 (AntiBERTy). BERT2DAb employs subsequence-based tokenization for embedding, while the other language models employ amino-acid-based tokenization. During fine-tuning, all models used the embedding representation of the last hidden layer output.

As our objective was to evaluate the performance of BERT2DAb and compare it with other language models, we used simple neural network layers to process the output of the language models for all three tasks. For task 1, we reduced the dimensionality of the antibody sequence embedding representation to 1 using a fully connected layer and used a sigmoid function to output the binding probability (P, where a mutant antibody is considered to bind HER2 when P > 0.5). For task 2, we concatenated the antibody sequence representation with the antigen feature vector and used a fully connected layer to reduce the dimensionality, followed by a Softmax to output the binary classification result. For task 3, the embedded representation vectors of the wild-type antigen-antibody complex and of the mutated antigen-antibody complex were each processed through two fully connected layers to reduce their dimensions. We computed the ΔΔG prediction by subtracting the embedding representation of the wild-type complex from that of the mutated complex and summing the resulting values.
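A minimal sketch of the task 1 head, assuming fixed-length VH and VL vectors (128-dimensional for BERT2DAb, as stated above) have already been pooled from the language model output; the class below and its pooling assumption are illustrative, not the exact module used in the study.

```python
import torch
import torch.nn as nn

class BindingClassifier(nn.Module):
    """Task 1 head (sketch): concatenate the VH and VL sequence representations
    and map them to a binding probability with one fully connected layer."""
    def __init__(self, vh_dim: int = 128, vl_dim: int = 128):
        super().__init__()
        self.fc = nn.Linear(vh_dim + vl_dim, 1)

    def forward(self, vh_emb: torch.Tensor, vl_emb: torch.Tensor) -> torch.Tensor:
        x = torch.cat([vh_emb, vl_emb], dim=-1)
        # P(bind); a mutant is called "binding" when P > 0.5
        return torch.sigmoid(self.fc(x)).squeeze(-1)

# Illustrative usage with random embeddings standing in for pooled BERT2DAb output.
clf = BindingClassifier()
p = clf(torch.randn(4, 128), torch.randn(4, 128))
print(p.shape)  # torch.Size([4])
```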

In all three downstream tasks, we split each dataset into train (80%), validation (10%), and test (10%) sets (supplementary Table S5), and initially conducted a hyperparameter search using the train and validation sets. Subsequently, we merged the train and validation sets, trained the models using the best hyperparameter combination obtained, and evaluated their performance on the test set. We used the Adam optimizer to iteratively optimize the parameters of the models. The hyperparameters searched were: training epochs, batch size, learning rate, Adam weight decay, dropout rate, and warm-up steps ratio. The results of the hyperparameter search are shown in supplementary Table S6. For the binary classification tasks (tasks 1 and 2), we evaluated the models based on F1-score, precision, recall, and accuracy. For the regression task (task 3), we evaluated the models based on Root Mean Square Error (RMSE) and Pearson correlation coefficient (Rp) (supplementary Formulas 3–7).
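A minimal sketch of the evaluation metrics (the exact definitions are given in supplementary Formulas 3–7); scikit-learn and SciPy are assumed here purely for illustration.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def classification_report(y_true, y_pred):
    """Metrics used for tasks 1 and 2 (binary classification)."""
    return {"F1": f1_score(y_true, y_pred),
            "Precision": precision_score(y_true, y_pred),
            "Recall": recall_score(y_true, y_pred),
            "Accuracy": accuracy_score(y_true, y_pred)}

def regression_report(ddg_true, ddg_pred):
    """Metrics used for task 3 (ddG regression): RMSE and Pearson Rp."""
    err = np.asarray(ddg_true) - np.asarray(ddg_pred)
    rp, _ = pearsonr(ddg_true, ddg_pred)
    return {"RMSE": float(np.sqrt(np.mean(err ** 2))), "Rp": float(rp)}
```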

Explore the relationship between attention weights and subsequence pair contacts

We trained BERT2DAb using a multi-head attention mechanism.Citation48 For any tokenized sequence S_T = (t_1, …, t_L) of length L, each attention head h ∈ {h_1, …, h_12} of each hidden layer l ∈ {l_1, …, l_12} in the model generates a set of attention weights for each token, and each w_{i,j} (0 < w_{i,j} < 1, Σ_{j=1}^{L} w_{i,j} = 1) represents the attention weight of t_i to t_j, that is, the degree of correlation between them.
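A minimal sketch of extracting these per-head attention weights with the HuggingFace transformers API; "w139700701/BERT2DAb_H" is a hypothetical checkpoint path standing in for the published repository.

```python
import torch
from transformers import BertModel, BertTokenizerFast

# Hypothetical checkpoint name; substitute the actual path on Hugging Face.
ckpt = "w139700701/BERT2DAb_H"
model = BertModel.from_pretrained(ckpt, output_attentions=True)
tokenizer = BertTokenizerFast.from_pretrained(ckpt)
model.eval()

# One heavy-chain V-region fragment, already split into secondary-structure words.
inputs = tokenizer("VSGSPGQ SVTISCT GTSSDVG AY", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# out.attentions: tuple of 12 layers, each of shape (batch, 12 heads, L, L)
layer, head = 6, 0                          # e.g. hidden layer l7, attention head h1
weights = out.attentions[layer][0, head]    # (L, L) matrix of w_ij for this head
print(weights.shape)
```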

To analyze the relationship between the interaction of pairs of subsequences and the corresponding attention weights of the token pairs, we carried out the steps indicated below.

Binary logistic regression

We conducted a binary logistic regression analysis on the contact relationship of pairs of subsequences (dependent variable) and the attention weights of the corresponding token pairs (independent variable) in 12 attention heads. The purpose was to determine the ability to predict the contact relationship of pairs of subsequences through attention weights, which reflects the model’s ability to learn spatial information of antibodies from sequences during pre-training. We used Accuracy, Sensitivity (predictive ability for contact samples), and Specificity (predictive ability for non-contact samples) to evaluate the performance of logistic regression. The calculation formulas are shown in equations (2)-(4). We obtained the True Positive value (TP), True Negative value (TN), False Positive value (FP), and False Negative value (FN) by comparing the prediction results of the test set with the actual labels.

(2) Accuracy = (TP + TN) / (TP + FN + FP + TN)
(3) Sensitivity = TP / (TP + FN)
(4) Specificity = TN / (TN + FP)
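A minimal sketch of this regression step, with placeholder attention-weight features and contact labels standing in for the values actually extracted from BERT2DAb and the IgFold models; scikit-learn is assumed for the logistic regression.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# X: attention weights of each subsequence pair in the 12 heads of one hidden layer.
# y: contact labels (1 = contact, 0 = non-contact). Both are random placeholders here.
X = np.random.rand(5000, 12)
y = (np.random.rand(5000) < 0.3).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

preds = clf.predict(X_te)
tp = int(((preds == 1) & (y_te == 1)).sum())
tn = int(((preds == 0) & (y_te == 0)).sum())
fp = int(((preds == 1) & (y_te == 0)).sum())
fn = int(((preds == 0) & (y_te == 1)).sum())
print("Accuracy", (tp + tn) / len(y_te),
      "Sensitivity", tp / (tp + fn),
      "Specificity", tn / (tn + fp))
```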

Contact correlation coefficient

To further investigate the relationship between attention weights and contact states of pairs of subsequences, we defined a ‘contact correlation coefficient’. This coefficient represents the proportion of token pairs with a contact relationship among the token pairs whose attention weight is greater than the threshold (t) in each attention head of each hidden layer. The thresholds used were 0.05, 0.1, 0.2, and 0.3. The contact correlation coefficient was averaged across the sequence level.
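A minimal sketch of the contact correlation coefficient for one attention head of one sequence; the attention and contact matrices below are randomly generated placeholders, and the real computation averages the coefficient over sequences for every layer, head, and threshold.

```python
import numpy as np

def contact_correlation(attn: np.ndarray, contact: np.ndarray, t: float) -> float:
    """Proportion of token pairs with attention weight above threshold t that are
    in contact in the modeled tertiary structure. Returns NaN if no pair exceeds t
    (shown as a blank cell in the heatmap)."""
    strong = attn > t                      # attn: (L, L) weights of one head
    if not strong.any():
        return float("nan")
    return float(contact[strong].mean())   # contact: (L, L) boolean matrix

# Illustrative usage on random matrices for a 30-token sequence.
L = 30
attn = np.random.dirichlet(np.ones(L), size=L)   # rows sum to 1, like attention
contact = np.random.rand(L, L) < 0.2
print(contact_correlation(attn, contact, t=0.05))
```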

Data and code availability

The OAS dataset that supports the pre-training in this study is available at https://opig.stats.ox.ac.uk/webapps/oas/.Citation38 The trastuzumab dataset that supports the classification of specific binding of mutant trastuzumab to HER2 is available from the GitHub repository: https://github.com/dahjan/DMS_opt/.Citation8 The CoV-AbDab dataset that supports the classification of specific binding of multiple antibodies to multiple coronavirus antigens is available at https://opig.stats.ox.ac.uk/webapps/covabdab/.Citation39 The AB-Bind dataset that supports the prediction of ΔΔG after antibody mutation is available from the GitHub repository: https://github.com/sarahsirin/AB-Bind-Database.Citation41 The Thera-SabDab dataset that supports the analysis of the relationship between the attention weights of the pre-trained models and the contact states of pairs of subsequences is available at https://opig.stats.ox.ac.uk/webapps/newsabdab/therasabdab/.Citation44

The pre-trained models and source data files for downstream task model training and data analyses in this study are provided at https://huggingface.co/w139700701.

The source code and analysis code for the study are available on GitHub: https://github.com/Xiaoxiao0606/BERT2DAb.

Author contributions

X.W.L. designed the study, implemented the code, performed the experiments, analyzed the results and wrote the paper. F.T. implemented the code and analyzed the results. W.B.Z. implemented the code and performed the experiments. X.W.Z. implemented the code and analyzed the results. J.Y.L. implemented the code and analyzed the result. J.L. performed the experiments and analyzed the results. D.S.Z. designed and supervised the study, analyzed the results and wrote the paper. All the authors revised the manuscript.

Acknowledgments

Our gratitude goes to the developers of datasets used in this study, including OAS, CoV-AbDab, AB-Bind, Thera-SabDab and the dataset of mutated Trastuzumab. Their excellent work and the public resources enable us to engage in this research.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Supplementary material

Supplemental data for this article can be accessed online at https://doi.org/10.1080/19420862.2023.2285904

Additional information

Funding

The author(s) reported there is no funding associated with the work featured in this article.

References

  • Shan S, Luo S, Yang Z, Hong J, Su Y, Ding F, Fu L, Li C, Chen P, Ma J, et al. Deep learning guided optimization of human antibody against SARS-CoV-2 variants with broad neutralization. Proc Natl Acad Sci USA. 2022;119(11):e2122954119. doi:10.1073/pnas.2122954119.
  • de Assis RR, Jain A, Nakajima R, Jasinskas A, Felgner J, Obiero JM, Norris PJ, Stone M, Simmons G, Bagri A, et al. Analysis of SARS-CoV-2 antibodies in COVID-19 convalescent blood using a coronavirus antigen microarray. Nat Commun. 2021;12(1):6. doi:10.1038/s41467-020-20095-2.
  • Ge J, Wang R, Ju B, Zhang Q, Sun J, Chen P, Zhang S, Tian Y, Shan S, Cheng L, et al. Antibody neutralization of SARS-CoV-2 through ACE2 receptor mimicry. Nat Commun. 2021;12(1):250. doi:10.1038/s41467-020-20501-9.
  • Zhao J, Nussinov R, Wu WJ, Ma B. In silico methods in antibody design. Antibodies. 2018;7(3):22. doi:10.3390/antib7030022.
  • Norman RA, Ambrosetti F, Bonvin AMJJ, Colwell LJ, Kelm S, Kumar S, Krawczyk K. Computational approaches to therapeutic antibody design: established methods and emerging trends. Brief Bioinform. 2020;21(5):1549–12. doi:10.1093/bib/bbz095.
  • Kuroda D, Shirai H, Jacobson MP, Nakamura H. Computer-aided antibody design. Protein Eng Des Sel. 2012;25(10):507–22. doi:10.1093/protein/gzs024.
  • Liang T, Chen H, Yuan J, Jiang C, Hao Y, Wang Y, Feng Z, Xie X-Q. IsAb: a computational protocol for antibody design. Brief Bioinform. 2021;22(5):bbab143. doi:10.1093/bib/bbab143.
  • Mason DM, Friedensohn S, Weber CR, Jordi C, Wagner B, Meng SM, Ehling RA, Bonati L, Dahinden J, Gainza P, et al. Optimization of therapeutic antibodies by predicting antigen specificity from antibody sequence via deep learning. Nat Biomed Eng. Published online 2021 April 15;5(6):600–12. doi:10.1038/s41551-021-00699-9.
  • Khurana S, Rawi R, Kunji K, Chuang GY, Bensmail H, Mall R, Valencia A. DeepSol: a deep learning framework for sequence-based protein solubility prediction. Bioinformatics. 2018;34(15):2605–13. doi:10.1093/bioinformatics/bty166.
  • Rawi R, Mall R, Kunji K, Shen CH, Kwong PD, Chuang GY, Hancock J. PaRSnIP: sequence-based protein solubility prediction using gradient boosting machine. Bioinformatics. 2018;34(7):1092–98. doi:10.1093/bioinformatics/btx662.
  • Chen X, Dougherty T, Hong C, Schibler R, Zhao YC, Sadeghi R, Matasci N, Wu YC, Kerman I. Predicting antibody developability from sequence using machine learning. bioRxiv. Published online 2020 June 20. doi:10.1101/2020.06.18.159798.
  • Biswas S, Khimulya G, Alley EC, Esvelt KM, Church GM. Low-N protein engineering with data-efficient deep learning. Nat Methods. 2021;18(4):389–96. doi:10.1038/s41592-021-01100-y.
  • Alley EC, Khimulya G, Biswas S, AlQuraishi M, Church GM. Unified rational protein engineering with sequence-based deep representation learning. Nat Methods. 2019;16(12):1315–22. doi:10.1038/s41592-019-0598-1.
  • Brandes N, Ofer D, Peleg Y, Rappoport N, Linial M, Martelli PL. ProteinBERT: a universal deep-learning model of protein sequence and function. Bioinformatics. 2022;38(8):2102–10. doi:10.1093/bioinformatics/btac020.
  • Elnaggar A, Heinzinger M, Dallago C, Rihawi G, Wang Y, Jones L, Gibbs T, Feher T, Angerer C, Steinegger M, et al. ProtTrans: towards cracking the language of life's code through self-supervised deep learning and high performance computing. IEEE Trans Pattern Anal Mach Intell. Published online 2021:1–1. doi:10.1109/TPAMI.2021.3095381.
  • Lin Z, Akin H, Rao R, Hie B, Zhu Z, Lu W, Smetanin N, Verkuil R, Kabeli O, Shmueli Y, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023;379(6637):1123–30. doi:10.1126/science.ade2574.
  • Murphy K, Weaver C. Janeway’s immunobiology. Garland science. 2016.
  • Ruffolo JA, Gray JJ, Sulam J. Deciphering antibody affinity maturation with language models and weakly supervised learning. Published online 2021 December 14 [Accessed 2022 November 9]. http://arxiv.org/abs/2112.07782
  • Olsen TH, Moal IH, Deane CM, Lengauer T. AbLang: an antibody language model for completing antibody sequences. Bioinform Adv. 2022;2(1):vbac046. doi:10.1093/bioadv/vbac046.
  • Gao X, Cao C, Lai L . Pre-training with a rational approach for antibody. 2023. https://www.biorxiv.org/content/10.1101/2023.01.19.524683v2.abstract.
  • Ruffolo JA, Chu LS, Mahajan SP, Gray JJ. Fast, accurate antibody structure prediction from deep learning on massive set of natural antibodies. Nat Commun. 2023;14(1):2389. doi:10.1038/s41467-023-38063-x.
  • Prytuliak R. Recognition of short functional motifs in protein sequences. 2018. https://edoc.ub.uni-muenchen.de/22474/.
  • Asgari E, McHardy AC, Mofrad MRK. Probabilistic variable-length segmentation of protein sequences for discriminative motif discovery (DiMotif) and sequence embedding (ProtVecx). Sci Rep. 2019;9(1):3577. doi:10.1038/s41598-019-38746-w.
  • Totrov M, Dash C. Estimated secondary structure propensities within V1/V2 region of HIV gp120 are an important global antibody neutralization sensitivity determinant. PLoS ONE. 2014;9(4):e94002. doi:10.1371/journal.pone.0094002.
  • Roig X, Novella IS, Giralt E, Andreu D. Examining the relationship between secondary structure and antibody recognition in immunopeptides from foot-and-mouth disease virus. Lett Pept Sci. 1994;1(1):39–49. doi:10.1007/BF00132761.
  • Saini S, Agarwal M, Pradhan A, Pareek S, Singh AK, Dhawan G, Dhawan U, Kumar Y. Exploring the role of framework mutations in enabling breadth of a cross-reactive antibody (CR3022) against the SARS-CoV-2 RBD and its variants of concern. J Biomol Struct Dyn. 2023;41(6):2341–54. doi:10.1080/07391102.2022.2030800.
  • Zhang Y, Tiňo P, Leonardis A, Tang K. A survey on neural network interpretability. IEEE Trans Emerg Top Comput Intell. 2021;5(5):726–42. doi:10.1109/TETCI.2021.3100641.
  • Vig J, Madani A, Varshney LR, Xiong C, Socher R, Rajani NF. BERTology meets biology: interpreting attention in protein language models. Published online 2021 March 28 [Accessed 2022 November 10]. http://arxiv.org/abs/2006.15222.
  • Van der Maaten L, Hinton G. Visualizing data using t-SNE. J Mach Learn Res. 2008;9(11):2579–2605.
  • Kollman PA, Massova I, Reyes C, Kuhn B, Huo S, Chong L, Lee M, Lee T, Duan Y, Wang W, et al. Calculating structures and free energies of complex molecules: combining molecular mechanics and continuum models. Acc Chem Res. 2000;33(12):889–97. doi:10.1021/ar000033j.
  • Gilson MK, Zhou HX. Calculation of protein-ligand binding affinities. Annu Rev Biophys Biomol Struct. 2007;36(1):21–42. doi:10.1146/annurev.biophys.36.040306.132550.
  • Wang M, Cang Z, Wei GW. A topology-based network tree for the prediction of protein–protein binding affinity changes following mutation. Nat Mach Intell. 2020;2(2):116–23. doi:10.1038/s42256-020-0149-6.
  • Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Žídek A, Potapenko A, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596(7873):583–89. doi:10.1038/s41586-021-03819-2.
  • Al-Lazikani B, Lesk AM, Chothia C. Standard conformations for the canonical structures of immunoglobulins. J Mol Biol. 1997;273(4):927–48. doi:10.1006/jmbi.1997.1354.
  • Iuchi H, Matsutani T, Yamada K, Iwano N, Sumi S, Hosoda S, Zhao S, Fukunaga T, Hamada M. Representation learning applications in biological sequence analysis. Comput Struct Biotechnol J. 2021;19:3198–208. doi:10.1016/j.csbj.2021.05.039.
  • Leem J, Mitchell LS, Farmery JHR, Barton J, Galson JD. Deciphering the language of antibodies using self-supervised learning. Patterns. 2022;3(7):100513. doi:10.1016/j.patter.2022.100513.
  • Liberis E, Velickovic P, Sormanni P, Vendruscolo M, Lio P, Hancock J. Parapred: antibody paratope prediction using convolutional and recurrent neural networks. Bioinformatics. 2018;34(17):2944–50. doi:10.1093/bioinformatics/bty305.
  • Olsen TH, Boyles F, Deane CM. Observed antibody space: a diverse database of cleaned, annotated, and translated unpaired and paired antibody sequences. Protein Sci. Published online 2021 October 29;31(1):141–46. doi:10.1002/pro.4205.
  • Raybould MIJ, Kovaltsuk A, Marks C, Deane CM, Wren J. CoV-AbDab: the coronavirus antibody database. Bioinformatics. 2021;37(5):734–35. doi:10.1093/bioinformatics/btaa739.
  • Dong J, Yao ZJ, Zhang L, Luo F, Lin Q, Lu A-P, Chen AF, Cao D-S. PyBioMed: a python library for various molecular representations of chemicals, proteins and DNAs and their interactions. J Cheminformat. 2018;10(1):16. doi:10.1186/s13321-018-0270-2.
  • Sirin S, Apgar JR, Bennett EM, Keating AE. AB‐bind: antibody binding mutational database for computational affinity predictions. Protein Sci. 2016;25(2):393–409. doi:10.1002/pro.2829.
  • Rao R, Bhattacharya N, Thomas N, Duan Y, Chen P, Canny J, Abbeel P, Song Y, et al. Evaluating protein transfer learning with TAPE. Advances in Neural Information Processing Systems 32; 2019.
  • Ezkurdia I, Graña O, Izarzugaza JMG, Tress ML. Assessment of domain boundary predictions and the prediction of intramolecular contacts in CASP8: CASP8 domain and contact assessment. Proteins Struct Funct Bioinforma. 2009;77(S9):196–209. doi:10.1002/prot.22554.
  • Raybould MIJ, Marks C, Lewis AP, Shi J, Bujotzek A, Taddese B, Deane CM. Thera-SAbDab: the therapeutic structural antibody database. Nucleic Acids Res. 2020;48(D1):D383–D88. doi:10.1093/nar/gkz827.
  • Kotowski K, Smolarczyk T, Roterman‐Konieczna I, Stapor K. ProteinUnet—an efficient alternative to SPIDER3-single for sequence-based prediction of protein secondary structures. J Comput Chem. 2021;42(1):50–59. doi:10.1002/jcc.26432.
  • Shibata Y, Kida T, Fukamachi S, Takeda M, Shinohara A, Shinohara T, Arikawa S. Byte pair encoding: a text compression scheme that accelerates pattern matching. Published online 1999.
  • Wu Y, Schuster M, Chen Z. Google’s neural machine translation system: bridging the gap between human and machine translation. Published online 2016 October 8 [Accessed 2022 November 9]. http://arxiv.org/abs/1609.08144.
  • Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. Attention is all you need. Advances in Neural Information Processing Systems 30. 2017.