
BERT2DAb: a pre-trained model for antibody representation based on amino acid sequences and 2D-structure

Article: 2285904 | Received 25 Jun 2023, Accepted 16 Nov 2023, Published online: 27 Nov 2023
 

ABSTRACT

Prior research has generated a vast number of antibody sequences, enabling the pre-training of language models on amino acid sequences to improve the efficiency of antibody screening and optimization. However, fewer pre-trained language models are available for antibody sequences than for proteins. Additionally, existing pre-trained models rely solely on amino acid or k-mer embeddings and do not explicitly account for secondary structure features. Here, we present a new pre-trained model called BERT2DAb. This model incorporates secondary structure information based on self-attention to learn representations of antibody sequences. Our model achieves state-of-the-art performance on three downstream tasks: two antigen-antibody binding classification tasks (precision: 85.15%/94.86%; recall: 87.41%/86.15%) and one antigen-antibody complex mutation binding free energy prediction task (Pearson correlation coefficient: 0.77). Moreover, we propose a novel method to analyze the relationship between attention weights and the contact states of pairs of subsequences in tertiary structures, which enhances the interpretability of BERT2DAb. Overall, our model demonstrates strong potential for improving antibody screening and design through downstream applications.
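As a concrete illustration of the attention-contact analysis described above, the snippet below shows one way to extract per-head self-attention weights from a BERT-style checkpoint with the Hugging Face transformers library and correlate them with a binary contact map. This is a minimal sketch, not the authors' exact pipeline: the repository name, the example sequence, the raw-string tokenization (BERT2DAb actually tokenizes secondary-structure-based subsequences), and the random placeholder contact map are all assumptions.

```python
# Minimal sketch: self-attention weights vs. a contact map.
# Assumptions (not from the paper): the repository name below, the example
# sequence, raw-string tokenization, and the placeholder contact map.
import torch
from transformers import AutoTokenizer, AutoModel
from scipy.stats import pearsonr

MODEL_ID = "w139700701/BERT2DAb"  # hypothetical repo under the authors' namespace

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID, output_attentions=True)
model.eval()

sequence = "EVQLVESGGGLVQPGGSLRLSCAAS"  # example heavy-chain fragment
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per layer, each of shape
# (batch, heads, seq_len, seq_len); average the heads of the last layer.
attn = outputs.attentions[-1].mean(dim=1)[0]

# In the paper, contact states come from tertiary structures; here a
# random binary matrix stands in so the sketch runs end to end.
contact_map = torch.randint(0, 2, attn.shape).float()

r, _ = pearsonr(attn.flatten().numpy(), contact_map.flatten().numpy())
print(f"Pearson r between attention weights and contacts: {r:.3f}")
```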

Acknowledgments

Our gratitude goes to the developers of the datasets used in this study, including OAS, CoV-AbDab, AB-Bind, Thera-SabDab, and the mutated Trastuzumab dataset. Their excellent work and public resources made this research possible.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Data and code availability

The OAS dataset that supports the pre-training in this study is available at https://opig.stats.ox.ac.uk/webapps/oas/ [38]. The Trastuzumab dataset that supports the classification of specific binding of mutant Trastuzumab to HER2 is available from the GitHub repository https://github.com/dahjan/DMS_opt/ [8]. The CoV-AbDab dataset that supports the classification of specific binding of multiple antibodies to multiple coronavirus antigens is available at https://opig.stats.ox.ac.uk/webapps/covabdab/ [39]. The AB-Bind dataset that supports the prediction of ΔΔG after antibody mutation is available from the GitHub repository https://github.com/sarahsirin/AB-Bind-Database [41]. The Thera-SabDab dataset that supports the analysis of the relationship between the attention weights of the pre-trained model and the contact states of pairs of subsequences is available at https://opig.stats.ox.ac.uk/webapps/newsabdab/therasabdab/ [44].

The pre-trained model and the source data files for downstream task model training and data analyses in this study are available at https://huggingface.co/w139700701.
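For readers who want to try the published checkpoint, the following sketch shows how a BERT-style model hosted on the Hugging Face Hub is typically loaded and used to produce a fixed-size sequence embedding for a downstream binding classifier. The repository name is a hypothetical placeholder; consult the namespace above for the actual model card, and note that the real tokenizer expects BERT2DAb's secondary-structure-based subwords rather than a raw amino acid string.

```python
# Hedged sketch: loading the checkpoint and pooling token embeddings.
# The repository name and example sequence are hypothetical placeholders.
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_ID = "w139700701/BERT2DAb"  # hypothetical; see the namespace above

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

heavy_chain = "EVQLVESGGGLVQPGGSLRLSCAAS"  # example fragment, not from the paper
inputs = tokenizer(heavy_chain, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, seq_len, hidden_dim)

# Mean-pool over tokens to obtain one vector per sequence; such vectors
# can feed a task-specific head, e.g. an antigen-antibody binding classifier.
embedding = hidden.mean(dim=1).squeeze(0)
print(embedding.shape)
```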

The source code and the analysis code for this study are available on GitHub: https://github.com/Xiaoxiao0606/BERT2DAb.

Author contributions

X.W.L. designed the study, implemented the code, performed the experiments, analyzed the results, and wrote the paper. F.T. implemented the code and analyzed the results. W.B.Z. implemented the code and performed the experiments. X.W.Z. implemented the code and analyzed the results. J.Y.L. implemented the code and analyzed the results. J.L. performed the experiments and analyzed the results. D.S.Z. designed and supervised the study, analyzed the results, and wrote the paper. All authors revised the manuscript.

Supplementary material

Supplemental data for this article can be accessed online at https://doi.org/10.1080/19420862.2023.2285904

Additional information

Funding

The author(s) reported there is no funding associated with the work featured in this article.