Applying Protein Language Models Using Limited Dataset. Sequence-Based Hot Spot Prediction in Protein Interactions Using AutoGluon

Published: 2024-01-05

Author:

Sargsyan Karen¹, Lim Carmay¹

Affiliation:

1. Institute of Biomedical Sciences, Academia Sinica

Abstract

Background: Protein language models, inspired by the success of large language models in deciphering human language, have emerged as powerful tools for unraveling the intricate code of life inscribed within protein sequences. They have gained significant attention for their promising applications across various areas, including the sequence-based prediction of secondary and tertiary protein structure, the discovery of new functional protein sequences/folds, and the assessment of mutational impact on protein fitness. However, their utility in learning to predict protein residue properties from scant datasets, such as protein-protein interaction (PPI)-hot spots, whose mutations significantly impair PPIs, remains unclear. Here, we explore the feasibility of using representations learned by protein language models as features for machine learning to predict PPI-hot spots, using a dataset containing 414 experimentally confirmed PPI-hot spots and 504 PPI-nonhot spots.

Results: Our findings showcase the capacity of unsupervised learning with protein language models to capture critical functional attributes of protein residues derived from the evolutionary information encoded within amino acid sequences. We show that methods relying on protein language models can compete with methods employing sequence- and structure-based features to predict PPI-hot spots from the free protein structure. We observed an optimal number of features for model precision, suggesting a balance between information and overfitting.

Conclusions: This study underscores the potential of transformer-based protein language models to extract critical knowledge from sparse datasets, exemplified here by the challenging task of predicting PPI-hot spots. These models offer a cost-effective and time-efficient alternative to traditional experimental methods for predicting certain residue properties. However, the challenge of explaining the importance of specific features in determining residue properties remains.

Publisher

Research Square Platform LLC
