Designer Proteins from Italy 🧬
So essentially,
LLMs can be used to generate proteins
Paper: Design Proteins Using Large Language Models: Enhancements and Comparative Analyses (14 Pages)
Researchers from the University of Siena, Italy, explore the use of Large Language Models in bioinformatics, specifically for protein generation.
Hmm... What’s the background?
The protein alphabet consists of 20 amino acids, each represented by a character, forming sequences akin to letters in words. Protein sequences, like natural language, have directionality and often reuse modular elements with slight variations. Protein motifs and domains, the basic building blocks, resemble words and phrases in human language. This analogy suggests that LLMs, adept at handling sequential data, could be used to generate amino acid chains, or proteins.
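The "protein as language" analogy above can be made concrete with a minimal sketch: a protein is just a string over a 20-letter amino-acid alphabet, which an LLM can tokenize character by character. The sequence fragment and function names below are purely illustrative, not from the paper.

```python
# The 20 standard amino acids, each written as a single character --
# the "alphabet" that protein sequences are spelled in.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

def is_valid_protein(seq: str) -> bool:
    """Return True if every character is a standard amino-acid letter."""
    return len(seq) > 0 and all(ch in AMINO_ACIDS for ch in seq.upper())

def tokenize(seq: str) -> list[str]:
    """Character-level tokenization, one token per residue."""
    return list(seq.upper())

fragment = "MKTAYIAKQR"  # toy fragment, purely illustrative
print(is_valid_protein(fragment))  # True
print(tokenize(fragment)[:3])      # ['M', 'K', 'T']
```

Real protein LLMs typically use exactly this kind of per-residue tokenization, which is what lets models built for natural language transfer to amino acid chains.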
Ok, So what is proposed in the research paper?
The researchers focus on using medium-sized LLMs (7-8 billion parameters) like Mistral-7B, Llama-2-7B, Llama-3-8B, and gemma-7B to generate high-quality protein sequences.
The researchers hypothesize that these models, even when fine-tuned on relatively small datasets, can produce viable protein sequences.
They fine-tune these LLMs on a dataset of human protein sequences that is considerably smaller than those used to train established protein-focused models.
The fine-tuned models are then compared against established protein language models like ProGen, ProtGPT2, and ProLLaMA, which are trained on much larger datasets.
Among the fine-tuned models, P-Mistral consistently outperforms the others across various evaluation metrics (pLDDT, RMSD, TM-score, REU).
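To give a feel for one of the structural metrics listed above, here is a hedged sketch of RMSD (root-mean-square deviation) between two already-aligned sets of atom coordinates. Real evaluation pipelines first superpose the structures (e.g. via the Kabsch algorithm) before measuring deviation; the coordinate arrays below are toy data, not results from the paper.

```python
import numpy as np

def rmsd(coords_a: np.ndarray, coords_b: np.ndarray) -> float:
    """RMSD between two (N, 3) coordinate arrays of matched atoms.

    Assumes the structures are already optimally superposed.
    """
    assert coords_a.shape == coords_b.shape
    diff = coords_a - coords_b
    # Per-atom squared distance, averaged over atoms, then square-rooted.
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))

# Toy example: four atoms, each displaced by (1, 1, 1).
a = np.zeros((4, 3))
b = np.ones((4, 3))
print(round(rmsd(a, b), 3))  # sqrt(3) ~= 1.732
```

A lower RMSD between a generated protein's predicted structure and a known reference indicates a closer structural match, which is why it pairs naturally with TM-score and pLDDT in this kind of comparison.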
What’s next?
Future work will focus on developing techniques to steer the generation process toward proteins that meet specific criteria, enhancing the practical utility of these models in fields like drug design and synthetic biology.
So essentially,
LLMs can be used to generate proteins
Learned something new? Consider sharing with your friends!