Protein Language Models (Part 1): Introduction & Datasets
I’m experimenting with Lilian Weng-style review blog posts on topics I wish I’d had someone explain to me when I got started in protein machine learning.
Following the success of large sequence models in language and other domains, protein machine learning folks have been building and training language models on protein sequence data. (The terminology people use here is extremely confusing: “large language models” often denotes large sequence models trained on language and other modalities, or in some cases just other modalities. In the rest of the post, I’ll use “large language models” (LLMs) to refer to the cluster of ideas around large-scale pretraining combined with simple objectives, as opposed to models that focus on language as a modality, and “protein language models” (PLMs) to refer to large sequence models of proteins.) While the category boundaries for what exactly constitutes a PLM are necessarily fuzzy, PLMs use LLM-inspired pretraining strategies and architectures to model the distribution of protein sequences.
This post is the first in a five part series on Protein Language Models. The other posts are (missing links will be added upon publishing):
- Protein Language Models (Part 1): Introduction & Datasets
- Protein Language Models (Part 2): Models
- Protein Language Models (Part 3): Benchmarks & Evaluation
- Protein Language Models (Part 4): Scaling
- Protein Language Models (Part 5): FAQ & Conclusion
This post covers the datasets used for training PLMs.
Background
This series is not intended to introduce language models, transformers, or scaling from scratch. If you’re looking for writing that does that, here are some resources I’ve found helpful:
Transformers
Language Models
Scaling
Datasets
Sequence
Protein language models are trained using unsupervised objectives on protein sequence data. For general-purpose PLMs, the most commonly used datasets are currently UniRef and BFD. Some papers have used or tested MGnify and Metaclust, but UniRef and BFD dominate.
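To make “unsupervised objectives” a bit more concrete, here is a minimal sketch of BERT-style masking applied to an amino-acid sequence. The vocabulary, 15% masking rate, and example sequence are illustrative choices, not any particular model’s actual recipe (real PLMs also use special tokens and more elaborate corruption schemes).

```python
import random

# 20 standard amino acids plus a mask token; real PLMs also add
# padding/start/end tokens and sometimes rare/ambiguous residue tokens.
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
VOCAB = ["<mask>"] + AMINO_ACIDS
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}

def mask_sequence(seq: str, mask_rate: float = 0.15, seed: int = 0):
    """Return (input_ids, labels) for a BERT-style masked-LM example.

    Masked positions keep their true residue id in `labels`;
    unmasked positions get -100 so the loss ignores them.
    """
    rng = random.Random(seed)
    input_ids, labels = [], []
    for residue in seq:
        true_id = TOKEN_TO_ID[residue]
        if rng.random() < mask_rate:
            input_ids.append(TOKEN_TO_ID["<mask>"])
            labels.append(true_id)
        else:
            input_ids.append(true_id)
            labels.append(-100)  # ignored by cross-entropy
    return input_ids, labels

input_ids, labels = mask_sequence("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
```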
UniRef (UniProt Reference Clusters) provides clustered sequences from the UniProt Knowledgebase. Around 95% of the sequences in UniProt are derived from coding sequences, and UniProt mostly excludes metagenomic data. UniRef includes three datasets – UniRef100, UniRef90, and UniRef50 – which correspond to clustering UniProt and UniParc at 100% (de-duplication and combining identical fragments), 90%, and 50% sequence identity.
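For concreteness, here is a small sketch of pulling cluster metadata out of a UniRef-style FASTA header. The field names (n=, TaxID=, RepID=) follow UniProt’s documented header convention, but treat this as an illustrative parser rather than a validated one.

```python
import re

# Example UniRef-style FASTA header (format per UniProt's documentation).
header = (">UniRef50_Q6GZX4 Putative transcription factor 001R "
          "n=2 Tax=Frog virus 3 TaxID=654924 RepID=001R_FRG3G")

def parse_uniref_header(line: str) -> dict:
    """Pull the cluster id, member count, and taxonomy out of a UniRef header."""
    cluster_id = line[1:].split()[0]  # e.g. "UniRef50_Q6GZX4"
    n_members = re.search(r"\bn=(\d+)", line)
    tax_id = re.search(r"\bTaxID=(\d+)", line)
    rep_id = re.search(r"\bRepID=(\S+)", line)
    return {
        "cluster_id": cluster_id,
        "n_members": int(n_members.group(1)) if n_members else None,
        "tax_id": int(tax_id.group(1)) if tax_id else None,
        "rep_id": rep_id.group(1) if rep_id else None,
    }

print(parse_uniref_header(header))
```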
BFD (literally the “Big Fantastic Database”) is a database of clustered protein sequences that includes sequences from UniProt as well as Metaclust and the Soil Reference Catalog and Marine Eukaryotic Reference Catalog assembled with Plass. This means that BFD has a lot of overlap with UniRef but also contains substantially more metagenomic data; in total, BFD contains ~10x more proteins than UniRef.
As mentioned, MGnify and Metaclust only include metagenomic sequences. They’re less popular presumably because BFD gives good coverage of their sequence space. In fact, the RITA paper found that training on UniRef100 produced better perplexity on a held-out set generated from a mix of UniRef100, Metaclust, and MGnify than training on either of the other two did.
Clustering
Protein databases such as UniRef offer versions clustered by sequence identity. The idea is that for some (non-ML and) ML use-cases we may only want representative examples of proteins within certain families and genera rather than lots of similar variants. To support this, UniRef offers versions of its database clustered down to 50% and 90% identity in addition to the 100% one (which just removes duplicates). If training PLMs on clustered datasets produces results as good or better, that would be a big win: we’d save training compute without degrading the quality of our results. Because of this, various papers have looked at the impact of clustering on performance.
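To build intuition for what clustering at an identity threshold does to a dataset, here is a toy greedy clustering sketch. Real pipelines use alignment-based tools like MMseqs2 or CD-HIT; the difflib ratio below is only a crude stand-in for sequence identity.

```python
from difflib import SequenceMatcher

def crude_identity(a: str, b: str) -> float:
    # Stand-in for alignment-based sequence identity; real pipelines
    # (MMseqs2, CD-HIT) align sequences properly before computing this.
    return SequenceMatcher(None, a, b).ratio()

def greedy_cluster(seqs: list[str], threshold: float = 0.5) -> list[list[str]]:
    """Assign each sequence to the first cluster whose representative it
    matches above `threshold`; otherwise start a new cluster."""
    clusters: list[list[str]] = []
    # Longest-first, so representatives tend to cover their members,
    # loosely mirroring how cluster representatives get picked.
    for seq in sorted(seqs, key=len, reverse=True):
        for cluster in clusters:
            if crude_identity(seq, cluster[0]) >= threshold:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])
    return clusters

representatives = [c[0] for c in greedy_cluster(["MKTAYIAKQR", "MKTAYIAKQK", "GSHMDDDDK"])]
```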
The ESM-1v authors evaluated the impact of different sequence identity clustering thresholds on zero-shot fitness prediction performance. (“Zero-shot” here refers to the fact that the model’s predictions are made without ever having seen labels for the benchmark. Instead, we typically use a model’s likelihood or pseudo-likelihood (for masked language models) as a prediction, under the assumption that likelihood will be a good general proxy for fitness. We’ll discuss zero-shot benchmarks and the likelihood-to-fitness assumption more in the evaluation post.) They found 50-90% to be the sweet spot, with models trained on datasets using higher or lower thresholds performing worse. In their results, the model trained with the 90% clustering threshold has the highest average performance on downstream tasks, with the 70% and 50% models coming in second and third respectively, followed by a big drop-off for the 30% and 100% models. Weirdly, the 100% model actually achieves performance close to the 50% model’s earlier in training, but then its performance degrades past ~50,000 updates. This is surprising because, if anything, I’d expect the 100% model to climb the performance scale more slowly and be the least likely to overfit, given that it takes the most updates per epoch. Clearly, my intuition is off here though!
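As a concrete illustration of the pseudo-likelihood idea, here is a sketch of zero-shot scoring with a masked language model via HuggingFace Transformers. The small ESM-2 checkpoint is an illustrative stand-in I’m assuming here; the ESM-1v paper uses its own codebase and scoring variants, but the mask-each-position-and-sum-log-probs logic is the same basic idea.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Small public ESM-2 checkpoint, used purely for illustration.
MODEL_NAME = "facebook/esm2_t6_8M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME).eval()

@torch.no_grad()
def pseudo_log_likelihood(sequence: str) -> float:
    """Sum over positions of log P(true residue | rest of sequence)."""
    ids = tokenizer(sequence, return_tensors="pt")["input_ids"]
    total = 0.0
    # Skip the special tokens at the start and end of the encoding.
    for pos in range(1, ids.shape[1] - 1):
        masked = ids.clone()
        masked[0, pos] = tokenizer.mask_token_id
        logits = model(masked).logits
        log_probs = torch.log_softmax(logits[0, pos], dim=-1)
        total += log_probs[ids[0, pos]].item()
    return total

# Higher pseudo-log-likelihood is taken as a proxy for higher fitness.
wild_type = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
mutant = wild_type[:5] + "W" + wild_type[6:]
print(pseudo_log_likelihood(mutant) - pseudo_log_likelihood(wild_type))
```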
More broadly, results so far have been mixed on whether clustering before training produces better results. The ProtTrans and Ankh authors tentatively concluded that training on UniRef50 produced their best models, but their overall results on the relationship between data and downstream task performance were more mixed. On the other hand, the RITA authors chose UniRef100 under the hypothesis that clustering reduces the information given to the model, but seemingly didn’t have the opportunity to test this assumption. ProGen2 used UniRef90 combined with BFD but didn’t test different clusterings. Tranception also tested different clustering thresholds and found that their UniRef100 models performed best.
Tranception Table 6: Performance at different model sizes on different versions of UniRef.
Given that LLM researchers have put thousands of cumulative research hours into cleaning, mixing, and weighting LLM datasets, our prior should be that these decisions matter a lot. Increasingly, the insights from this type of work are treated as trade secrets, but both anecdotal and hard evidence point to dataset curation continuing to matter a great deal for LLM performance. One thing that could change the game here is if methods in the space of active learning and/or reinforcement learning allow models to guide their own training and data selection. There are some interesting initial results in this direction, but as far as I know, no major public results with SoTA performance on benchmarks (admittedly a lagging indicator). For now, the takeaways for PLMs remain ambiguous. Preliminarily, it seems like autoregressive models may do better without clustering whereas masked language models do best with it, but this conclusion comes from something closer to an informal meta-analysis than a single controlled experiment, and so could easily change if someone compared everything carefully head-to-head. All we can really say with confidence is that “more work is needed.”
Structure
AlphaFold 2 and RoseTTAFold both take multiple sequence alignments (MSAs) as inputs for structure prediction. Several more recent folding models – ESMFold, OmegaFold, and RGN2 – achieve comparable or nearly as good performance while replacing MSA retrieval with sequence embeddings as the first step of their models. For training the structure module, these models all use a very similar subset of the structures in the Protein Data Bank (PDB).
Another source of structure data is the Class, Architecture, Topology, Homologous superfamily (CATH) database. CATH contains 151 million protein domains classified into 5,841 superfamilies (at time of writing). More relevant here, it contains a set of ~16,000 sequence/structure protein domain pairs that have been used both for evaluating PLMs and for training “inverse folding” models.
After the development of AlphaFold, DeepMind and EMBL worked together to generate predicted structures for 213M proteins in UniProt, resulting in AlphaFold DB. These predicted structures have since been used by “inverse folding” models like ESM-IF and ProstT5 to provide paired structure-sequence data for training, and I expect more models will use them for similar purposes in the future. While using them comes with a risk of learning the wrong things from incorrectly predicted structures, initial indications from these models are that this risk can be mitigated.
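As an example of how accessible this data is, here is a sketch of fetching a single predicted structure from AlphaFold DB by UniProt accession. The URL pattern and “model_v4” suffix match the file layout at the time of writing, but treat them as assumptions that may change in future database releases.

```python
import urllib.request

def fetch_alphafold_structure(uniprot_accession: str, out_path: str) -> None:
    """Download one predicted structure from AlphaFold DB for a UniProt entry.

    The URL pattern (and the 'model_v4' suffix) reflects the current
    AlphaFold DB file layout and may change in future versions.
    """
    url = (
        "https://alphafold.ebi.ac.uk/files/"
        f"AF-{uniprot_accession}-F1-model_v4.pdb"
    )
    urllib.request.urlretrieve(url, out_path)

# Example: human hemoglobin subunit alpha.
fetch_alphafold_structure("P69905", "AF-P69905.pdb")
```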
Function
The post on models will discuss conditional models, and the one on evaluation will talk about predicting function as a downstream evaluation. Here, we briefly discuss the datasets used for both.
When training conditional models, we use some sort of label as either a prediction target or a conditioning token. The closest analogue to sequence and structure for function, in terms of generality, is taxonomy. For taxonomic labels, papers have previously used NCBI’s taxonomy database and Pfam.
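To make “conditioning token” concrete, here is a hypothetical sketch of formatting a taxonomy-conditioned training example in the spirit of ProGen-style control tags. The tag scheme and helper function are made up for illustration, not any published model’s actual format.

```python
# Hypothetical conditioning scheme: prepend control tags describing the
# taxonomic lineage so the model learns family-conditional distributions.
# The tag vocabulary here is invented for illustration.
def to_conditional_example(sequence: str, lineage: list[str]) -> str:
    tags = "".join(f"<{rank}>" for rank in lineage)
    return f"{tags}{sequence}"

example = to_conditional_example(
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    lineage=["Bacteria", "Proteobacteria", "Escherichia"],
)
# -> '<Bacteria><Proteobacteria><Escherichia>MKTAYIAKQ...'
```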
Datasets for more specific, quantitative functional labels vary much more by application and tend to be smaller. The FLIP paper collected labeled datasets for the purpose of evaluating fitness prediction. Other papers use synthetic datasets generated with in silico tools, although these tend to be more for demonstration than for downstream use. For antibodies, there are more labeled datasets, such as the one used here, but covering them in detail would require its own post.
Conclusion
To summarize, most PLMs currently rely first and foremost on large sequence databases such as UniRef. More recently, models have started using structure alongside sequence for “inverse folding”, for which PDB and CATH provide a starting point but AlphaFold DB has enabled bigger performance jumps. Large labeled taxonomic datasets have also been used for training family-conditional models, but this area has been less explored. Finally, for more specific functions, datasets tend to be smaller or proprietary and so will vary more based on application.