gLM2: The First Mixed-Modality Genomic Language Model
Alongside the release of our latest open-source metagenomic training corpus, OMG, we are thrilled to announce our latest model, gLM2, the first mixed-modality genomic language model. gLM2 improves over its previous iteration, gLM1, in several key respects, including residue-level sequence representations, intergenic sequence representations, and end-to-end training.
Improved Functional Representations
Using our Diverse Genomic Embedding Benchmark (DGEB) suite, we compared the performance of gLM2 to ESM2 under two scenarios: one where we train on a random subset of the quality-filtered data from the JGI IMG and EMBL-EBI MGnify databases, and one where we prune our training dataset to decrease bias towards highly represented organisms and sample types (see our blog post about the OMG dataset and semantic deduplication). We see small overall increases in DGEB score after pruning, but noticeable improvements on tasks requiring knowledge of less well-represented taxonomic groups such as the Archaea, highlighting the utility of diversified training data.
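The pruning idea can be sketched in a few lines. This is a minimal illustration, not the OMG pipeline itself (the linked blog post describes the actual semantic deduplication procedure): it assumes we have one embedding per genomic example and greedily drops any example whose cosine similarity to an already-kept example exceeds a threshold, which reduces over-represented near-duplicates.

```python
import numpy as np

def semantic_dedup(embeddings, threshold=0.95):
    """Greedy embedding-based pruning (illustrative sketch).

    Keep an example only if its cosine similarity to every
    already-kept example is below `threshold`.
    Returns the indices of the kept examples.
    """
    # L2-normalize rows so dot products are cosine similarities.
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept = []
    for i, x in enumerate(X):
        if all(np.dot(x, X[j]) < threshold for j in kept):
            kept.append(i)
    return kept

# Two near-duplicate embeddings and one distinct one:
emb = np.array([[1.0, 0.0],
                [0.999, 0.01],   # near-duplicate of the first
                [0.0, 1.0]])     # distinct
print(semantic_dedup(emb))       # the near-duplicate is pruned
```

A real pipeline would use approximate nearest-neighbor search rather than the O(n²) greedy loop above, but the effect is the same: highly represented organisms and sample types contribute fewer redundant examples to the training set.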
gLM2 Learns Protein-Protein Contact Maps Without Supervision
We next set out to test gLM2’s capability to learn important evolutionary and co-evolutionary signals. Using the categorical Jacobian method first demonstrated in this recent preprint from Zhang et al., we show that gLM2 does indeed learn co-evolutionary signals without supervision. In particular, we observe that gLM2 learns the contact points between ABC transporter subunits ModA and ModC, just like alignment-based approaches. Crucially, models like Evo and ESM2 do not learn this information. This highlights the value of training models using multi-modal data and genomic context, allowing gLM2 to learn important evolutionary signals between biomolecules.
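The categorical Jacobian analysis can be sketched as follows. This is a toy illustration of the general technique, not gLM2's actual inference code: we assume a function returning per-position logits (standing in for the model), substitute every token at every position, record how the output logits shift elsewhere in the sequence, and collapse the vocabulary axes into an L x L coupling map whose strong off-diagonal entries correspond to putative contacts.

```python
import numpy as np

def categorical_jacobian(logits_fn, seq, vocab_size):
    """J[i, a, j, b] = change in logit (j, b) when position i is set to token a.

    `logits_fn` maps an integer sequence of length L to an (L, vocab) array.
    """
    L = len(seq)
    base = logits_fn(seq)
    J = np.zeros((L, vocab_size, L, vocab_size))
    for i in range(L):
        for a in range(vocab_size):
            mutated = seq.copy()
            mutated[i] = a
            J[i, a] = logits_fn(mutated) - base
    return J

def contact_map(J):
    # Frobenius norm over the two vocabulary axes gives an L x L
    # coupling score; symmetrize since couplings need not be.
    C = np.linalg.norm(J, axis=(1, 3))
    return 0.5 * (C + C.T)

# Toy stand-in for a model: each position "predicts" its own token,
# and position 3 additionally depends on the token at position 1.
def toy_logits(seq):
    out = np.eye(4)[seq].astype(float)
    out[3] += np.eye(4)[seq[1]]
    return out

seq = np.array([0, 1, 2, 3, 0])
C = contact_map(categorical_jacobian(toy_logits, seq, 4))
# The coupled pair (1, 3) scores higher than uncoupled pairs such as (0, 2).
```

With a real model, `logits_fn` would be a forward pass of gLM2 over the genomic context, and the resulting map can be compared against structurally determined inter-subunit contacts, as we do for ModA and ModC.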
Multi-modal Genomic Intelligence
The development of gLM2 marks a significant advancement in genomic language modeling. By incorporating mixed-modality training on both protein and intergenic sequences, gLM2 demonstrates improved performance across our DGEB benchmark suite, particularly for less represented taxonomic groups. Moreover, its ability to learn protein-protein contact maps without supervision, as shown with the ModA and ModC subunits, highlights the potential of this approach to capture complex and higher-level biological relationships.
We have made gLM2 publicly available and encourage the scientific community to test its capabilities, probe its limitations, and contribute to its ongoing development. We welcome feedback and look forward to seeing how researchers might apply and extend gLM2 in diverse research contexts. And if you’d like to read more about gLM2, its architecture, and the OMG dataset, please read our manuscript here.
GitHub: https://github.com/TattaBio/gLM2
HF: https://huggingface.co/tattabio/gLM2_650M
OMG blog: https://www.tatta.bio/blog/omg
Authors: Andre Cornman (Tatta Bio), Jacob West-Roberts (Tatta Bio), Antonio Pedro Camargo (JGI), Simon Roux (JGI), Martin Beracochea (EMBL-EBI), Milot Mirdita (SNU), Sergey Ovchinnikov (MIT), Yunha Hwang (Tatta Bio)