gLM2: The First Mixed-Modality Genomic Language Model

Alongside the announcement of our latest open-source metagenomic training corpus, OMG, we are thrilled to announce our latest model, gLM2, the first mixed-modality genomic language model. gLM2 improves on its predecessor, gLM1, in several key respects, including residue-level sequence representations, intergenic sequence representations, and end-to-end training.

Basic schematic of the gLM2 architecture. A gene-called metagenomic contig is first preprocessed into a mixed-modal sequence consisting of coding sequence (protein) elements (in blue) and intergenic sequence (DNA) elements (in gray). The mixed-modal sequence is then masked at a 30% rate, and gLM2 is trained with a masked-token reconstruction objective.
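To make the preprocessing concrete, below is a minimal, hypothetical sketch of turning a gene-called contig into a mixed-modal sequence and masking it for training. The function names, element representation, and masking details are illustrative assumptions; the actual tokenization scheme is described in the manuscript.

```python
# A hypothetical sketch of the mixed-modal preprocessing described above.
# Element names and masking details are illustrative only.
import random

def build_mixed_modal_sequence(contig_dna, gene_calls, translate):
    """Interleave translated CDS (protein) elements with the intergenic DNA between them.

    contig_dna: the full contig as a DNA string
    gene_calls: sorted list of (start, end) CDS coordinates, 0-based half-open
    translate:  a function mapping a CDS DNA string to its amino-acid sequence
    """
    elements, cursor = [], 0
    for start, end in gene_calls:
        if start > cursor:  # intergenic DNA element, kept as nucleotides
            elements.append(("dna", contig_dna[cursor:start].lower()))
        elements.append(("protein", translate(contig_dna[start:end])))  # coding element
        cursor = end
    if cursor < len(contig_dna):
        elements.append(("dna", contig_dna[cursor:].lower()))
    return elements

def mask_tokens(tokens, mask_rate=0.30, mask_token="<mask>"):
    """Randomly replace ~30% of tokens with a mask token for masked-token reconstruction."""
    return [mask_token if random.random() < mask_rate else t for t in tokens]
```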


Improved Functional Representations

Using our Diverse Genomic Embedding Benchmark (DGEB) suite, we compared the performance of gLM2 to ESM2 under two scenarios: one where we train on a random subset of the quality-filtered data from the JGI IMG and EMBL-EBI MGnify databases, and one where we prune our training dataset to reduce bias towards highly represented organisms and sample types (see our blog post about the OMG dataset and semantic deduplication). We see a small overall increase in DGEB score after pruning, but a noticeable improvement on tasks requiring knowledge of less well-represented taxonomic groups, such as the Archaea, highlighting the utility of diversified training data.
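For intuition, here is a hedged sketch of what embedding-based pruning (semantic deduplication) can look like: examples whose embeddings are nearly identical to ones already kept are dropped, reducing the weight of over-represented organisms and sample types. The greedy strategy and similarity threshold below are illustrative assumptions, not the exact procedure used for OMG.

```python
# Illustrative greedy pruning by embedding similarity; threshold and strategy are assumptions.
import numpy as np

def prune_by_similarity(embeddings, threshold=0.95):
    """Keep an example only if its cosine similarity to every previously kept
    example is below `threshold`. Returns the indices of kept examples."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept = []
    for i, vec in enumerate(normed):
        if not kept or np.max(normed[kept] @ vec) < threshold:
            kept.append(i)
    return kept
```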

DGEB scores for gLM2 (pruned and unpruned) and ESM2. The x-axis shows training FLOPs (the amount of compute required to train the model); the y-axis shows the combined score on the DGEB benchmarks.


gLM2 Learns Protein-Protein Contact Maps Without Supervision


We next set out to test gLM2’s capability to learn important evolutionary and co-evolutionary signals. Using the categorical Jacobian method, first demonstrated in a recent preprint from Zhang et al., we show that gLM2 does indeed learn co-evolutionary signals without supervision. In particular, we observe that gLM2 learns the contact points between the ABC transporter subunits ModA and ModC, just as alignment-based approaches do. Crucially, models like Evo and ESM2 do not learn this information. This highlights the value of training on multi-modal data with genomic context, which allows gLM2 to learn important evolutionary signals between biomolecules.
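For readers who want to reproduce this kind of analysis, below is a hedged sketch of the categorical Jacobian computation. The `model_logits` interface is a hypothetical stand-in for any masked language model that returns per-position logits; adapting it to gLM2’s mixed-modal tokenization (and restricting it to the protein positions of ModA and ModC) is left to the actual implementation.

```python
# A hedged sketch of the categorical Jacobian analysis (after Zhang et al.).
# `model_logits` is a hypothetical callable: it takes a list of tokens and
# returns an (L, V) numpy array of per-position logits.
import numpy as np

def categorical_jacobian(model_logits, seq_tokens, vocab):
    """J[i, a, j, b] = logits at (position j, token b) after substituting token a
    at position i, minus the wild-type logits at (j, b)."""
    L, V = len(seq_tokens), len(vocab)
    wt = model_logits(seq_tokens)                 # (L, V) wild-type logits
    J = np.zeros((L, V, L, V), dtype=np.float32)
    for i in range(L):
        for a, tok in enumerate(vocab):
            mutated = list(seq_tokens)
            mutated[i] = tok                      # single-position substitution
            J[i, a] = model_logits(mutated) - wt  # response of every position/logit
    return J

def contact_map(J):
    """Collapse the Jacobian to a symmetric L x L map (Frobenius norm over the
    substitution/response channels), then apply average-product correction (APC)."""
    C = np.sqrt((J ** 2).sum(axis=(1, 3)))        # (L, L)
    C = 0.5 * (C + C.T)
    apc = C.mean(axis=0, keepdims=True) * C.mean(axis=1, keepdims=True) / C.mean()
    return C - apc
```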

gLM2 learns ModAC inter-protein contact sites. Protein-protein contact sites between ModA and ModC are learned by gLM2 and the alignment-based method GREMLIN. Evo and ESM2 do not learn these co-evolutionary patterns.


Multi-modal Genomic Intelligence

The development of gLM2 marks a significant advancement in genomic language modeling. By incorporating mixed-modality training on both protein and intergenic sequences, gLM2 demonstrates improved performance across our DGEB benchmark suite, particularly for less represented taxonomic groups. Moreover, its ability to learn protein-protein contact maps without supervision, as shown with the ModA and ModC subunits, highlights the potential of this approach to capture complex and higher-level biological relationships.

We have made gLM2 publicly available and encourage the scientific community to test its capabilities, probe its limitations, and contribute to its ongoing development. We welcome feedback and look forward to seeing how researchers apply and extend gLM2 in diverse research contexts. If you’d like to learn more about gLM2, its architecture, and the OMG dataset, please read our manuscript here.
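As a starting point, here is a minimal usage sketch. It assumes the checkpoint loads through Hugging Face transformers with `trust_remote_code=True` and that the model returns per-token hidden states; the example input string is purely illustrative, and the actual mixed-modality input format is documented on the model card linked below.

```python
# Minimal usage sketch; input formatting and output attributes are assumptions --
# see https://huggingface.co/tattabio/gLM2_650M for the authoritative instructions.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "tattabio/gLM2_650M"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

# Placeholder input: replace with a properly formatted mixed-modal contig
# (protein and intergenic DNA elements) per the model card.
sequence = "MSKEELAR..."
inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    embeddings = model(**inputs).last_hidden_state  # residue-level representations
```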



GitHub: https://github.com/TattaBio/gLM2

HF: https://huggingface.co/tattabio/gLM2_650M

OMG blog: https://www.tatta.bio/blog/omg

Authors: Andre Cornman (Tatta Bio), Jacob West-Roberts (Tatta Bio), Antonio Pedro Camargo (JGI), Simon Roux (JGI), Martin Beracochea (EMBL-EBI), Milot Mirdita (SNU), Sergey Ovchinnikov (MIT), Yunha Hwang (Tatta Bio)
