Introducing DGEB: the Diverse Genomic Embedding Benchmark

We're pleased to announce the release of DGEB, the Diverse Genomic Embedding Benchmark – the first large-scale benchmark designed to evaluate the functional capabilities of biological language models. While biological structure prediction has advanced rapidly thanks to large-scale, trusted benchmarks like CASP, progress in biological function prediction has been slow due to the absence of comprehensive benchmarks. DGEB addresses this gap by providing diverse functional benchmarks for assessing protein language model (pLM) and genomic language model (gLM) representations, paving the way for AI-accelerated functional discovery.

DGEB is a set of 18 evaluation tasks that assess model performance in both amino acid and nucleotide modalities. These tasks address two critical problems with existing functional evaluation approaches: taxonomic bias in evaluation datasets, and a lack of evaluations that assess functional relationships between genomic elements.
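To make concrete what an embedding evaluation of this kind involves, here is a minimal sketch (illustrative only; the mean pooling, logistic-regression probe, and function names below are our own assumptions, not the benchmark's exact setup). A frozen language model embeds each sequence, the per-token embeddings are pooled into a fixed-size vector, and a lightweight probe scores the representation on a labeled task:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score


def mean_pool(token_embeddings: np.ndarray) -> np.ndarray:
    """Collapse a (sequence_length, hidden_dim) matrix of per-token
    embeddings into a single fixed-size vector by averaging."""
    return token_embeddings.mean(axis=0)


def score_classification_task(train_emb, train_labels, test_emb, test_labels) -> float:
    """Fit a lightweight probe on frozen embeddings and report macro F1,
    so the score reflects the representation rather than the probe."""
    probe = LogisticRegression(max_iter=1000)
    probe.fit(train_emb, train_labels)
    return f1_score(test_labels, probe.predict(test_emb), average="macro")


# Hypothetical usage: `embed(seq)` stands in for any pLM/gLM encoder that
# returns per-token embeddings for a sequence.
# train_emb = np.stack([mean_pool(embed(seq)) for seq in train_seqs])
# test_emb = np.stack([mean_pool(embed(seq)) for seq in test_seqs])
# print(score_classification_task(train_emb, train_labels, test_emb, test_labels))
```

Because the probe is deliberately simple, differences in the score largely reflect differences in what the embeddings themselves capture.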

Overview of tasks and datasets in DGEB. Nucleic acid (NA) and amino acid (AA) modality-specific datasets are marked in purple and green, respectively; datasets that support both modalities are marked with both colors.


Evaluations that Encompass Biological Diversity

DGEB addresses the first issue, taxonomic bias, which stems from the fact that the vast majority of existing evaluation datasets draw only on model organisms. Evaluating on such data reinforces existing biases and hinders the discovery and design of diverse biological sequences. Foundation models must learn from data spanning the tree of life, and the tasks used to evaluate them should reflect that breadth.

The 18 datasets upon which the DGEB evaluation tasks are based span all three domains of life (Bacteria, Archaea, and Eukarya), including numerous underrepresented groups from each domain. Crucially, our findings reveal a key limitation of existing foundation models: for comparatively underrepresented groups, such as Archaea, model performance is poor and does not improve with scaling. 

Phylogenetic tree of all phyla represented in DGEB datasets. One representative 16S/18S sequence for each phylum represented in any DGEB dataset was obtained from SILVA, where available. Phylogeny was estimated using IQ-TREE 2. Widths of tree branches correspond to how well a given phylum is represented across multiple datasets.


Enabling Assessment of Diverse Model Capabilities

DGEB also addresses the second critical issue: the lack of evaluations assessing functional relationships between genomic elements, such as co-transcribed genes and interacting subunits of multi-protein complexes. One notable example is operon prediction: identifying sets of genes that are transcribed together as a single unit. This capability is central to understanding gene regulation and has only recently become possible with biological language models. The DGEB benchmarks include operon prediction tasks for Vibrio cholerae, Escherichia coli, and Synechococcus elongatus, leveraging curated data from the BioCyc database. These tasks allow us to evaluate models' ability to capture complex genomic relationships, a crucial step towards more comprehensive biological understanding.
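As a rough sketch of how a pair-level task like this can be scored (illustrative only; the cosine-similarity scoring, AUROC metric, and function names here are assumptions rather than the exact DGEB implementation), one can embed each gene, score adjacent gene pairs by embedding similarity, and ask how well that similarity separates co-transcribed pairs from non-operon pairs:

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two gene embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def operon_pair_auroc(gene_embeddings: np.ndarray, pair_indices, pair_labels) -> float:
    """Score each adjacent gene pair by the similarity of its two gene
    embeddings, then compute AUROC against curated same-operon labels
    (1 = co-transcribed, 0 = not), so no similarity threshold is needed."""
    scores = [
        cosine_similarity(gene_embeddings[i], gene_embeddings[j])
        for i, j in pair_indices
    ]
    return roc_auc_score(pair_labels, scores)
```

A threshold-free metric like AUROC makes it possible to compare models whose embedding spaces have very different similarity scales.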

Inferring biological function requires understanding the evolutionary and functional relationships between biomolecules. The multi-sequence evaluation tasks in DGEB allow us to ask new and important questions about biological language models, addressing vital gaps in current evaluation methods. These questions span various aspects of biological understanding:

  1. Evolutionary relationships: How well do learned embedding distances recapitulate phylogenetic distances (see the sketch below)? Can pLMs distinguish between paralogs (genes related by duplication) and orthologs (genes related by speciation)?

  2. Cross-domain applicability: Can pLMs be used to retrieve archaeal homologs of bacterial proteins, demonstrating their ability to capture similarities across diverse life forms?

  3. Regulatory interactions: Do pLMs learn co-regulatory interactions, suggesting an understanding of gene expression control?

  4. Complex genomic structures: Can multi-element functional loci, such as biosynthetic gene clusters (groups of genes involved in producing specific compounds), be captured in learned representations?

By addressing these questions, DGEB provides a comprehensive assessment of biological language models' ability to capture and represent complex biological relationships.
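The first question above, for example, can be made concrete as a rank correlation between pairwise distances in embedding space and pairwise phylogenetic distances. The sketch below is our own illustration; the choice of cosine distance and Spearman correlation is an assumption, not necessarily the benchmark's exact metric:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr


def embedding_vs_phylogeny_correlation(embeddings: np.ndarray,
                                       phylo_distances: np.ndarray) -> float:
    """Rank-correlate pairwise cosine distances in embedding space with the
    corresponding pairwise phylogenetic distances (both given as condensed
    upper-triangle vectors over the same sequence order). A higher correlation
    means the embedding space better recapitulates the tree."""
    embedding_distances = pdist(embeddings, metric="cosine")
    correlation, _ = spearmanr(embedding_distances, phylo_distances)
    return float(correlation)
```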


Average performance across all AA and NA tasks for the models benchmarked in the DGEB manuscript. Marker size corresponds to embedding dimension, and variants of the same model (e.g., evo-1-8k-base and evo-1-131k-base) are distinguished with text labels.


Comparing Nucleotide and Amino Acid Language Models

Some of the DGEB tasks allow for direct comparison between DNA- and protein-based language models. Interestingly, we find that for most functional tasks, models trained on nucleic acids perform poorly compared to protein-based models, and their performance does not improve with scale.

Comparison of AA and NA model representations on tasks that support both modalities. Marker color corresponds to the model type and point size corresponds to the number of parameters in the model being evaluated.


Inspiration

DGEB is inspired by benchmarks from the field of natural language processing, in particular MTEB (the Massive Text Embedding Benchmark) from Hugging Face. Transparent, collaborative, and diverse evaluation datasets are pivotal to the success of our field going forward. We highly encourage the contribution of further expert-curated datasets through our GitHub repository: https://github.com/TattaBio/DGEB.

Authors

Jacob West-Roberts (@jwestrob)

Joshua Kravitz (@_joshuakravitz)

Nishant Jha (@parambulat0r)

Yunha Hwang (@micro_yunha)

Andre Cornman (@ancornman1)

This work is funded by Schmidt Futures.
