The OMG Dataset: the CommonCrawl of Biological Sequences

We are excited to announce the release of the OMG (Open MetaGenomic) corpus, a new resource designed to accelerate research in genomic and protein language models. This carefully curated dataset represents a significant step forward in making large-scale, high-quality metagenomic data more accessible to the AI research community.

Key Features of the OMG Dataset:

  • 3.1 trillion base pairs of metagenomic data

  • 3.3 billion protein sequences

  • Quality filtered and semantically deduplicated

  • Publicly available on HuggingFace

By providing a well-documented, pre-processed, and high-diversity dataset, we hope to remove barriers and accelerate progress in the rapidly evolving field of biological language models.

What Makes OMG Different?

Metagenomes offer the greatest available sequence diversity of any genomic data source, which makes them invaluable for training robust and versatile protein and genomic language models. However, using large metagenomic datasets to develop genomic language models poses several difficulties. First, metagenomic assemblies need to be filtered to ensure the quality of the data being fed to the model; a minimal illustration of such a filter is sketched below. Second, metagenomes contain duplicated data and sampling biases, so deduplication is needed to achieve a balanced dataset for model training. The OMG dataset addresses these challenges, offering a ready-to-use resource for researchers.
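To make the first challenge concrete, a minimal, hypothetical quality filter might drop short contigs and contigs with a high fraction of ambiguous bases. The thresholds below (`min_length`, `max_n_frac`) are illustrative placeholders; the actual OMG filtering criteria are described in the manuscript and on the dataset card.

```python
# A minimal, hypothetical sketch of assembly-level quality filtering.
# The thresholds here are illustrative only; OMG's actual filters are
# documented in the manuscript and on the dataset card.
def passes_quality_filter(contig: str, min_length: int = 2000,
                          max_n_frac: float = 0.01) -> bool:
    """Keep contigs that are long enough and mostly unambiguous."""
    if len(contig) < min_length:
        return False  # too short to be a reliable training example
    n_frac = contig.upper().count("N") / len(contig)
    return n_frac <= max_n_frac  # reject contigs dominated by 'N' bases

contigs = ["ACGT" * 1000, "ACGTN" * 100]  # toy contigs: 4,000 bp and 500 bp
kept = [c for c in contigs if passes_quality_filter(c)]  # keeps only the first
```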

To provide a point of comparison, consider two popular datasets used for model development in natural language processing: Wikipedia and CommonCrawl. Wikipedia is small and intensively curated by a community of editors; CommonCrawl is far larger, but its size rules out that same type of careful curation. Similarly, many biological sequence models rely on databases such as UniProt or SwissProt for training, which are intensively curated and trustworthy, but not large. OMG serves a purpose similar to CommonCrawl's: a significantly more diverse, yet unstructured, data resource that can be leveraged in the development of biological language models.

Quality Filtering and Genomic SemDeDup

The OMG dataset was created from the datasets available in two databases: JGI’s IMG database and EMBL-EBI’s MGnify database. After compiling this data, we performed multiple quality filtering and pre-processing steps to create both protein-only and multi-modal (i.e., DNA and predicted proteins together) datasets; the details of these steps can be found alongside the HuggingFace dataset and in our manuscript. A quick way to start exploring the released data is sketched below.
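As a minimal sketch, the corpus can be streamed with the Hugging Face `datasets` library without downloading it in full. The `"train"` split name is an assumption to verify against the dataset card, and the record schema varies by release.

```python
# A minimal sketch: stream OMG examples from the Hugging Face Hub without
# downloading the full ~3.1 Tbp corpus. Requires `pip install datasets`.
from datasets import load_dataset

# streaming=True iterates lazily over the remote files; the "train" split
# name is an assumption -- check the dataset card for the actual splits.
omg = load_dataset("tattabio/OMG", split="train", streaming=True)

for i, example in enumerate(omg):
    print(example.keys())  # inspect the record schema for this release
    if i >= 2:
        break
```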

One particular challenge of working with large-scale metagenomic datasets is the need to account for sampling bias and redundancy: some types of organisms and samples are far more common than others across these large databases. To address this challenge we implemented genomic SemDeDup (semantic deduplication), an embedding-based deduplication method for genomic sequences. This method allows the corpus to be tunably balanced and pruned, as sketched below. In particular, it reduces the bias of the dataset towards highly represented organisms such as certain groups of Pseudomonadota (including, for example, E. coli, the most commonly studied bacterium in microbiology).
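The sketch below illustrates the general idea of embedding-based semantic deduplication in the spirit of SemDeDup, not the exact OMG pipeline (see the manuscript for that): examples are clustered in embedding space, and near-duplicate pairs within each cluster, i.e. those with cosine similarity above a tunable threshold, are greedily pruned. The `embeddings` array, cluster count, and threshold are all placeholders.

```python
# A simplified sketch of semantic deduplication in the spirit of SemDeDup.
# `embeddings` is assumed to be an (n, d) float array of per-example
# genomic embeddings; the production OMG pipeline differs in detail.
import numpy as np
from sklearn.cluster import KMeans

def semdedup(embeddings: np.ndarray, n_clusters: int = 1024,
             threshold: float = 0.95) -> np.ndarray:
    """Return indices of examples to keep; a lower threshold prunes more."""
    # Normalize rows so dot products equal cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(normed)

    keep: list[int] = []
    for c in range(n_clusters):
        idx = np.flatnonzero(labels == c)
        sim = normed[idx] @ normed[idx].T   # pairwise cosine similarity
        removed = np.zeros(len(idx), dtype=bool)
        for i in range(len(idx)):
            if removed[i]:
                continue
            # Greedily prune later cluster members that are near-duplicates
            # of the example being kept (which duplicate to keep is a
            # simplification here).
            removed |= (sim[i] > threshold) & (np.arange(len(idx)) > i)
        keep.extend(idx[~removed].tolist())
    return np.array(sorted(keep))
```

Raising the similarity threshold retains more of the corpus, while lowering it prunes more aggressively; this single knob is what makes the balancing tunable.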

Figure: UMAP visualization of the OMG dataset. A) UMAP visualization of the OG dataset examples, colored by taxonomic phylum. B) Semantic deduplication of the OG dataset, with pruned points highlighted in blue. C) Comparison of the OG and OMG datasets using a random 0.1% subset of each; notably, the metagenomic data (OMG) exhibits higher diversity. See the manuscript for the full figure legend.

Advancing Genomic and Protein Language Models

The OMG dataset represents a significant step forward in providing a high-quality, large-scale metagenomic corpus ready for genomic and protein language model development. By addressing key challenges such as data preprocessing, quality filtering, and deduplication, OMG aims to facilitate access to diverse biological sequence data and accelerate the field of BioML.

The scale of OMG - 3.1 trillion base pairs and 3.3 billion protein sequences - provides a robust foundation for training increasingly sophisticated models. But it's not just about size: the diverse metagenomic data in OMG, coupled with careful preprocessing and deduplication and the preservation of genomic context information, makes it a uniquely valuable resource for the research community.

For those interested in the technical details of how OMG was created, including our quality control measures and the implementation of genomic SemDeDup, we invite you to read our preprint describing OMG and our new genomic language model, gLM2. Your feedback and suggestions are welcome as we continue to refine and improve this resource. We're excited to see where OMG might lead and how it might contribute to pushing the boundaries of what's possible in genomic and protein language modeling.

GitHub: https://github.com/TattaBio/OMG

HF: https://huggingface.co/datasets/tattabio/OMG

gLM2 blog: https://www.tatta.bio/blog/glm2

Authors: Andre Cornman (Tatta Bio), Jacob West-Roberts (Tatta Bio), Antonio Pedro Camargo (JGI), Simon Roux (JGI), Martin Beracochea (EMBL-EBI), Milot Mirdita (SNU), Sergey Ovchinnikov (MIT), Yunha Hwang (Tatta Bio)
