• Biological sequences are the fundamental data type through which scientists interpret biology. Despite the exponential increase in the amount of sequence data, we remain limited in our ability to predict functions for the vast majority of sequences. For instance, fewer than 1% of sequenced genes have laboratory-validated functions, and fewer than half can be associated with a hypothesized function. We need a more intelligent and efficient way to propagate functional information across biological sequences. Tatta Bio is building a new data infrastructure and a search engine to map sequences to function. We first target protein functions, focusing on highly diverse sequences.

    Search Gaia

    Read our paper

    Blog

  • We train the first mixed-modality genomic language model (gLM2) that leverages genomic context information to learn robust functional representations and coevolutionary signals in protein-protein interfaces. Trained on the pruned OMG corpus, gLM2 learns contextualized representations of genomic contigs, which are represented as sequences of coding sequence (CDS) and intergenic sequence (IGS) elements. We demonstrate efficient scaling and improved performance across downstream tasks compared to uncontextualized protein language models trained on curated databases. We further demonstrate gLM2's ability to learn protein-protein interfaces at the residue level, paving the way toward unsupervised protein-protein complex prediction. The gLM2 models and supporting code are publicly available for download. A sketch of the mixed-modality contig representation appears after the links below.

    Read our paper

    gLM2 Repo

    Blog
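
    A minimal sketch of how a contig could be encoded as alternating CDS and IGS elements. The element fields, strand markers, and per-character tokens are illustrative assumptions, not the released gLM2 tokenizer.

```python
# Hedged sketch: one way a mixed-modality contig could be represented.
# Field names and special tokens are illustrative assumptions.
from dataclasses import dataclass
from typing import List, Literal

@dataclass
class Element:
    kind: Literal["CDS", "IGS"]   # coding sequence vs. intergenic sequence
    seq: str                      # amino acids for CDS, nucleotides for IGS
    strand: int = 1               # +1 / -1 orientation for CDS elements

def to_tokens(contig: List[Element]) -> List[str]:
    """Flatten a contig into a single mixed-modality token stream."""
    tokens: List[str] = []
    for el in contig:
        # mark element boundaries and orientation with a special token
        tokens.append("<+>" if el.strand == 1 else "<->")
        # amino-acid tokens for CDS, lowercase nucleotide tokens for IGS
        tokens.extend(list(el.seq) if el.kind == "CDS" else list(el.seq.lower()))
    return tokens

contig = [
    Element("CDS", "MKTAYIAKQR", strand=1),
    Element("IGS", "ATGCCGTTAG"),
    Element("CDS", "MLSPEQKAAV", strand=-1),
]
print(to_tokens(contig)[:16])
```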

  • Biological language model performance depends heavily on pretraining data quality, diversity, and size. While metagenomic datasets feature enormous biological diversity, their use as pretraining data has been limited by challenges in data accessibility, quality filtering, and deduplication. Here, we present the Open MetaGenomic (OMG) corpus, a genomic pretraining dataset totaling 3.1T base pairs and 3.3B protein coding sequences, obtained by combining two large metagenomic repositories (JGI's IMG and EMBL's MGnify). We first document the composition of the dataset and describe the quality filtering steps taken to remove poor-quality data. We make the OMG corpus available as a mixed-modality genomic sequence dataset that represents multi-gene genomic sequences with translated amino acids for protein coding sequences and nucleic acids for intergenic sequences. Furthermore, we show that deduplication in embedding space can be used to balance the corpus, demonstrating improved performance on downstream tasks; a sketch of this idea appears after the links below.

    Read our paper

    OMG Repo

    Blog
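
    A minimal sketch of greedy deduplication in embedding space, assuming precomputed per-sequence embeddings and an illustrative cosine-similarity threshold; the exact OMG balancing procedure may differ.

```python
# Hedged sketch: pruning near-duplicates in embedding space. The random
# embeddings, threshold, and greedy strategy are illustrative assumptions.
import numpy as np

def greedy_dedup(emb: np.ndarray, threshold: float = 0.95) -> list:
    """Keep an example only if its cosine similarity to every already-kept
    example is below `threshold`. Returns indices of kept examples."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalize
    kept = []
    for i in range(len(emb)):
        if not kept or (emb[kept] @ emb[i]).max() < threshold:
            kept.append(i)
    return kept

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 64)).astype(np.float32)  # stand-in embeddings
kept = greedy_dedup(embeddings, threshold=0.95)
print(f"kept {len(kept)} of {len(embeddings)} sequences")
```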

  • Biological foundation models hold significant promise for deciphering complex biological functions. However, evaluating their performance on functional tasks remains challenging due to the lack of standardized benchmarks encompassing diverse sequences and functions. Existing functional annotations are often scarce, biased, and susceptible to train-test leakage, hindering robust evaluation. Furthermore, biological functions manifest at multiple scales, from individual residues to large genomic segments. To address these limitations, we introduce the Diverse Genomic Embedding Benchmark (DGEB), inspired by natural language embedding benchmarks. DGEB comprises six embedding tasks across 18 expert-curated datasets, spanning sequences from all domains of life and encompassing both nucleic acid and amino acid modalities. Notably, four datasets enable direct comparison between models trained on different modalities. Benchmarking protein and genomic language models (pLMs and gLMs) on DGEB reveals performance saturation with model scaling on numerous tasks, especially those with underrepresented sequences (e.g., Archaea). This highlights the limitations of existing modeling objectives and training data distributions for capturing diverse biological functions. DGEB is available as an open-source package with a public leaderboard; the general shape of an embedding task is sketched after the links below.

    DGEB Repo

    Read our paper

    Blog
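
    A minimal sketch of the general shape of an embedding-benchmark task: frozen model embeddings plus a lightweight probe. This is not the DGEB package API; the arrays below are random placeholders for a real task's data and labels.

```python
# Hedged sketch: linear-probe evaluation on frozen embeddings. Embedding
# dimensions, label set, and splits are placeholder assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 128))   # frozen model embeddings (train split)
y_train = rng.integers(0, 5, size=500)  # e.g. functional class labels
X_test = rng.normal(size=(200, 128))
y_test = rng.integers(0, 5, size=200)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("macro F1:", f1_score(y_test, probe.predict(X_test), average="macro"))
```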

  • Deciphering the relationship between a gene and its genomic context is fundamental to understanding and engineering biological systems. Machine learning has shown promise in learning latent relationships underlying the sequence-structure-function paradigm from massive protein sequence datasets. However, to date, limited attempts have been made to extend this continuum to include higher-order genomic context information. Evolutionary processes dictate the specificity of genomic contexts in which a gene is found across phylogenetic distances, and these emergent genomic patterns can be leveraged to uncover functional relationships between gene products. Here, we train a genomic language model (gLM) on millions of metagenomic scaffolds to learn the latent functional and regulatory relationships between genes. gLM learns contextualized protein embeddings that capture the genomic context as well as the protein sequence itself, and encode biologically meaningful and functionally relevant information (e.g., enzymatic function, taxonomy). Our analysis of the attention patterns demonstrates that gLM learns co-regulated functional modules (i.e., operons). Our findings illustrate that gLM’s unsupervised deep learning of the metagenomic corpus is an effective and promising approach to encode functional semantics and regulatory syntax of genes in their genomic contexts and to uncover complex relationships between genes in a genomic region. A sketch of contextualizing per-gene embeddings appears after the link below.

    Read our paper
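
    A minimal sketch of contextualizing per-gene embeddings with a transformer encoder, in the spirit of gLM. Dimensions, layer counts, and the source of the input gene embeddings are illustrative assumptions, not the paper's released architecture.

```python
# Hedged sketch: per-gene embeddings of one contig are contextualized by a
# transformer encoder; attention between gene positions can then be inspected.
# Sizes and the random inputs are placeholder assumptions.
import torch
import torch.nn as nn

d_model, n_genes = 256, 12                      # embedding size, genes per contig
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=4)

# stand-in for per-gene (e.g. pLM-derived) embeddings of one genomic contig
gene_embeddings = torch.randn(1, n_genes, d_model)
contextualized = encoder(gene_embeddings)       # (1, n_genes, d_model)

# attention weights between gene positions, from a raw attention layer
attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
_, attn_weights = attn(gene_embeddings, gene_embeddings, gene_embeddings,
                       need_weights=True, average_attn_weights=True)
print(contextualized.shape, attn_weights.shape)  # (1, 12, 256), (1, 12, 12)
```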