Introducing Gaia: Context-Aware Protein Search Across Genomic Datasets


We are excited to announce Gaia, a new sequence search platform that enables real-time, context-aware protein sequence searches across large genomic datasets. While traditional sequence search methods like BLAST focus solely on amino acid sequences and tools like Foldseek utilize protein structure, Gaia leverages crucial genomic context information that is often indicative of protein function, especially in microbial systems.

Key Features

Context-aware protein sequence search incorporating genomic neighborhood information.
Real-time search over 85M protein clusters using the Qdrant vector database.
Integration of sequence, structure, and context information.

Understanding Gaia Search

Protein sequence similarity search is fundamental to genomics research, but current methods typically neglect genomic context information. Gaia addresses this limitation through gLM2, a mixed-modality genomic language model trained on both amino acid sequences and their genomic neighborhoods.

The key innovation lies in Gaia's ability to generate embeddings that integrate sequence-structure-context information. This approach enables the identification of functionally related genes found in conserved genomic contexts that may be missed by traditional sequence- or structure-based searches alone.
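
To make the idea of embedding-based retrieval concrete, here is a minimal sketch of a nearest-neighbor lookup over a toy index. It is not Gaia's implementation: Gaia serves its search through the Qdrant vector database, whereas this sketch does a brute-force scan in plain Python. The three-dimensional vectors, the protein IDs, the helper names `cosine_similarity` and `top_k`, and the choice of cosine similarity as the metric are all assumptions made for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def top_k(query, index, k=2):
    """Return the k (protein_id, score) pairs most similar to the query."""
    scored = [(pid, cosine_similarity(query, vec)) for pid, vec in index.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]

# Hypothetical embeddings; real embeddings would come from the model
# and have far more dimensions.
toy_index = {
    "protein_A": [0.9, 0.1, 0.0],
    "protein_B": [0.8, 0.2, 0.1],
    "protein_C": [0.0, 0.1, 0.9],
}
query_embedding = [1.0, 0.0, 0.0]

hits = top_k(query_embedding, toy_index, k=2)
for protein_id, score in hits:
    print(f"{protein_id}\t{score:.3f}")
```

A brute-force scan like this costs time linear in the number of indexed vectors, which is why a search over tens of millions of clusters typically relies on a dedicated vector database instead.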

**Genomic language modeling for generating Gaia embeddings.** Left: gLM2 is a genomic language model trained on multi-gene metagenomic contigs from the Open MetaGenome (OMG) dataset. gLM2 learns context-aware representations of proteins. Right: To give gLM2 representations structural awareness, we fine-tune gLM2 to align with structural clusters from the AlphaFold database (AFDB).

Benchmarking Gaia Search

We benchmark Gaia's retrieval quality along three main axes of information (sequence, genomic context, and structure) against commonly used sequence annotation tools. Gaia retrieves proteins that are similar in sequence, genomic context, and structure in about 0.2 s per query, orders of magnitude faster than existing tools.

**Gaia retrieval sensitivities (Recall@K) across three axes of information:** Sequence (left), Genomic Context (middle), Protein structure (right). Read our manuscript for detailed benchmarking methods and results.
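
For readers unfamiliar with the metric, the sketch below computes Recall@K under its common definition: the fraction of all relevant items that appear among the top K results. The manuscript may define relevance or normalize the score differently, so treat this as an assumption. The function name `recall_at_k` and the protein IDs are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant items found in the first k ranked results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    retrieved = set(ranked_ids[:k])
    return len(retrieved & relevant) / len(relevant)

# Hypothetical ranked search results and ground-truth relevant set.
ranked = ["p1", "p7", "p3", "p9", "p2"]
relevant = {"p1", "p2", "p3"}

r3 = recall_at_k(ranked, relevant, k=3)  # p1 and p3 found: 2 of 3
r5 = recall_at_k(ranked, relevant, k=5)  # all three found
print(f"Recall@3 = {r3:.2f}, Recall@5 = {r5:.2f}")
```

Recall@K can only stay the same or rise as K grows, so curves of this metric are usually read by comparing tools at the same K.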

Discovering novel protein functions with Gaia

We demonstrate Gaia's utility through two case studies:

1. Phage Tail Protein Discovery: Gaia successfully identified a previously uncharacterized phage tail protein, validated with structural and genomic context information.

2. Siderophore Loci: Using Gaia, we discovered novel siderophore biosynthetic loci that were previously difficult to identify with traditional tools, showcasing its ability to find functionally related sequences across phylogenetic distances.

Read more about our case studies in our manuscript.

Looking Forward

Gaia represents a significant advance in protein sequence search by incorporating genomic context information. While BLAST excels at finding sequence similarities and Foldseek focuses on structural relationships, Gaia identifies related proteins that share conserved genomic contexts, at search speeds orders of magnitude faster than existing methods. This capability enables researchers to address complex questions in comparative genomics, such as identifying conserved gene clusters across distantly related organisms and elucidating the function of hypothetical proteins based on their genomic neighborhoods.

If you are interested in trying Gaia, please visit gaia.tatta.bio. We welcome feedback from the community as we continue to develop and improve this resource.

Read the preprint