Introducing Gaia: Context-Aware Protein Search Across Genomic Datasets
We are excited to announce Gaia, a new sequence search platform that enables real-time, context-aware protein sequence searches across large genomic datasets. Traditional sequence search methods like BLAST focus solely on amino acid sequences, and tools like Foldseek rely on protein structure. Gaia additionally leverages genomic context, which is often indicative of protein function, especially in microbial systems.
Key Features
Context-aware protein sequence search incorporating genomic neighborhood information.
Real-time search over 85M protein clusters using the Qdrant vector database.
Integration of sequence, structure, and context information.
Understanding Gaia Search
Protein sequence similarity search is fundamental to genomics research, but current methods typically neglect genomic context information. Gaia addresses this limitation through gLM2, a mixed-modality genomic language model trained on both amino acid sequences and their genomic neighborhoods.
The key innovation lies in Gaia's ability to generate embeddings that integrate sequence-structure-context information. This approach enables the identification of functionally related genes found in conserved genomic contexts that may be missed by traditional sequence- or structure-based searches alone.
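To make the idea of embedding-based retrieval concrete, here is a minimal sketch of a nearest-neighbor search over protein embeddings. It is an illustration only, not Gaia's implementation: the three-dimensional vectors, the protein names, the helper functions, and the choice of cosine similarity are all assumptions made for this example. In Gaia itself, embeddings come from gLM2 and the search runs in the Qdrant vector database rather than as a brute-force scan.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: List[float], index: Dict[str, List[float]], k: int = 2
) -> List[Tuple[str, float]]:
    """Brute-force search: score every indexed vector, return the k best."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy embeddings; real embeddings would be far higher-dimensional.
toy_index = {
    "protein_A": [0.9, 0.1, 0.0],
    "protein_B": [0.8, 0.2, 0.1],
    "protein_C": [0.0, 0.1, 0.9],
}

hits = top_k([1.0, 0.0, 0.0], toy_index, k=2)
for name, score in hits:
    print(f"{name}\t{score:.3f}")
```

Because sequence, structure, and context are encoded in the same vector, two proteins with divergent sequences can still land close together if they share a conserved genomic neighborhood. A dedicated vector database replaces the linear scan above with an approximate index, which is what makes sub-second queries over tens of millions of clusters feasible.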
Benchmarking Gaia Search
We benchmark Gaia's retrieval quality along three main axes of information (sequence, context, and structure) against commonly used sequence annotation tools. Gaia retrieves proteins that are similar in sequence, genomic context, and structure in about 0.2 s per query, orders of magnitude faster than existing tools.
Discovering Novel Protein Functions with Gaia
We demonstrate Gaia's utility through two case studies:
1. Phage Tail Protein Discovery: Gaia successfully identified a previously uncharacterized phage tail protein, validated with structural and genomic context information.
2. Siderophore Loci: Using Gaia, we discovered novel siderophore biosynthetic loci that were previously difficult to identify with traditional tools, showcasing its ability to find functionally related sequences across phylogenetic distances.
Read more about our case studies in our manuscript.
Looking Forward
Gaia represents a significant advance in protein sequence search by incorporating genomic context information. While BLAST excels at finding sequence similarities and Foldseek focuses on structural relationships, Gaia identifies related proteins with conserved genomic contexts, at search speeds orders of magnitude faster than existing methods. This capability enables researchers to address complex questions in comparative genomics, such as identifying conserved gene clusters across distantly related organisms and elucidating the function of hypothetical proteins based on their genomic neighborhoods.
If you are interested in trying Gaia, please visit gaia.tatta.bio. We welcome feedback from the community as we continue to develop and improve this resource.
More Resources
gLM2_embed: https://huggingface.co/tattabio/gLM2_650M_embed
OG_prot_90: https://huggingface.co/datasets/tattabio/OG_prot90