Gaia Agent: Context-Aware Functional Insights at Scale

“AI biologist” predicts protein function 

We are pleased to introduce Gaia Agent, a new feature within the Gaia platform that advances context-aware protein function prediction with agentic AI. While traditional tools focus on sequence or structure, Gaia Agent also considers genomic neighborhoods, illuminating functions hidden in complex genomic contexts. The vast majority of predicted protein-coding genes cannot be functionally annotated using sequence or structural similarity alone. To hypothesize the function of an unannotated sequence in a microbial genome, a biologist considers, researches, and reasons over its genomic context. This hypothesis-generation process is manual, requires expertise, and does not currently scale to the billions of proteins whose function is poorly characterized.

By leveraging the latest LLMs, which have broad knowledge across biology, Gaia Agent acts like an “AI Biologist.” It integrates information from local gene clusters, frequently co-occurring genes, structural search results, taxonomic information, and functional motifs. This approach can suggest putative roles even for uncharacterized proteins that have little sequence similarity to known proteins.

Gaia Agent reasons step-by-step and incorporates guidance from few-shot examples curated by expert bioinformaticians. The result is a holistic functional interpretation, linking protein sequences to potential biosynthetic clusters, metabolic pathways, or regulatory functions. This contextualized perspective supports more informed hypotheses, opening doors to discoveries in microbial ecology, biotechnology, and beyond.
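One way to picture how these evidence types and few-shot examples might come together is as a single structured prompt. The sketch below is purely illustrative: `ProteinContext`, `build_prompt`, the field names, and the prompt wording are all hypothetical and do not describe Gaia Agent's actual implementation or API. The Rv1841c values are taken from the case study below, and the few-shot text is a placeholder.

```python
from dataclasses import dataclass, field


@dataclass
class ProteinContext:
    """Hypothetical container for the evidence types named in this post."""
    query_id: str
    taxonomy: str = ""
    neighbor_annotations: list = field(default_factory=list)  # local gene cluster
    cooccurring_genes: list = field(default_factory=list)
    structural_hits: list = field(default_factory=list)
    motifs: list = field(default_factory=list)


def _section(title, items):
    # Missing evidence is stated explicitly rather than silently omitted.
    body = "\n".join(f"- {item}" for item in items) if items else "- none available"
    return f"{title}:\n{body}"


def build_prompt(ctx, few_shot_examples):
    """Assemble few-shot examples and genomic context into one prompt string."""
    parts = ["You are annotating a protein of unknown function. Reason step by step."]
    for i, example in enumerate(few_shot_examples, start=1):
        parts.append(f"Worked example {i}:\n{example}")
    parts.append(f"Query protein: {ctx.query_id}")
    parts.append(f"Taxonomy: {ctx.taxonomy or 'unknown'}")
    parts.append(_section("Neighboring genes", ctx.neighbor_annotations))
    parts.append(_section("Frequently co-occurring genes", ctx.cooccurring_genes))
    parts.append(_section("Structural search hits", ctx.structural_hits))
    parts.append(_section("Functional motifs", ctx.motifs))
    parts.append("Give a short prediction and the evidence that supports it.")
    return "\n\n".join(parts)


ctx = ProteinContext(
    query_id="Rv1841c",
    taxonomy="Mycobacterium tuberculosis",
    neighbor_annotations=["Rv1842c: CNNM-domain containing protein"],
    motifs=["CNNM domain", "CBS domain"],
)
prompt = build_prompt(ctx, ["<expert-curated worked example goes here>"])
print(prompt)
```

Keeping each evidence type in its own labeled section, and saying so when a section is empty, makes it easier to trace which piece of context a given prediction relied on.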


A case study: novel insights for hypothetical Mycobacterium tuberculosis proteins. 

Mycobacterium tuberculosis (Mtb), the bacterium responsible for tuberculosis, is the leading cause of death worldwide from a single infectious agent. Despite decades of research, approximately 40% of Mtb genes are functionally uncharacterized. We asked Gaia Agent to predict functions of a few of these genes: 

 

Example 1: Uncharacterized protein Rv1841c is a putative transmembrane magnesium transporter. 

Genomic context of Rv1841c coding gene (Query in green).

Gaia Agent prediction:

"Short Prediction": "The 349-amino acid protein Rv1841c (positions 61260:62309 on the reverse strand) is likely a transmembrane magnesium transporter or regulator of magnesium homeostasis, potentially working in concert with its adjacent CNNM-domain containing neighbor. It contains CNNM and CBS domains, suggesting a role in metal ion transport with potential regulatory functions.",
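As a quick sanity check, the coordinates and the protein length quoted in the prediction agree with each other. The sketch below assumes the span is 1-based and inclusive and that it contains the stop codon; neither convention is stated in the output above, and `protein_length_from_span` is a hypothetical helper written for this illustration.

```python
def protein_length_from_span(start, end, includes_stop=True):
    """Protein length in amino acids for a coding span.

    Assumes 1-based, inclusive nucleotide coordinates (an assumption,
    not something the agent output specifies).
    """
    span_nt = end - start + 1
    if span_nt % 3 != 0:
        raise ValueError(f"span of {span_nt} nt is not a whole number of codons")
    codons = span_nt // 3
    # The stop codon occupies one codon but encodes no amino acid.
    return codons - 1 if includes_stop else codons


# 62309 - 61260 + 1 = 1050 nt -> 350 codons -> 349 amino acids plus a stop
print(protein_length_from_span(61260, 62309))  # 349
```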

To validate this prediction, we folded Rv1841c together with its neighboring protein (Rv1842c, also of uncharacterized function), which the agent hypothesized works in concert with it. Strikingly, a Foldseek-Multimer search with the resulting complex matched an archaeal (Methanoculleus thermophilus) magnesium transporter complex despite very low (<25%) sequence similarity. 

Foldseek-Multimer alignment of Rv1841c-Rv1842c complex (yellow) with Methanoculleus sp. magnesium transporter complex (blue). 

 

Example 2: Discovery of a putative Mtb lanthipeptide locus.

Genomic context of Rv1376 coding gene (Query in green). The putative lanthipeptide in red is missed in most Mtb gene calls due to its short length and alternative initiation codon.

We asked Gaia Agent to predict the function of the hypothetical protein Rv1376. 

"Short Prediction": "The protein (genomic location 1398384-1399874) likely functions in peptide modification, potentially in concert with the adjacent YcaO domain-containing protein. It may be involved in the biosynthesis of ribosomally synthesized and post-translationally modified peptides (RiPPs) in Mycobacterium tuberculosis."

Gaia Agent uses the presence of TfuA and YcaO domains across two neighboring genes to infer that this may be a RiPP locus. Encouraged by this evidence, we looked for RiPP precursor peptide sequences in the genomic region directly upstream of Rv1375. We found a small (44 aa) open reading frame (ORF) with an alternative initiation codon (TTG) directly upstream. Using RiPPMiner, we further validated that this previously missed ORF likely encodes a class B lanthipeptide. 

RiPPMiner predicts the product of the newly identified ORF to be of lanthipeptide class B.
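The kind of search described above, looking for short ORFs that start with a non-ATG codon, can be sketched in a few lines. This is a minimal illustration and not the procedure we used: the function name `find_small_orfs`, the length bounds, and the set of start codons (ATG, GTG, TTG) are assumptions, and the test sequence is a synthetic toy, not Mtb sequence.

```python
STARTS = frozenset({"ATG", "GTG", "TTG"})  # assumed bacterial start codons
STOPS = frozenset({"TAA", "TAG", "TGA"})
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def find_small_orfs(seq, min_aa=10, max_aa=100, starts=STARTS):
    """List short ORFs on both strands, in all six reading frames.

    Returns (strand, start, end, start_codon, aa_length) tuples. Coordinates
    are 0-based, half-open, and relative to the strand that was scanned
    (for "-" that is the reverse complement). Every in-frame start codon
    before a stop is reported, so nested candidates appear separately.
    """
    seq = seq.upper()
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            open_starts = []  # start codons seen since the last in-frame stop
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if codon in starts:
                    open_starts.append(i)
                elif codon in STOPS:
                    for st in open_starts:
                        aa_len = (i - st) // 3  # codons before the stop
                        if min_aa <= aa_len <= max_aa:
                            hits.append((strand, st, i + 3, s[st:st + 3], aa_len))
                    open_starts = []
    return hits


# Synthetic toy sequence: a TTG start, 43 alanine codons, then a stop (44 aa).
toy = "TTG" + "GCT" * 43 + "TAA"
print(find_small_orfs(toy))                  # finds the TTG-initiated ORF
print(find_small_orfs(toy, starts={"ATG"}))  # an ATG-only scan finds nothing
```

The second call shows why a scan restricted to ATG starts would overlook such an ORF entirely.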


We invite you to try Gaia Agent, share your feedback, and explore new dimensions in protein functional analysis.
