Gaia FAQ — Tatta Bio

Gaia is an embedding-based protein search tool that uses embeddings from a fine-tuned gLM2 model. Given an input protein sequence, Gaia embeds the sequence and searches a database of 85M pre-computed embeddings to retrieve similar sequences and their genomic contexts. Detailed methods can be found in the Gaia manuscript and blog post.
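To illustrate the idea of embedding search, here is a minimal sketch that ranks a toy database against a query vector. It is not Gaia's implementation: the similarity metric (cosine), the brute-force scan, the three-dimensional vectors, and the names `cosine_similarity`, `top_k_matches`, and `toy_database` are all assumptions made for illustration. A database of 85M embeddings would presumably need an indexed or approximate nearest-neighbor search rather than a linear scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k_matches(query_embedding, database, k=3):
    """Return the k (protein_id, score) pairs most similar to the query."""
    scored = [
        (protein_id, cosine_similarity(query_embedding, embedding))
        for protein_id, embedding in database.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical pre-computed embeddings; real embeddings come from the model.
toy_database = {
    "protein_a": [1.0, 0.0, 0.0],
    "protein_b": [0.9, 0.1, 0.0],
    "protein_c": [0.0, 1.0, 0.0],
}
query = [1.0, 0.0, 0.0]
matches = top_k_matches(query, toy_database, k=2)
```

In this toy case `matches` lists `protein_a` first, followed by `protein_b`, because their vectors point in nearly the same direction as the query.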

Gaia searches the OG dataset, a microbial genomic dataset with 85M proteins after clustering at 90% sequence identity. We plan to support the ~10x larger Open MetaGenomic dataset (OMG) soon.

Gaia provides two sources of protein-level annotations. For the input sequence and retrieved matches, domain-level Pfam annotations are generated. For all proteins in the genomic context, functional annotations are generated using a CLIP-like model, which assigns each protein the text annotation of its closest Swiss-Prot representative. The matching Swiss-Prot entry for each protein can be reached using the ‘Annotation’ button for that protein. Details about the implementation of the CLIP-like model and the annotation method can be found in the ‘Functional annotation’ subsection of the Methods in the Gaia manuscript.

Gaia provides two options for data download: 'All retrieved sequences' and 'All retrieved contexts'. Downloading all retrieved sequences provides a single FASTA file containing all retrieved proteins matching the input sequence. Downloading all retrieved contexts provides a ZIP archive of multiple FASTA files, one per genomic context.
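As a rough sketch of how the downloads might be read programmatically, the snippet below parses FASTA text and walks a ZIP archive of FASTA files using only the Python standard library. The file names (`context_1.fasta`, `context_2.fasta`), the header format, and the helper names `parse_fasta` and `read_context_archive` are hypothetical; the actual names and headers in a Gaia download may differ.

```python
import io
import zipfile


def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequences may wrap over several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def read_context_archive(zip_source):
    """Map each file in the archive to its parsed FASTA records."""
    contexts = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            text = archive.read(name).decode("utf-8")
            contexts[name] = parse_fasta(text)
    return contexts


# Build a small in-memory archive to stand in for a real download.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("context_1.fasta", ">p1\nMKT\nAYI\n>p2\nMSE\n")
    archive.writestr("context_2.fasta", ">p3\nMLL\n")
buffer.seek(0)
contexts = read_context_archive(buffer)
```

For the 'All retrieved sequences' download, `parse_fasta` can be applied directly to the contents of the single FASTA file; for a downloaded archive, pass its path to `read_context_archive` in place of the in-memory buffer.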

Predicted structures are generated using ESMFold. Recycling is disabled to increase prediction speed.

To make synteny easy to see, colors are assigned to the five most frequently occurring protein clusters across all retrievals. Clustering is performed using DBSCAN on protein embeddings.
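The coloring step can be sketched as follows, assuming cluster labels have already been computed for every protein across the retrieved contexts. This is an illustration, not Gaia's code: the palette, the default gray, the helper names, and the use of `-1` as a noise label (the convention in scikit-learn's DBSCAN) are assumptions.

```python
from collections import Counter

# Hypothetical palette: one color per highlighted cluster.
PALETTE = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd"]
DEFAULT_COLOR = "#cccccc"  # used for all other clusters and for noise
NOISE_LABEL = -1  # assumed noise label, as in scikit-learn's DBSCAN


def assign_cluster_colors(cluster_labels, palette=PALETTE):
    """Map the most frequent clusters (up to len(palette)) to colors."""
    counts = Counter(
        label for label in cluster_labels if label != NOISE_LABEL
    )
    top_labels = [label for label, _ in counts.most_common(len(palette))]
    return dict(zip(top_labels, palette))


def color_proteins(cluster_labels, palette=PALETTE):
    """Return one color per protein, in the same order as the labels."""
    mapping = assign_cluster_colors(cluster_labels, palette)
    return [mapping.get(label, DEFAULT_COLOR) for label in cluster_labels]


# Toy labels for proteins pooled across all retrieved contexts.
labels = [0, 0, 0, 1, 1, -1, -1, -1, -1, 2, 3, 3, 3, 3, 4, 4, 5, 5, 5, 5, 5, 6]
color_map = assign_cluster_colors(labels)
colors = color_proteins(labels)
```

Here cluster 5 is the most frequent and takes the first palette color, while the rare clusters 2 and 6 and the noise points fall back to the default gray.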

The retrieved Gaia search results are displayed using a UMAP scatterplot. Each point represents a retrieved sequence, and users can click individual points to see protein structure and genomic context.

If you use Gaia in your research, we ask that you cite the following publication: “Jha, N., Kravitz, J., West-Roberts, J., Camargo, A.P., Roux, S., Cornman, A., Hwang, Y. Gaia: A Context-Aware Sequence Search and Discovery Tool for Microbial Proteins (2024).” doi: https://doi.org/10.1101/2024.11.19.624387

Frequently Asked Questions

How does Gaia work?

What database does Gaia search?

How are the annotations generated?

What data can I download from Gaia search?

How are the structures generated?

How are genomic context proteins colored?

What is the interactive scatter plot at the top of the page?

How can I cite Gaia?