Frequently Asked Questions
-
Gaia is an embedding-based protein search tool that uses embeddings from a fine-tuned gLM2 model. Given an input protein sequence, Gaia embeds the sequence and searches it against a database of 85M pre-computed embeddings to retrieve similar sequences and their genomic contexts. Detailed methods can be found in the Gaia manuscript and blog post.
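As a rough illustration of embedding-based retrieval, the sketch below ranks a toy database by similarity to a query embedding. It makes several assumptions that are not taken from the Gaia manuscript: cosine similarity as the metric, a brute-force scan rather than an approximate nearest-neighbor index, and hypothetical protein IDs with made-up 3-dimensional vectors.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k_similar(query, database, k=3):
    """Return the k (protein_id, score) pairs most similar to the query."""
    scored = ((pid, cosine_similarity(query, emb)) for pid, emb in database.items())
    return heapq.nlargest(k, scored, key=lambda item: item[1])


# Toy embeddings (assumption): real embeddings would come from the model
# and have far more dimensions.
database = {
    "protein_A": [1.0, 0.0, 0.0],
    "protein_B": [0.9, 0.1, 0.0],
    "protein_C": [0.0, 1.0, 0.0],
    "protein_D": [0.0, 0.0, 1.0],
}
query_embedding = [1.0, 0.05, 0.0]
hits = top_k_similar(query_embedding, database, k=2)
for protein_id, score in hits:
    print(protein_id, round(score, 4))
```

At the scale of 85M embeddings, a linear scan like this would likely be replaced by an indexed vector search; the ranking idea stays the same.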
-
Gaia searches the OG dataset, a microbial genomic dataset with 85M proteins after clustering at 90% sequence identity. We plan to support the ~10x larger Open MetaGenomic dataset (OMG) soon.
-
Gaia provides two sources of protein-level annotations. For the input sequence and retrieved matches, domain-level Pfam annotations are generated. For all proteins in the genomic context, functional annotations are generated using a CLIP-like model, which provides the text annotation of the closest Swiss-Prot representative. The matching Swiss-Prot entry for each protein can be reached using the ‘Annotation’ button for that protein. Details about the implementation of the CLIP-like model and the annotation method can be found in the ‘Functional annotation’ subsection of the Methods in the Gaia manuscript.
-
Gaia provides two options for data download: 'All retrieved sequences' and 'All retrieved contexts'. Downloading all retrieved sequences provides a single FASTA file containing all retrieved proteins matching the input sequence. Downloading all retrieved contexts provides a ZIP archive of multiple FASTA files, one per genomic context.
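For readers who want to process a contexts download programmatically, here is a minimal sketch that reads every FASTA file in a ZIP archive using only the Python standard library. The file names, headers, and sequences are hypothetical stand-ins; the actual naming inside a Gaia download may differ.

```python
import io
import zipfile


def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def read_contexts(zip_source):
    """Map each file name in the archive to its parsed FASTA records."""
    contexts = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            text = archive.read(name).decode("utf-8")
            contexts[name] = parse_fasta(text)
    return contexts


# Build a small in-memory archive standing in for a real download (assumption).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("context_1.fasta", ">prot1\nMKT\nAYI\n>prot2\nMSE\n")
    archive.writestr("context_2.fasta", ">prot3\nMLV\n")
buffer.seek(0)

contexts = read_contexts(buffer)
for name, records in contexts.items():
    print(name, len(records))
```

To read a file saved on disk, pass its path to `read_contexts` instead of the in-memory buffer. The same `parse_fasta` function also works on the single-file 'All retrieved sequences' download.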
-
Predicted structures are generated using ESMFold. Recycling is disabled to increase prediction speed.
-
To visualize synteny, colors are assigned to the five most frequently occurring protein clusters across all retrieved results. Clustering is performed using DBSCAN on protein embeddings.
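The color assignment step can be sketched as follows, assuming cluster labels have already been computed. The palette, the gray fallback color, and the use of -1 as the label for unclustered proteins (the scikit-learn DBSCAN convention) are assumptions for illustration, not details taken from Gaia.

```python
from collections import Counter

# Hypothetical palette: one color per highlighted cluster.
PALETTE = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd"]
DEFAULT_COLOR = "#cccccc"  # assumption: shared color for all other proteins
NOISE_LABEL = -1  # assumption: label for proteins DBSCAN left unclustered


def assign_cluster_colors(labels, palette=PALETTE):
    """Map the most frequent cluster labels to palette colors."""
    counts = Counter(label for label in labels if label != NOISE_LABEL)
    top = [label for label, _ in counts.most_common(len(palette))]
    return dict(zip(top, palette))


def color_for(label, color_map):
    """Color for one protein; clusters outside the top set get the default."""
    return color_map.get(label, DEFAULT_COLOR)


# Toy labels: cluster 0 occurs 6 times, cluster 1 five times, and so on.
labels = [0] * 6 + [1] * 5 + [2] * 4 + [3] * 3 + [4] * 2 + [5] * 1 + [-1] * 3
color_map = assign_cluster_colors(labels)
colors = [color_for(label, color_map) for label in labels]
```

Because the counts are pooled over all labels, the same cluster receives the same color in every genomic context, which is what makes conserved gene neighborhoods visually comparable.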
-
The retrieved Gaia search results are displayed using a UMAP scatterplot. Each point represents a retrieved sequence, and users can click individual points to see protein structure and genomic context.
-
If you use Gaia in your research, we ask that you cite the following publication: “Jha, N., Kravitz, J., West-Roberts, J., Camargo, A.P., Roux, S., Cornman, A., Hwang, Y., Gaia: A Context-Aware Sequence Search and Discovery Tool for Microbial Proteins (2024)” doi: https://doi.org/10.1101/2024.11.19.624387