Gget
Gget automation and integration for fast genomic data retrieval and analysis
gget is a community skill for querying genomic databases from the command line and Python, covering gene annotation, sequence retrieval, BLAST searches, enrichment analysis, and protein structure prediction for rapid bioinformatics data access.
What Is This?
Overview
gget provides patterns for accessing multiple genomic databases through a unified interface. It covers gene annotation retrieval from Ensembl (descriptions, coordinates, and cross-references), keyword search for mapping gene names to Ensembl IDs, nucleotide and protein sequence fetching by Ensembl ID, BLAST sequence similarity search against NCBI databases, functional enrichment analysis using Enrichr for gene set characterization, and AlphaFold-based structure prediction from an amino acid sequence. The skill enables researchers to query genomic resources with minimal code, replacing manual database browsing with programmatic access. This is particularly valuable when working with large gene lists where manual lookups would be impractical.
Who Should Use This
This skill serves bioinformaticians who need rapid access to gene annotations and sequences, researchers performing quick lookups across multiple genomic databases, and computational biologists integrating database queries into analysis scripts. It is also well suited for data scientists new to genomics who want a consistent, approachable interface without learning each database API separately.
Why Use It?
Problems It Solves
Querying multiple genomic databases requires navigating different web interfaces and API conventions. Retrieving gene sequences by name involves looking up identifiers in Ensembl before fetching from sequence databases. Running BLAST searches programmatically needs handling NCBI API conventions and XML result parsing. Enrichment analysis typically requires uploading gene lists to web tools manually. These friction points slow down exploratory analysis and make pipelines harder to reproduce across different computing environments.
Core Highlights
gget.search finds Ensembl IDs from free-text terms such as gene names. gget.info retrieves gene annotations from Ensembl for one or more Ensembl IDs with a single function call. gget.seq fetches nucleotide sequences, or protein sequences with translate=True, by Ensembl ID. gget.blast runs NCBI BLAST searches and returns structured results. gget.enrichr performs gene set enrichment analysis on gene symbols against curated libraries. gget.alphafold predicts a protein's 3D structure from its amino acid sequence using a simplified AlphaFold2 pipeline; it does not download precomputed models from the AlphaFold database.
How to Use It?
Basic Usage
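import gget

# gget.info and gget.seq expect Ensembl IDs, not gene symbols
# TP53 = ENSG00000141510, BRCA1 = ENSG00000012048
ens_ids = ["ENSG00000141510", "ENSG00000012048"]

info = gget.info(ens_ids)
print(f"Genes found: {len(info)}")
for idx, row in info.iterrows():
    # Column names may differ between gget versions, so fall back gracefully
    print(f"{row.get('primary_gene_name', idx)}: "
          f"{row.get('ensembl_description', '')}")

# translate=True returns amino acid sequences as a FASTA-style list:
# [header, sequence, ...]
seq = gget.seq("ENSG00000141510", translate=True)
print(f"Sequence length: {len(seq[1])}")

# Map a free-text term to Ensembl IDs
results = gget.search(
    "tumor suppressor",
    species="homo_sapiens",
    limit=5)
print(f"Results: {len(results)}")

blast_results = gget.blast(
    "MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLD",
    program="blastp")
print(f"Hits: {len(blast_results)}")

Real-World Examples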
import gget
import pandas as pd


class GeneAnnotator:
    def annotate_gene_list(
            self, ens_ids: list[str]) -> pd.DataFrame:
        # gget.info expects Ensembl IDs
        info = gget.info(ens_ids)
        if info is None or info.empty:
            return pd.DataFrame()
        # Keep only the columns present in the installed gget version
        cols = [c for c in
                ["primary_gene_name", "ensembl_description",
                 "biotype", "seq_region_name"]
                if c in info.columns]
        return info[cols].reset_index()

    def enrich_gene_set(
            self, genes: list[str],
            database: str = "KEGG_2021_Human") -> pd.DataFrame:
        # gget.enrichr expects gene symbols
        result = gget.enrichr(genes, database=database)
        if result is None or result.empty:
            return pd.DataFrame()
        return result.head(10)

    def fetch_sequences(
            self, ens_ids: list[str],
            protein: bool = True) -> dict:
        seqs = {}
        for ens_id in ens_ids:
            # Returns a FASTA-style list: [header, sequence, ...]
            seq = gget.seq(ens_id, translate=protein)
            if seq:
                seqs[ens_id] = seq
        return seqs


annotator = GeneAnnotator()
symbols = ["TP53", "BRCA1", "EGFR"]
ens_ids = ["ENSG00000141510", "ENSG00000012048", "ENSG00000146648"]

annotations = annotator.annotate_gene_list(ens_ids)
print(annotations)

enrichment = annotator.enrich_gene_set(symbols)
# Guard against an empty result before reading the first row
if not enrichment.empty:
    print(f"Top pathway: {enrichment.iloc[0]['path_name']}")

Advanced Tips
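gget.setup installs the extra dependencies that some modules need, such as gget.alphafold; it does not set the Ensembl release. To keep lookups reproducible, pass the release argument where a module supports it (for example gget.ref and gget.search) and record the gget version you ran. Use gget.search to map gene names to Ensembl IDs, then pass those IDs to gget.info and gget.seq to annotate and retrieve sequences for gene lists in a single workflow. Filter BLAST results by e-value and identity thresholds to focus on significant homology hits. When running enrichment analysis, test multiple curated libraries such as GO Biological Process and Reactome to get a broader view of gene set function.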
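The BLAST filtering tip can be sketched as a small helper. This is a minimal illustration, not part of gget: it works on plain dictionaries (for example, rows produced by blast_results.to_dict("records")), and the column names "E value" and "Per. Ident" are assumptions about the result table that should be checked against the columns your gget version returns. The accessions in the example data are made up.

```python
def to_float(value):
    """Convert values such as '98.5%', '1e-30', or 0.5 to float.

    Returns None when the value is missing or not numeric.
    """
    try:
        return float(str(value).strip().rstrip("%"))
    except ValueError:
        return None


def filter_hits(hits, max_evalue=1e-5, min_identity=90.0,
                evalue_key="E value", identity_key="Per. Ident"):
    """Keep hits with e-value <= max_evalue and identity >= min_identity (%).

    The default key names are assumptions; pass your own if they differ.
    """
    kept = []
    for hit in hits:
        evalue = to_float(hit.get(evalue_key))
        identity = to_float(hit.get(identity_key))
        if evalue is None or identity is None:
            continue  # skip rows with missing or unparseable values
        if evalue <= max_evalue and identity >= min_identity:
            kept.append(hit)
    return kept


# Hypothetical rows standing in for blast_results.to_dict("records")
example_hits = [
    {"Accession": "HYPO_1", "E value": "2e-40", "Per. Ident": "99.1%"},
    {"Accession": "HYPO_2", "E value": 0.5, "Per. Ident": 100.0},
    {"Accession": "HYPO_3", "E value": 1e-20, "Per. Ident": "71.3%"},
    {"Accession": "HYPO_4", "E value": None, "Per. Ident": "95%"},
]

strong = filter_hits(example_hits)
print([hit["Accession"] for hit in strong])
```

The thresholds here (1e-5 and 90%) are placeholders; suitable cutoffs depend on sequence length, database size, and how distant the homologs of interest are.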
When to Use It?
Use Cases
Build a gene annotation pipeline that retrieves descriptions and coordinates for differential expression results. Create a sequence retrieval tool that fetches protein sequences for a list of drug targets. Implement an enrichment analysis wrapper that characterizes gene sets from clustering or pathway analysis.
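For the enrichment wrapper use case, one possible shape is a helper that runs the same gene set against several libraries and keeps the top rows from each. The function name enrich_across_libraries and its parameters are hypothetical. It assumes the enrichment callable accepts (genes, database=...) and returns a pandas DataFrame or None, which matches how gget.enrichr is called elsewhere in this document. The library names in the usage comment are assumptions to verify against the current Enrichr library list.

```python
def enrich_across_libraries(genes, libraries, enrich_fn=None, top_n=5):
    """Run one enrichment query per library and collect the top rows of each.

    enrich_fn is any callable taking (genes, database=...) and returning a
    table with a .head() method (such as a pandas DataFrame) or None.
    It defaults to gget.enrichr.
    """
    if enrich_fn is None:
        import gget  # imported lazily so the helper can be exercised offline
        enrich_fn = gget.enrichr

    results = {}
    for library in libraries:
        table = enrich_fn(genes, database=library)
        if table is None or len(table) == 0:
            continue  # no enrichment returned for this library
        results[library] = table.head(top_n)
    return results


# Example call (requires network access and the gget package):
# tables = enrich_across_libraries(
#     ["TP53", "BRCA1", "EGFR"],
#     ["KEGG_2021_Human", "GO_Biological_Process_2021", "Reactome_2022"],
# )
# for library, table in tables.items():
#     print(library, len(table))
```

Passing the enrichment function as a parameter keeps the wrapper testable without network access and makes it easy to swap in a cached or rate-limited variant.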
Related Topics
Genomic databases, Ensembl annotations, BLAST sequence search, gene set enrichment, and bioinformatics data access.
Important Notes
Requirements
Python with the gget package installed. Network access for querying remote databases. Ensembl and NCBI services must be available for annotation and BLAST functions.
Usage Recommendations
Do: specify species explicitly when searching by gene name to avoid ambiguous matches. Cache annotation results for gene lists that you query repeatedly. Use the translate parameter of gget.seq to get protein sequences directly from Ensembl IDs.
Don't: query hundreds of genes individually when batch functions accept gene lists. Assume gene symbols are unique across species without specifying the organism. Ignore BLAST e-values when interpreting similarity search results.
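The caching and batching recommendations can be combined in one small helper. The sketch below is an illustration under stated assumptions, not a gget feature: cached_lookup and its JSON file layout are hypothetical, and fetch_fn stands for any function that takes a list of IDs and returns a dictionary of JSON-serializable records keyed by ID.

```python
import json
from pathlib import Path


def cached_lookup(ids, fetch_fn, cache_path):
    """Return {id: record}, fetching only IDs that are not cached yet.

    fetch_fn takes a list of IDs and returns a dict mapping each ID to a
    JSON-serializable record. It is called at most once per lookup, with
    all missing IDs in a single batch.
    """
    path = Path(cache_path)
    cache = json.loads(path.read_text()) if path.exists() else {}

    missing = [i for i in ids if i not in cache]
    if missing:
        cache.update(fetch_fn(missing))  # one batched call, not one per ID
        path.write_text(json.dumps(cache, indent=2))

    # IDs the fetcher could not resolve are simply left out of the result
    return {i: cache[i] for i in ids if i in cache}
```

With gget, fetch_fn could wrap a single gget.info call on the list of missing Ensembl IDs and convert the resulting DataFrame into a dictionary; that conversion may need adjusting so every value is JSON-serializable. Because cached annotations reflect the Ensembl release that was current when they were fetched, it helps to store the cache file alongside a note of the date or release.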
Limitations
Query results depend on the current Ensembl release and may differ between versions. BLAST searches are subject to NCBI server load and may take several minutes. Some gget modules require additional dependencies that are not included in the base installation. Always pin the Ensembl release version in production workflows to ensure consistent results over time.