Gget

Gget automation and integration for fast genomic data retrieval and analysis

Source: K-Dense-AI/claude-scientific-skills

gget is a community skill for querying genomic databases from the command line and Python, covering gene annotation, sequence retrieval, BLAST searches, enrichment analysis, and protein structure prediction for rapid bioinformatics data access.

What Is This?

Overview

gget provides patterns for accessing multiple genomic databases through a unified interface. It covers gene annotation retrieval by Ensembl ID (descriptions, coordinates, and cross-references), nucleotide and protein sequence fetching by Ensembl ID, free-text gene search to find those IDs, BLAST sequence similarity search against NCBI databases, functional enrichment analysis with Enrichr for gene set characterization, and protein 3D structure prediction from an amino acid sequence with AlphaFold2. The skill lets researchers query genomic resources with minimal code, replacing manual database browsing with programmatic access. This is particularly valuable for large gene lists, where manual lookups would be impractical.

Who Should Use This

This skill serves bioinformaticians who need rapid access to gene annotations and sequences, researchers performing quick lookups across multiple genomic databases, and computational biologists integrating database queries into analysis scripts. It is also well suited for data scientists new to genomics who want a consistent, approachable interface without learning each database API separately.

Why Use It?

Problems It Solves

Querying multiple genomic databases requires navigating different web interfaces and API conventions. Retrieving gene sequences by name involves looking up identifiers in Ensembl before fetching from sequence databases. Running BLAST searches programmatically needs handling NCBI API conventions and XML result parsing. Enrichment analysis typically requires uploading gene lists to web tools manually. These friction points slow down exploratory analysis and make pipelines harder to reproduce across different computing environments.

Core Highlights

gget.info retrieves gene annotations for one or more Ensembl IDs with a single function call. gget.seq fetches nucleotide or amino acid sequences by Ensembl ID. gget.search finds Ensembl IDs from free-text search terms. gget.blast runs NCBI BLAST searches and returns structured results. gget.enrichr performs gene set enrichment analysis against curated libraries. gget.alphafold predicts a protein's 3D structure from its amino acid sequence using a simplified version of AlphaFold2; it does not look up entries by gene name or UniProt identifier.

How to Use It?

Basic Usage

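import gget

# gget.info expects Ensembl IDs, not gene symbols
# (ENSG00000141510 = TP53, ENSG00000012048 = BRCA1).
info = gget.info(["ENSG00000141510", "ENSG00000012048"])
print(f"Genes found: {len(info)}")
for idx, row in info.iterrows():
    # Column names can vary between gget versions,
    # so fall back to the index or an empty string.
    print(f"{row.get('primary_gene_name', idx)}: "
          f"{row.get('ensembl_description', '')}")

# gget.seq also takes Ensembl IDs. It returns a list in
# FASTA order: a header line, then the sequence.
seq = gget.seq("ENSG00000141510", translate=True)
print(f"Sequence length: {len(seq[1])}")

# gget.search maps free-text terms to Ensembl IDs.
results = gget.search(
    "tumor suppressor",
    species="homo_sapiens",
    limit=5)
print(f"Results: {len(results)}")

# BLAST an amino acid sequence against NCBI.
blast_results = gget.blast(
    "MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLD",
    program="blastp")
print(f"Hits: {len(blast_results)}")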

Real-World Examples

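import gget
import pandas as pd

class GeneAnnotator:
    def annotate_gene_list(
            self, ensembl_ids: list[str]
            ) -> pd.DataFrame:
        # gget.info looks genes up by Ensembl ID.
        info = gget.info(ensembl_ids)
        if info is None or info.empty:
            return pd.DataFrame()
        # Keep only the columns this gget version returns.
        cols = [c for c in
                ["primary_gene_name",
                 "ensembl_description",
                 "biotype", "seq_region_name"]
                if c in info.columns]
        return info[cols].reset_index()

    def enrich_gene_set(
            self, genes: list[str],
            database: str
            = "KEGG_2021_Human") -> pd.DataFrame:
        # Enrichr takes gene symbols by default.
        result = gget.enrichr(
            genes, database=database)
        if result is None or result.empty:
            return pd.DataFrame()
        return result.head(10)

    def fetch_sequences(
            self, ensembl_ids: list[str],
            protein: bool = True
            ) -> dict:
        # Maps each Ensembl ID to the FASTA-style list
        # (header, sequence, ...) that gget.seq returns.
        seqs = {}
        for ensembl_id in ensembl_ids:
            seq = gget.seq(ensembl_id,
                           translate=protein)
            if seq:
                seqs[ensembl_id] = seq
        return seqs

annotator = GeneAnnotator()
# Symbols for enrichment, Ensembl IDs for annotation
# (same genes, same order: TP53, BRCA1, EGFR).
symbols = ["TP53", "BRCA1", "EGFR"]
ensembl_ids = ["ENSG00000141510",
               "ENSG00000012048",
               "ENSG00000146648"]
annotations = annotator.annotate_gene_list(ensembl_ids)
print(annotations)
enrichment = annotator.enrich_gene_set(symbols)
# Enrichr results name the term column "path_name".
if not enrichment.empty:
    print(f"Top pathway: {enrichment.iloc[0]['path_name']}")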

Advanced Tips

gget.setup installs the third-party dependencies that some modules, such as gget.alphafold, need; it does not configure the Ensembl release. For reproducible lookups, pass the release argument to the modules that accept one, such as gget.ref. Combine gget.search, gget.info, and gget.seq in a pipeline to find IDs, annotate them, and retrieve sequences for gene lists in a single workflow. Filter BLAST results by e-value and identity thresholds to focus on significant homology hits. When running enrichment analysis, test multiple curated libraries such as GO Biological Process and Reactome to get a broader view of gene set function.
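The BLAST filtering tip can be sketched as below. This is a minimal illustration, not gget's own API: it assumes the hits have been converted to a list of dicts (for example with blast_results.to_dict("records")), and the key names "E value" and "Per. Ident" are assumptions modeled on the NCBI results table, so check them against the columns your gget version returns. The accession labels and numbers are made up.

```python
def parse_percent(value):
    """Convert '98.5%', '98.5', or 98.5 to a float."""
    if isinstance(value, str):
        value = value.strip().rstrip("%")
    return float(value)


def filter_blast_hits(hits, max_evalue=1e-5, min_identity=90.0,
                      evalue_key="E value", identity_key="Per. Ident"):
    """Keep hits with e-value <= max_evalue and identity >= min_identity.

    `hits` is a list of dicts. The default key names are assumptions;
    pass your own if your result columns are named differently.
    """
    kept = []
    for hit in hits:
        evalue = float(hit[evalue_key])
        identity = parse_percent(hit[identity_key])
        if evalue <= max_evalue and identity >= min_identity:
            kept.append(hit)
    return kept


# Made-up hits for illustration only.
example_hits = [
    {"Accession": "hit_A", "E value": 1e-30, "Per. Ident": "99.1%"},
    {"Accession": "hit_B", "E value": 0.5, "Per. Ident": "100%"},
    {"Accession": "hit_C", "E value": 1e-12, "Per. Ident": "45.0%"},
]
significant = filter_blast_hits(example_hits)
print([hit["Accession"] for hit in significant])
```

Only hit_A passes both thresholds here: hit_B has a high identity but an e-value that is not significant, and hit_C is significant but too divergent for a 90% identity cutoff.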

When to Use It?

Use Cases

Build a gene annotation pipeline that retrieves descriptions and coordinates for differential expression results. Create a sequence retrieval tool that fetches protein sequences for a list of drug targets. Implement an enrichment analysis wrapper that characterizes gene sets from clustering or pathway analysis.
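One way to approach the enrichment wrapper use case is sketched below. It assumes gget.enrichr(genes, database=...) as used earlier on this page; the enrich_fn parameter and the fake_enrichr stand-in are hypothetical additions so the wrapper can be exercised offline.

```python
def enrich_across_libraries(genes, libraries, enrich_fn=None):
    """Run one enrichment query per library and collect results by name.

    enrich_fn(genes, database=...) defaults to gget.enrichr. It is
    injectable (an assumption of this sketch) for offline testing.
    """
    if enrich_fn is None:
        import gget  # imported lazily; requires the gget package
        enrich_fn = gget.enrichr
    results = {}
    for library in libraries:
        try:
            results[library] = enrich_fn(genes, database=library)
        except Exception as exc:
            # Record the failure and continue with the other libraries.
            results[library] = None
            print(f"{library}: query failed ({exc})")
    return results


# Offline demonstration with a stand-in for gget.enrichr.
def fake_enrichr(genes, database):
    if database == "broken_library":
        raise ValueError("unknown library")
    return [{"database": database, "n_genes": len(genes)}]


demo = enrich_across_libraries(
    ["TP53", "BRCA1", "EGFR"],
    ["KEGG_2021_Human", "broken_library"],
    enrich_fn=fake_enrichr,
)
print(demo)
```

With gget installed and network access, omitting enrich_fn and passing library names such as "KEGG_2021_Human" or "GO_Biological_Process_2021" should return one result table per library; verify the exact library names against the Enrichr library list before relying on them.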

Important Notes

Requirements

Python with the gget package installed. Network access for querying remote databases. Ensembl and NCBI services must be available for annotation and BLAST functions.

Usage Recommendations

Do: specify species explicitly when searching with gget.search to avoid ambiguous gene name matches. Cache annotation results for gene lists that you query repeatedly. Use the translate parameter of gget.seq to get amino acid sequences directly from Ensembl IDs.

Don't: query hundreds of genes individually when batch functions accept gene lists. Assume gene symbols are unique across species without specifying the organism. Ignore BLAST e-values when interpreting similarity search results.
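The batching and caching recommendations above can be combined in a small helper. This is a sketch under stated assumptions: fetch_batch is a hypothetical callable that you supply (for instance, a thin wrapper that calls gget.info on a list of IDs and converts the result to a JSON-safe dict keyed by ID), and the JSON file layout is an illustrative choice, not something gget provides.

```python
import json
import tempfile
from pathlib import Path


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_with_cache(ids, fetch_batch, cache_path, batch_size=50):
    """Return {id: record}, calling fetch_batch only for uncached IDs.

    fetch_batch(list_of_ids) is a hypothetical callable that must return
    a dict keyed by ID with JSON-serializable values.
    """
    cache_file = Path(cache_path)
    cache = json.loads(cache_file.read_text()) if cache_file.exists() else {}
    # dict.fromkeys removes duplicate IDs while preserving order.
    missing = [i for i in dict.fromkeys(ids) if i not in cache]
    for batch in chunked(missing, batch_size):
        cache.update(fetch_batch(batch))
    if missing:
        cache_file.write_text(json.dumps(cache, indent=2))
    return {i: cache[i] for i in ids if i in cache}


# Offline demonstration: a stand-in fetcher that records each call.
calls = []


def fake_fetch(batch):
    calls.append(list(batch))
    return {i: {"id": i} for i in batch}


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "annotations.json"
    ids = ["ENSG00000141510", "ENSG00000012048", "ENSG00000146648"]
    first = fetch_with_cache(ids, fake_fetch, path, batch_size=2)
    second = fetch_with_cache(ids, fake_fetch, path, batch_size=2)

print(len(calls))
```

The first call fetches the three IDs in two batches; the second call is served entirely from the cache file, so the fetcher is not called again.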

Limitations

Query results depend on the current Ensembl release and may differ between versions. BLAST searches are subject to NCBI server load and may take several minutes. Some gget modules require additional dependencies that are not included in the base installation; gget.setup installs them. In production workflows, pin the gget package version, and the Ensembl release where a module accepts one, to keep results consistent over time.
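Because remote queries can fail transiently under server load, a generic retry wrapper is one possible mitigation. The sketch below is not part of gget: call_with_retries and flaky_query are hypothetical names, and it assumes the wrapped call raises an exception on failure. If a gget function instead logs an error and returns nothing, check the return value as well.

```python
import time


def call_with_retries(func, *args, attempts=3, base_delay=1.0,
                      sleep=time.sleep, **kwargs):
    """Call func(*args, **kwargs), retrying on any exception.

    Waits base_delay, then 2 * base_delay, and so on between attempts,
    and re-raises the last exception once all attempts are used.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func(*args, **kwargs)
        except Exception:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Offline demonstration: a stand-in query that fails twice, then succeeds.
state = {"calls": 0}


def flaky_query(sequence):
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("server busy")
    return f"ok:{len(sequence)}"


delays = []  # record requested delays instead of actually sleeping
result = call_with_retries(flaky_query, "MAFSAEDV", sleep=delays.append)
print(result, delays)
```

In real use the call might look like call_with_retries(gget.blast, sequence, program="blastp", base_delay=30.0), with a delay long enough to be polite to the remote service.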
