ToolUniverse Sequence Retrieval

Tooluniverse Sequence Retrieval automation and integration

Category: productivity
Source: mims-harvard/tooluniverse

ToolUniverse Sequence Retrieval is an AI skill that enables retrieval of biological sequences from comprehensive genomic and proteomic databases. It covers sequence query construction, database selection, result filtering, format conversion, and batch retrieval workflows that streamline access to nucleotide and protein sequence data for bioinformatics analysis.

What Is This?

Overview

ToolUniverse Sequence Retrieval provides structured workflows for accessing biological sequence data from major public databases. It covers six areas: query construction for nucleotide and protein sequence searches; database selection guidance for choosing between NCBI, UniProt, Ensembl, and specialized repositories; result filtering by taxonomy, sequence length, quality metrics, and annotation completeness; format conversion between FASTA, GenBank, EMBL, and other standard sequence formats; batch retrieval for downloading large numbers of sequences efficiently; and metadata extraction, including gene annotations, functional domains, and cross-references.

Who Should Use This

This skill serves bioinformaticians building analysis pipelines that require sequence input data, molecular biologists retrieving reference sequences for experimental design, computational biology students learning to access public sequence databases, and data scientists working with genomic or proteomic datasets.

Why Use It?

Problems It Solves

Public sequence databases contain billions of records across multiple repositories with different query interfaces and data formats. Finding the right sequences requires knowing which database to search, how to construct effective queries, and how to filter results to relevant entries. Manual retrieval of large sequence sets is impractical without programmatic access.

Core Highlights

The skill selects the appropriate database based on the sequence type and analysis goal. Query optimization retrieves relevant results while minimizing false positives. Batch retrieval handles rate limits and pagination for large downloads. Format conversion ensures sequences are compatible with downstream analysis tools.

How to Use It?

Basic Usage

from Bio import Entrez, SeqIO

# NCBI asks every Entrez client to identify itself with a contact email.
Entrez.email = "researcher@university.edu"

def search_and_retrieve(query, database="nucleotide", max_results=10):
    """Search an Entrez database and return the matches as SeqRecord objects."""
    # Step 1: esearch returns the IDs of records matching the query.
    search_handle = Entrez.esearch(
        db=database, term=query, retmax=max_results
    )
    search_results = Entrez.read(search_handle)
    search_handle.close()
    ids = search_results["IdList"]
    print(f"Found {search_results['Count']} results, retrieving {len(ids)}")
    if not ids:
        return []  # Nothing to fetch

    # Step 2: efetch downloads those records as FASTA text.
    fetch_handle = Entrez.efetch(
        db=database, id=ids, rettype="fasta", retmode="text"
    )
    sequences = list(SeqIO.parse(fetch_handle, "fasta"))
    fetch_handle.close()
    return sequences

results = search_and_retrieve(
    "Homo sapiens[Organism] AND insulin[Gene] AND mRNA[Filter]"
)
for seq in results:
    print(f"{seq.id}: {len(seq)} bp - {seq.description[:60]}")
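The query string in the example above is assembled by hand. As a minimal sketch of how such field-tagged queries could be built programmatically, the helper below joins optional terms with AND. The function name build_entrez_query and its parameters are hypothetical and not part of ToolUniverse or Biopython; the field tags are the ones used in the example above.

```python
def build_entrez_query(organism=None, gene=None, filters=()):
    """Join field-tagged terms with AND, skipping any that are not provided."""
    terms = []
    if organism:
        terms.append(f"{organism}[Organism]")
    if gene:
        terms.append(f"{gene}[Gene]")
    # Each filter name is tagged with [Filter], as in the example above.
    terms.extend(f"{name}[Filter]" for name in filters)
    if not terms:
        raise ValueError("at least one search term is required")
    return " AND ".join(terms)


query = build_entrez_query("Homo sapiens", "insulin", filters=["mRNA"])
print(query)
```

The resulting string can be passed as the query argument of search_and_retrieve. Building the query from parts makes it harder to forget the organism or sequence type filter when searches are generated in a loop.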

Real-World Examples

import time
from Bio import Entrez, SeqIO

class BatchSequenceRetriever:
    def __init__(self, email, api_key=None):
        Entrez.email = email
        if api_key:
            Entrez.api_key = api_key
        # NCBI allows 3 requests per second without an API key and 10 with one.
        self.delay = 0.1 if api_key else 0.34

    def batch_retrieve(self, id_list, database, batch_size=200, output_file=None):
        """Fetch FASTA records in batches, optionally writing them to a file."""
        all_sequences = []
        for start in range(0, len(id_list), batch_size):
            batch = id_list[start:start + batch_size]
            handle = Entrez.efetch(
                db=database, id=batch, rettype="fasta", retmode="text"
            )
            sequences = list(SeqIO.parse(handle, "fasta"))
            handle.close()
            all_sequences.extend(sequences)
            print(f"Retrieved {len(all_sequences)}/{len(id_list)} sequences")
            time.sleep(self.delay)  # Respect NCBI rate limit

        if output_file:
            SeqIO.write(all_sequences, output_file, "fasta")
        return all_sequences

    def search_by_taxonomy(self, organism, gene, database="protein"):
        """Return up to 500 IDs for a gene in the given organism."""
        query = f"{organism}[Organism] AND {gene}[Gene Name]"
        handle = Entrez.esearch(db=database, term=query, retmax=500)
        search = Entrez.read(handle)
        handle.close()
        return search["IdList"]

retriever = BatchSequenceRetriever("researcher@university.edu")
ids = retriever.search_by_taxonomy("Mus musculus", "BRCA1")
sequences = retriever.batch_retrieve(ids, "protein", output_file="brca1_mouse.fasta")
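Once a batch has been downloaded, the length-based result filtering mentioned in the overview can be applied locally. The sketch below is one possible way to do it. The function name filter_by_length is hypothetical, and the demo uses stand-in objects instead of downloaded records so that it runs without network access; it assumes only that each record has id and seq attributes, as the records returned by SeqIO.parse do.

```python
from types import SimpleNamespace


def filter_by_length(records, min_length=0, max_length=None):
    """Keep records whose sequence length lies within [min_length, max_length]."""
    kept = []
    for record in records:
        length = len(record.seq)
        if length < min_length:
            continue
        if max_length is not None and length > max_length:
            continue
        kept.append(record)
    return kept


# Stand-in records for illustration only; real input would come from SeqIO.parse.
demo = [
    SimpleNamespace(id="short", seq="MKT"),
    SimpleNamespace(id="medium", seq="MKTAYIAKQR"),
    SimpleNamespace(id="long", seq="M" * 50),
]
kept_ids = [r.id for r in filter_by_length(demo, min_length=5, max_length=20)]
print(kept_ids)
```

Filtering after download keeps the search query simple, at the cost of fetching records that are later discarded. For very large result sets, narrowing the query itself is likely the better choice.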

Advanced Tips

Register for an NCBI API key to raise the rate limit from 3 to 10 requests per second. For queries that return large result sets, use the Entrez history server, which keeps the results on NCBI's side so that thousands of IDs do not have to be passed between calls. Cache retrieved sequences locally to avoid repeated downloads and to respect database usage policies.
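The history server tip can be sketched as follows. The search is run with usehistory="y", and each fetch then refers to the stored result set through the WebEnv and QueryKey values returned by the search, paging with retstart and retmax. The function name iter_history_batches is hypothetical. Passing the Bio.Entrez module in as the entrez argument is an illustrative choice made here so the paging logic can be tried with a stand-in object; it is not a Biopython convention.

```python
def iter_history_batches(entrez, query, database, batch_size=200):
    """Yield FASTA text one batch at a time via the Entrez history server.

    `entrez` is expected to be the Bio.Entrez module (an assumption of this
    sketch); any object with the same esearch/read/efetch calls will work.
    """
    # usehistory="y" asks NCBI to keep the result set on its server.
    handle = entrez.esearch(db=database, term=query, usehistory="y")
    result = entrez.read(handle)
    handle.close()

    total = int(result["Count"])
    for start in range(0, total, batch_size):
        # WebEnv and QueryKey identify the stored result set, so no ID list
        # has to be sent back with each request.
        fetch = entrez.efetch(
            db=database, rettype="fasta", retmode="text",
            retstart=start, retmax=batch_size,
            webenv=result["WebEnv"], query_key=result["QueryKey"],
        )
        try:
            yield fetch.read()
        finally:
            fetch.close()


# Usage with Biopython (requires network access, so it is left commented out):
# from Bio import Entrez
# Entrez.email = "researcher@university.edu"
# query = "Mus musculus[Organism] AND BRCA1[Gene Name]"
# for fasta_text in iter_history_batches(Entrez, query, "protein"):
#     print(fasta_text.count(">"), "records in this batch")
```

Because the function is a generator, only one batch of text is held in memory at a time, which suits result sets too large to collect into a single list.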

When to Use It?

Use Cases

Use ToolUniverse Sequence Retrieval when building analysis pipelines that need reference sequences from public databases, when collecting orthologous sequences across species for comparative genomics, when downloading protein sequences for structural or functional analysis, or when assembling custom sequence databases for BLAST searches.

Related Topics

Biopython for sequence manipulation, BLAST for sequence similarity search, multiple sequence alignment tools, phylogenetic analysis, and genomic data formats all complement sequence retrieval workflows.

Important Notes

Requirements

Biopython library installed for programmatic NCBI access. An email address registered with NCBI for Entrez queries (required by usage policy). Network access to public sequence databases.

Usage Recommendations

Do: respect database rate limits by adding appropriate delays between requests. Include organism and sequence type filters to narrow results to relevant entries. Cache frequently accessed sequences locally to reduce database load.

Don't: send more than 3 requests per second to NCBI without an API key. Download entire databases when only a subset of sequences is needed. Parse sequence records manually when Biopython provides dedicated parsers for the standard sequence formats.
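The caching recommendation above can be illustrated with a small file-based cache. This is a minimal sketch under stated assumptions: the names cached_fetch and fake_download are hypothetical, the accession EXAMPLE_1 is a placeholder, and the downloader is a stand-in so the example runs offline. In practice, fetch_fn would wrap a real download such as an Entrez.efetch call.

```python
import tempfile
from pathlib import Path


def cached_fetch(accession, cache_dir, fetch_fn):
    """Return FASTA text for `accession`, downloading only on a cache miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"{accession}.fasta"
    if cache_file.exists():
        return cache_file.read_text()
    text = fetch_fn(accession)
    cache_file.write_text(text)
    return text


# Stand-in downloader that records how often it is called.
calls = []

def fake_download(accession):
    calls.append(accession)
    return f">{accession}\nACGT\n"


with tempfile.TemporaryDirectory() as tmp:
    first = cached_fetch("EXAMPLE_1", tmp, fake_download)
    second = cached_fetch("EXAMPLE_1", tmp, fake_download)

print(len(calls))  # the second request is served from disk
```

One file per accession keeps the cache easy to inspect and to clear by hand. The sketch never expires entries, so records that are updated upstream would need the cached file to be deleted before they are fetched again.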

Limitations

Public database availability depends on network connectivity and service uptime. Sequence annotations may be incomplete or inconsistent across databases. Large batch downloads can take significant time due to rate limiting requirements. Not all sequences in public databases are experimentally verified, and some may contain errors.