Bioservices

Bioservices automation and integration for seamless access to biological data resources

Category: productivity
Source: K-Dense-AI/claude-scientific-skills

Bioservices is a community skill for programmatic access to biological web services and databases. It covers the Python wrappers that the bioservices package provides around the REST APIs of UniProt, KEGG, ChEMBL, and other biological databases, so that data retrieval in research workflows can be automated.

What Is This?

Overview

Bioservices provides patterns for querying biological databases through Python wrappers that abstract REST API complexity. It covers UniProt protein database queries for sequence and annotation retrieval, KEGG pathway database access for metabolic and signaling pathway data, ChEMBL bioactivity database queries for drug-target interaction data, BioModels access for mathematical models of biological systems, and cross-database identifier mapping between different nomenclature systems. The skill enables researchers to automate data collection from multiple biological databases in unified Python workflows, replacing manual downloads with reproducible programmatic pipelines.

Who Should Use This

This skill serves bioinformatics researchers collecting data from multiple biological databases, drug discovery teams querying bioactivity databases for target analysis, and systems biologists integrating pathway and interaction data from public resources.

Why Use It?

Problems It Solves

Each biological database has its own API conventions, authentication, and response formats that require separate integration code. Manual web browsing for database queries does not scale to thousands of genes or compounds. Cross-referencing identifiers between databases requires mapping tables that are tedious to maintain. Rate limiting and pagination differ across services, complicating batch retrieval. For example, converting a list of gene symbols to UniProt accessions and then retrieving the associated KEGG pathways would otherwise mean writing and maintaining separate client code for each service involved.

Core Highlights

Unified Python classes wrap major biological database APIs behind consistent interfaces. UniProt access retrieves protein sequences, annotations, and cross-references. KEGG queries return pathway maps, gene lists, and compound information. Identifier mapping converts between accession systems across databases.
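As a rough illustration of the identifier mapping mentioned above, the sketch below groups mapping pairs by their source identifier. It assumes the response shape that recent bioservices releases return from UniProt.mapping (a dict with a "results" list of "from"/"to" pairs); group_mapping is a hypothetical helper, not part of bioservices, and the live call is left commented out because it needs network access.

```python
from collections import defaultdict


def group_mapping(response: dict) -> dict[str, list[str]]:
    """Group mapping pairs by source identifier.

    Assumes `response` looks like
    {"results": [{"from": "P43403", "to": "hsa:7535"}, ...]};
    this shape is an assumption about the UniProt.mapping output.
    """
    grouped: dict[str, list[str]] = defaultdict(list)
    for pair in response.get("results", []):
        target = pair["to"]
        # Some target databases may return a nested record instead of a string.
        if isinstance(target, dict):
            target = target.get("primaryAccession", "")
        grouped[pair["from"]].append(target)
    return dict(grouped)


# Live usage (requires network access and the bioservices package):
# from bioservices import UniProt
# u = UniProt(verbose=False)
# response = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query="P43403")
# print(group_mapping(response))
```

Keeping the grouping separate from the network call makes it possible to check the logic against a saved response before running it on a full identifier list.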

How to Use It?

Basic Usage

from bioservices import UniProt, KEGG

u = UniProt(verbose=False)
result = u.search(
    "gene_exact:TP53 AND organism_id:9606",
    frmt="tsv",
    columns="accession,gene_names,length,"
            "organism_name")
print(result[:500])

entry = u.retrieve("P04637", frmt="txt")
print(f"Entry length: {len(entry)}")

k = KEGG(verbose=False)
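pathway = k.get("hsa04110")  # Cell cycle, returned as flat-file text
# Parse the flat file and read its GENE section
genes = k.parse(pathway).get("GENE", {})
print(f"Pathway data length: {len(pathway)}")
print(f"Genes in pathway: {len(genes)}")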

Real-World Examples

from bioservices import UniProt, KEGG

class BioDataCollector:
    def __init__(self):
        self.uniprot = UniProt(verbose=False)
        self.kegg = KEGG(verbose=False)

    def get_protein_info(self, gene: str,
                         organism: int = 9606
                         ) -> dict:
        query = (f"gene_exact:{gene} AND "
                 f"organism_id:{organism}")
        result = self.uniprot.search(
            query, frmt="tsv",
            # go_p is the GO biological process return field
            columns="accession,length,go_p",
            limit=1)
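        # A failed request may not return text; treat it as not found
        if not isinstance(result, str):
            return {"gene": gene, "found": False}
        lines = result.strip().split("\n")
        if len(lines) < 2:
            return {"gene": gene, "found": False}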
        fields = lines[1].split("\t")
        return {"gene": gene, "found": True,
                "accession": fields[0],
                "length": fields[1]}

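    def get_pathways(self, gene: str,
                     organism: str = "hsa") -> list[str]:
        # find() matches free text, so keep only the hit whose
        # symbol list (assumed "SYM1, SYM2; description") has the gene
        hits = self.kegg.find(organism, gene)
        if not isinstance(hits, str):
            return []
        gene_id = None
        for line in hits.strip().split("\n"):
            entry_id, _, desc = line.partition("\t")
            if gene in desc.split(";")[0].split(", "):
                gene_id = entry_id
                break
        if gene_id is None:
            return []
        # Link the KEGG gene ID to the pathways that contain it
        links = self.kegg.link("pathway", gene_id)
        if not isinstance(links, str):
            return []
        return [line.split("\t")[1]
                for line in links.strip().split("\n")
                if "\t" in line][:5]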

    def collect(self, genes: list[str]) -> list[dict]:
        return [self.get_protein_info(g) for g in genes]
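
The tab splitting in get_protein_info can also be handed to the standard csv module, which maps each row to its column headers. This is a minimal sketch on a hand-written sample; parse_tsv is a hypothetical helper and the sample text only imitates the shape of a UniProt TSV response.

```python
import csv
import io


def parse_tsv(text: str) -> list[dict[str, str]]:
    """Parse a tab-separated response with a header row into dicts."""
    reader = csv.DictReader(
        io.StringIO(text),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary data
    )
    return [dict(row) for row in reader]


# Sample text shaped like a TSV response (illustrative values only).
sample = "Entry\tLength\nP04637\t393\n"
rows = parse_tsv(sample)
print(rows[0]["Entry"], rows[0]["Length"])
```

Looking fields up by header name rather than by position keeps the code working if the requested columns are reordered.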

Advanced Tips

Cache query results locally to avoid repeated API calls for the same identifiers during iterative analysis. Storing responses as JSON or TSV files keyed by accession number is a practical approach. Use batch retrieval endpoints when available to minimize the number of HTTP requests for large gene lists. Pass verbose=False when constructing a service in production to suppress bioservices log output.
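
One way to sketch the file cache described above is a small wrapper that stores each response as a JSON file named after its accession. The helper name cached_fetch, the cache layout, and the fetch callable are assumptions made for illustration; accessions are assumed to be safe to use as file names.

```python
import json
from pathlib import Path
from typing import Callable


def cached_fetch(accession: str,
                 fetch: Callable[[str], str],
                 cache_dir: Path) -> str:
    """Return the cached response for `accession`, fetching it on a miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{accession}.json"
    if path.exists():
        # Cache hit: read the stored response instead of calling the API.
        return json.loads(path.read_text(encoding="utf-8"))["response"]
    response = fetch(accession)
    payload = {"accession": accession, "response": response}
    path.write_text(json.dumps(payload), encoding="utf-8")
    return response


# Live usage (requires network access and the bioservices package):
# from bioservices import UniProt
# u = UniProt(verbose=False)
# entry = cached_fetch("P04637",
#                      lambda acc: u.retrieve(acc, frmt="txt"),
#                      Path("uniprot_cache"))
```

Passing the fetch function in as an argument keeps the cache independent of any one database wrapper, so the same helper could sit in front of UniProt or KEGG calls.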

When to Use It?

Use Cases

Build an annotation pipeline that enriches a gene list with protein information from UniProt and pathway data from KEGG. Create a compound screening tool that queries ChEMBL for bioactivity data on drug candidates. Implement an identifier mapping service that converts gene symbols to UniProt accessions for downstream analysis.

Related Topics

Biological database APIs, protein sequence databases, metabolic pathway analysis, drug target identification, and bioinformatics data integration.

Important Notes

Requirements

Python with the bioservices package installed. Network access for querying remote biological databases. Familiarity with biological identifiers and database schemas. Understanding of protein and gene nomenclature conventions improves query accuracy, particularly when distinguishing reviewed Swiss-Prot entries from unreviewed TrEMBL records in UniProt.

Usage Recommendations

Do: cache results from expensive queries to reduce load on public database servers. Use specific query fields rather than broad text searches for accurate results. Handle API errors and timeouts gracefully when querying remote services.
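
Graceful error handling could look something like the retry wrapper below. It is a sketch under one assumption: a failed request shows up as an exception, as None, or as a non-string value such as a numeric HTTP status code. call_with_retry and its parameters are hypothetical names, and the broad exception handler is a simplification for the sketch.

```python
import time
from typing import Callable, Optional


def call_with_retry(func: Callable[[], object],
                    attempts: int = 3,
                    base_delay: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep
                    ) -> Optional[str]:
    """Call `func` until it returns text, backing off between attempts."""
    for attempt in range(attempts):
        try:
            result = func()
        except Exception:
            # Timeouts and connection errors count as a failed attempt.
            result = None
        # Assumption: only a string result counts as a successful response.
        if isinstance(result, str):
            return result
        if attempt < attempts - 1:
            # Exponential backoff: base_delay, then 2x, 4x, ...
            sleep(base_delay * 2 ** attempt)
    return None


# Live usage (requires network access and the bioservices package):
# from bioservices import KEGG
# k = KEGG(verbose=False)
# pathway = call_with_retry(lambda: k.get("hsa04110"))
```

Returning None after the last attempt lets the caller decide whether to skip the identifier or stop the run.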

Don't: submit thousands of queries in rapid succession without rate limiting. Assume that all database entries are complete or current without checking update dates. Parse responses with ad hoc string operations when a structured format such as TSV can be requested and read with a proper parser.
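
A simple client-side rate limit can be sketched as a generator that spaces out calls. The name throttled and the half-second default are illustrative assumptions, not limits published by any of the databases; check each service's usage policy for real values.

```python
import time
from typing import Callable, Iterable, Iterator


def throttled(items: Iterable[str],
              func: Callable[[str], object],
              min_interval: float = 0.5,
              clock: Callable[[], float] = time.monotonic,
              sleep: Callable[[float], None] = time.sleep
              ) -> Iterator[tuple[str, object]]:
    """Yield (item, func(item)), starting calls at least `min_interval` apart."""
    last_call = None
    for item in items:
        if last_call is not None:
            # Wait out whatever remains of the interval since the last call.
            wait = min_interval - (clock() - last_call)
            if wait > 0:
                sleep(wait)
        last_call = clock()
        yield item, func(item)


# Live usage (requires network access and the bioservices package):
# from bioservices import UniProt
# u = UniProt(verbose=False)
# for acc, entry in throttled(["P04637", "P43403"],
#                             lambda a: u.retrieve(a, frmt="txt")):
#     print(acc, len(entry))
```

Because the generator is lazy, results can be processed or cached one at a time while the remaining identifiers wait their turn.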

Limitations

Public database APIs may have rate limits that restrict high-throughput queries. Database content is updated periodically and may not reflect the most recent publications. Some advanced query features require understanding the specific database query syntax. Response formats and field availability differ between services and may change with API updates.