Biopython

Integrate Biopython tools for automated biological data analysis and computational molecular biology

Source: K-Dense-AI/claude-scientific-skills

Biopython is a community skill for performing computational biology tasks using the Biopython library, covering sequence manipulation, BLAST searches, phylogenetic analysis, file parsing, and database access for bioinformatics workflows.

What Is This?

Overview

This skill provides patterns for working with biological sequence data and bioinformatics tools in Python through the Biopython library. It covers DNA, RNA, and protein sequence manipulation, including transcription and translation; BLAST search submission and result parsing for sequence similarity analysis; phylogenetic tree construction and visualization; parsing of biological file formats such as FASTA, GenBank, and PDB; and NCBI database access for retrieving sequences and annotations. The skill helps researchers build reproducible bioinformatics pipelines with standardized data handling across diverse analysis workflows.

Who Should Use This

This skill serves bioinformatics researchers analyzing genomic and protein sequence data, developers building bioinformatics tools and analysis pipelines, and students learning computational biology with Python. It is particularly valuable for those transitioning from manual scripting approaches to structured, library-based workflows.

Why Use It?

Problems It Solves

Parsing biological file formats manually is tedious and error-prone due to format variations across databases and software versions. Sequence operations like reverse complement and translation require correct genetic code tables, which vary by organism and organelle. Submitting BLAST searches and parsing their XML results need specialized handling. Accessing NCBI databases requires understanding their Entrez API conventions and rate limits.
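As a small illustration of why the genetic code table matters, the sketch below translates the same fragment with two NCBI tables. The 12-base fragment is made up for this example; `table=1` is the Standard code and `table=2` is the Vertebrate Mitochondrial code.

```python
from Bio.Seq import Seq

# Hypothetical fragment chosen so the two tables disagree on three codons.
fragment = Seq("ATGTGAATAAGA")

# NCBI table 1 (Standard): TGA is a stop, ATA is Ile, AGA is Arg.
standard = fragment.translate(table=1)

# NCBI table 2 (Vertebrate Mitochondrial): TGA is Trp, ATA is Met, AGA is a stop.
mitochondrial = fragment.translate(table=2)

print(standard)       # M*IR
print(mitochondrial)  # MWM*
```

Passing the table explicitly, rather than relying on the default, keeps the choice visible when a pipeline mixes nuclear and organelle sequences.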

Core Highlights

Seq objects provide methods for transcription, translation, and complement operations on biological sequences. SeqIO parses and writes over 20 biological file formats through a unified interface. The BLAST wrappers submit searches to NCBI and parse results into structured objects. The Entrez module provides programmatic access to NCBI databases and throttles requests to stay within NCBI's rate limits.

How to Use It?

Basic Usage

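from Bio.Seq import Seq
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

# A short coding-strand DNA sequence (length is a multiple of three).
dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
print(f"Length: {len(dna)}")
print(f"Complement: {dna.complement()}")
print(f"Rev comp: {dna.reverse_complement()}")

# transcribe() replaces T with U; translate() uses the standard
# genetic code by default and marks stop codons with "*".
mrna = dna.transcribe()
protein = dna.translate()
print(f"mRNA: {mrna}")
print(f"Protein: {protein}")

# Wrap the sequence with an identifier and description so it can be written to a file.
record = SeqRecord(
    dna,
    id="gene_001",
    name="example_gene",
    description="Sample coding sequence")

with open("output.fasta", "w") as f:
    SeqIO.write([record], f, "fasta")

# SeqIO.parse returns an iterator; list() loads every record into memory.
records = list(SeqIO.parse("output.fasta", "fasta"))
print(f"Records: {len(records)}")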

Real-World Examples

from Bio import Entrez, SeqIO
from Bio.Blast import NCBIWWW, NCBIXML

# NCBI asks for a contact address on every Entrez request.
Entrez.email = "researcher@example.com"

def fetch_sequence(accession: str) -> dict:
    """Fetch one GenBank record and return a short summary."""
    handle = Entrez.efetch(
        db="nucleotide",
        id=accession,
        rettype="gb",
        retmode="text")
    try:
        record = SeqIO.read(handle, "genbank")
    finally:
        # Close the connection even if parsing fails.
        handle.close()
    return {"id": record.id,
            "name": record.name,
            "length": len(record.seq),
            "features": len(record.features)}

def run_blast(sequence: str,
              program: str = "blastn",
              database: str = "nt",
              max_hits: int = 10) -> list[dict]:
    """Run a remote BLAST search and summarize each hit."""
    # qblast blocks until NCBI returns the XML result.
    result = NCBIWWW.qblast(
        program, database, sequence,
        hitlist_size=max_hits)
    hits = []
    try:
        for record in NCBIXML.parse(result):
            for alignment in record.alignments[:max_hits]:
                hits.append({
                    "title": alignment.title[:80],
                    "length": alignment.length,
                    # E-value of the first HSP reported for this hit.
                    "e_value": alignment.hsps[0].expect})
    finally:
        result.close()
    return hits

Advanced Tips

Use SeqIO.index on large FASTA files to get dictionary-like access without loading all sequences into memory, which is essential when working with reference genomes or large metagenomic datasets. Set Entrez.email before making any NCBI requests to comply with their usage policy. Batch Entrez queries using epost and efetch to retrieve multiple records efficiently rather than issuing one request per accession.
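The indexed-access tip can be sketched as follows. This is a minimal example that assumes a throwaway two-record FASTA file written to a temporary directory; the file name and record IDs are made up, and a real reference genome would simply take the place of `fasta_path`.

```python
import os
import tempfile

from Bio import SeqIO

# Tiny stand-in for a large FASTA file; the names and sequences are invented.
fasta_text = ">seq1 first record\nATGC\n>seq2 second record\nGGCCTTAA\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    fasta_path = os.path.join(tmp_dir, "demo.fasta")
    with open(fasta_path, "w") as handle:
        handle.write(fasta_text)

    # SeqIO.index scans the file once and remembers where each entry starts;
    # a record is parsed only when it is looked up by its ID.
    index = SeqIO.index(fasta_path, "fasta")
    try:
        record_ids = sorted(index.keys())
        seq2_length = len(index["seq2"].seq)
    finally:
        index.close()  # release the underlying file handle

print(record_ids, seq2_length)
```

The index holds only file offsets, so memory use grows with the number of records rather than with total sequence length.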

When to Use It?

Use Cases

Build a sequence annotation pipeline that fetches GenBank records and extracts feature information for a gene list. Create a BLAST search tool that identifies similar sequences in public databases for newly sequenced genes. Implement a format converter that reads sequences from GenBank and writes them to FASTA for downstream tools.
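For the annotation-pipeline use case, one possible shape is to post the whole identifier list to the Entrez history server once and then fetch GenBank records in batches. The sketch below is an illustration under assumptions, not a definitive implementation: `fetch_feature_counts` and `batch_size` are hypothetical names, the email is a placeholder, and the identifiers are assumed to be ones that `epost` accepts for the nucleotide database.

```python
from Bio import Entrez, SeqIO

Entrez.email = "researcher@example.com"  # placeholder contact address


def fetch_feature_counts(ids: list[str], batch_size: int = 100) -> dict[str, int]:
    """Return {record id: number of features} for the given identifiers."""
    if not ids:
        return {}  # nothing to fetch, so make no network request

    # Upload the identifier list once; NCBI returns a history-server reference.
    post = Entrez.read(Entrez.epost(db="nucleotide", id=",".join(ids)))

    counts: dict[str, int] = {}
    for start in range(0, len(ids), batch_size):
        # Each efetch call pulls one slice of the posted list.
        handle = Entrez.efetch(
            db="nucleotide",
            rettype="gb",
            retmode="text",
            retstart=start,
            retmax=batch_size,
            webenv=post["WebEnv"],
            query_key=post["QueryKey"])
        try:
            for record in SeqIO.parse(handle, "genbank"):
                counts[record.id] = len(record.features)
        finally:
            handle.close()
    return counts
```

Calling `fetch_feature_counts(gene_ids)` with a real list requires network access; an empty list returns an empty dictionary without contacting NCBI.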

Important Notes

Requirements

Python with the biopython package installed. An email address for NCBI Entrez API compliance. Network access for remote BLAST and database queries. NumPy is a required dependency of Biopython and is normally installed automatically alongside it; it backs numerical features such as alignment substitution matrices.

Usage Recommendations

Do: set Entrez.email before any NCBI access to follow their usage requirements. Use SeqIO for all file parsing rather than writing custom parsers. Handle network errors gracefully when making remote BLAST or database requests, including implementing retry logic with appropriate delays for transient failures.

Don't: submit large numbers of BLAST queries without respecting NCBI rate limits. Parse biological files with string splitting when SeqIO handles the format correctly. Store entire genomes in memory as strings when SeqIO.index provides random access.
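The retry advice above could look something like the following. This is a generic sketch rather than anything Biopython ships: `with_retries`, its parameters, and the exponential delay schedule are illustrative choices. It retries on `OSError` by default, which also covers `urllib.error.URLError` and `HTTPError`.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(call: Callable[[], T],
                 attempts: int = 3,
                 base_delay: float = 2.0,
                 retry_on: tuple[type[BaseException], ...] = (OSError,),
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Run call(), retrying on the given exceptions with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

A remote call can then be wrapped as `with_retries(lambda: fetch_sequence(accession))`, where `accession` is whatever identifier the caller already has. The `sleep` parameter exists so the delay can be replaced in tests.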

Limitations

Remote BLAST searches are subject to NCBI server load and may take minutes to complete. Some niche file formats may not be fully supported by the current SeqIO parsers. Pairwise and multiple sequence alignment require additional configuration for scoring matrices. Large-scale genomic analyses may require specialized tools beyond what Biopython provides.
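To show what the scoring-matrix configuration for pairwise alignment involves, here is a minimal sketch using `Bio.Align.PairwiseAligner` with the BLOSUM62 matrix bundled with Biopython. The gap penalties and the two short peptides are arbitrary values picked for illustration, not recommended settings.

```python
from Bio.Align import PairwiseAligner, substitution_matrices

aligner = PairwiseAligner()
aligner.mode = "global"
# Load one of the substitution matrices bundled with Biopython.
aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
# Illustrative affine gap penalties; tune these for real data.
aligner.open_gap_score = -10.0
aligner.extend_gap_score = -0.5

query = "MAIVMGR"    # made-up peptide
variant = "MAIVKGR"  # same peptide with one substitution (M to K)

identical_score = aligner.score(query, query)
variant_score = aligner.score(query, variant)
print(identical_score, variant_score)
```

`aligner.score` returns only the optimal score; `aligner.align(query, variant)` would return the alignments themselves when they are needed.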
