UniProt

Query and retrieve protein data from UniProt knowledge base and API

Source: adaptyvbio/protein-design-skills

The UniProt skill queries and retrieves protein data from the UniProt Knowledgebase and its API, covering sequence retrieval, protein annotation, taxonomy information, and cross-reference lookups.

What Is This?

Overview

UniProt is a comprehensive protein database that provides access to hundreds of millions of protein sequences, functional annotations, and biological metadata. This skill enables developers to programmatically query the UniProt API to retrieve protein information, search by various identifiers, and integrate protein data into applications and workflows. The skill simplifies interaction with UniProt's REST API by handling request formatting and response parsing automatically. UniProt's public REST API does not require authentication for read access.

UniProt serves as the primary reference for protein sequences and functional information across the scientific community. By using this skill, you can access curated protein data without manually navigating web interfaces or managing complex API calls. It's particularly valuable for bioinformatics pipelines, protein analysis tools, and research applications that need reliable protein information. The UniProt database is updated regularly, ensuring access to the latest protein sequences and annotations. It integrates data from multiple sources, including literature curation and computational analysis, to provide high-quality, standardized information.

UniProt offers two main data sets: UniProtKB/Swiss-Prot, which contains manually reviewed entries, and UniProtKB/TrEMBL, which includes computationally annotated records. This distinction allows users to choose between highly curated data and broader, automatically generated content depending on their needs. The API also supports advanced search features, such as filtering by taxonomy, protein existence evidence, and annotation score.
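
The filters described above can be sketched directly against UniProt's public REST search endpoint, without the skill's wrapper. This is a minimal sketch: the helper name build_search_url is hypothetical, and the query fields used (organism_id, reviewed, existence) and return field names are assumptions based on UniProt's current query syntax that should be checked against the API documentation.

```python
from urllib.parse import urlencode

# Base URL of UniProt's public search endpoint for UniProtKB entries.
SEARCH_URL = "https://rest.uniprot.org/uniprotkb/search"


def build_search_url(taxon_id, reviewed=True, existence=None, fields=None, size=25):
    """Build a UniProtKB search URL restricted by taxonomy and curation level.

    Hypothetical helper: it only assembles the URL and sends no request.
    """
    clauses = [f"organism_id:{taxon_id}", f"reviewed:{str(reviewed).lower()}"]
    if existence is not None:
        # Assumed field: protein existence level, 1 = evidence at protein level.
        clauses.append(f"existence:{existence}")
    params = {"query": " AND ".join(clauses), "format": "json", "size": size}
    if fields:
        # Restrict the returned columns to keep responses small.
        params["fields"] = ",".join(fields)
    return f"{SEARCH_URL}?{urlencode(params)}"


# 9606 is the NCBI taxonomy ID for Homo sapiens.
url = build_search_url(
    9606,
    reviewed=True,
    existence=1,
    fields=["accession", "gene_names", "annotation_score"],
)
print(url)
```

Setting reviewed=False would target TrEMBL entries instead, which is the broader but automatically annotated section.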

Who Should Use This

Bioinformaticians, protein researchers, drug discovery teams, and developers building life sciences applications should use this skill to access standardized protein data efficiently. It is also valuable for educators and students in molecular biology, computational biology, and related fields who require reliable protein information for teaching, learning, or small-scale research projects.

Why Use It?

Problems It Solves

Manually searching UniProt through web interfaces is time-consuming and does not scale to large datasets. This skill eliminates repetitive queries, provides programmatic access to protein annotations, and ensures consistent data retrieval across multiple analyses. It reduces development time for bioinformatics applications by abstracting away API complexity. Automated access to UniProt data also minimizes human error and supports reproducible research by enabling standardized data extraction.

Core Highlights

UniProt provides access to over 200 million protein sequences from thousands of organisms. The skill supports multiple search methods including accession numbers, gene names, and sequence similarity queries. You can retrieve detailed protein annotations including function, location, modifications, and disease associations. Cross-reference lookups connect proteins to other databases like PDB, Ensembl, and DrugBank. The API also allows retrieval of protein isoforms, post-translational modifications, and links to literature references, making it a central resource for protein-centric research.
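
As an illustration of a cross-reference lookup, the sketch below groups the cross-references of one entry by target database. It assumes the entry has already been downloaded as JSON and parsed into a dict, and that cross-references live under the uniProtKBCrossReferences key with database and id fields, as in UniProt's current JSON format. The function name and the sample entry are hypothetical, and the IDs are placeholders rather than real records.

```python
from collections import defaultdict


def cross_references(entry, databases=None):
    """Group cross-reference IDs of a parsed UniProtKB JSON entry by database."""
    grouped = defaultdict(list)
    for xref in entry.get("uniProtKBCrossReferences", []):
        name = xref.get("database")
        # Keep everything when no database filter is given.
        if databases is None or name in databases:
            grouped[name].append(xref.get("id"))
    return dict(grouped)


# Hand-written stand-in for a parsed JSON entry; all IDs are placeholders.
entry = {
    "primaryAccession": "X00000",
    "uniProtKBCrossReferences": [
        {"database": "PDB", "id": "0XXX"},
        {"database": "PDB", "id": "0YYY"},
        {"database": "Ensembl", "id": "ENST_PLACEHOLDER"},
        {"database": "DrugBank", "id": "DB_PLACEHOLDER"},
    ],
}
print(cross_references(entry, databases={"PDB", "DrugBank"}))
```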

How to Use It?

Basic Usage

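# UniProtAPI is the client class provided by the skill's uniprot package
from uniprot import UniProtAPI

api = UniProtAPI()

# Fetch a single entry by its UniProtKB accession
protein = api.get_protein("P12345")
print(protein.sequence)
print(protein.organism)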
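
For comparison, a single entry can also be fetched from UniProt's public REST API using only the standard library. This is a rough sketch and not the skill's implementation: the helper names are hypothetical, and the sample record is made up so that the parsing step runs offline.

```python
from urllib.request import urlopen


def entry_fasta_url(accession):
    """Return the REST URL for one UniProtKB entry in FASTA format."""
    return f"https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def fetch_fasta(accession, timeout=30):
    """Download the FASTA text for an accession (needs network access)."""
    with urlopen(entry_fasta_url(accession), timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_fasta(text):
    """Split single-record FASTA text into (header, sequence)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("not FASTA text")
    # The header drops its leading '>'; sequence lines are joined together.
    return lines[0][1:], "".join(lines[1:])


# Made-up record in UniProt's header layout, used so the demo runs offline.
sample = ">sp|X00000|DEMO_HUMAN Demo protein OS=Homo sapiens OX=9606\nMKTAYIAK\nQRQISFVK\n"
header, sequence = parse_fasta(sample)
print(header.split("|")[1], len(sequence))
```

With network access, parse_fasta(fetch_fasta("P12345")) would return the header and sequence of the entry used in the example above.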

Real-World Examples

Retrieve all proteins from a specific organism and filter by keyword:

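# Assumes the query string is passed through to UniProt's search syntax,
# where organism_id:9606 selects human (NCBI taxonomy ID 9606)
results = api.search("organism_id:9606 AND keyword:kinase")
for protein in results:
    print(f"{protein.name}: {protein.accession}")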

Search for proteins by sequence similarity and get functional annotations:

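# Placeholder amino-acid sequence; replace it with your own query
query_seq = "MKTAYIAKQRQISFVKSHFSRQ"

matches = api.search_sequence(query_seq, threshold=0.8)
for match in matches:
    print(f"Function: {match.function}")
    print(f"Location: {match.subcellular_location}")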

You can also retrieve protein features such as domains, active sites, and binding regions, or extract protein-protein interaction data for systems biology analyses.
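
Feature retrieval can be sketched as a filter over the features list of a parsed JSON entry. The layout assumed here (a type string plus location.start.value and location.end.value) follows UniProt's current JSON format as far as known and should be verified against a real response. The function name and the sample entry are hypothetical.

```python
def features_of_type(entry, wanted):
    """Return (type, start, end, description) tuples for selected feature types."""
    rows = []
    for feature in entry.get("features", []):
        if feature.get("type") not in wanted:
            continue
        location = feature.get("location", {})
        start = location.get("start", {}).get("value")
        end = location.get("end", {}).get("value")
        rows.append((feature["type"], start, end, feature.get("description", "")))
    return rows


# Hand-written sample entry; positions and descriptions are illustrative only.
entry = {
    "features": [
        {"type": "Chain", "description": "Demo protein",
         "location": {"start": {"value": 1}, "end": {"value": 300}}},
        {"type": "Domain", "description": "Protein kinase",
         "location": {"start": {"value": 10}, "end": {"value": 120}}},
        {"type": "Active site", "description": "Proton acceptor",
         "location": {"start": {"value": 55}, "end": {"value": 55}}},
    ]
}
for row in features_of_type(entry, {"Domain", "Active site", "Binding site"}):
    print(row)
```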

Advanced Tips

Use batch queries to retrieve multiple proteins efficiently rather than making individual API calls for each protein. Filter results by reviewed status to prioritize manually curated UniProtKB/Swiss-Prot entries over automatically annotated UniProtKB/TrEMBL entries. Leverage the API's pagination and field selection features to optimize data transfer and processing speed. For large-scale analyses, consider integrating UniProt data with local databases or cloud storage solutions.
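
Batching and pagination could look roughly like the sketch below when working against the REST API directly. It assumes that UniProt exposes a /uniprotkb/accessions endpoint accepting a comma-separated accessions parameter and a fields parameter, and that paginated search responses carry an HTTP Link header with rel="next". Both should be checked against the current API documentation. All helper names are hypothetical, and the cursor value in the demo is a placeholder.

```python
import re

ACCESSIONS_URL = "https://rest.uniprot.org/uniprotkb/accessions"


def batched(items, size):
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def batch_urls(accessions, size=100, fields=("accession", "sequence")):
    """Build one URL per chunk instead of one request per accession."""
    return [
        f"{ACCESSIONS_URL}?accessions={','.join(chunk)}"
        f"&fields={','.join(fields)}&format=json"
        for chunk in batched(accessions, size)
    ]


def next_page_url(link_header):
    """Extract the rel="next" URL from an HTTP Link header, or None."""
    if not link_header:
        return None
    match = re.search(r'<([^>]+)>;\s*rel="next"', link_header)
    return match.group(1) if match else None


urls = batch_urls(["P12345", "P05067", "P69905"], size=2)
print(len(urls))

# Example header shape; the cursor value is a placeholder.
link = '<https://rest.uniprot.org/uniprotkb/search?query=reviewed%3Atrue&cursor=abc&size=500>; rel="next"'
print(next_page_url(link))
```

A download loop would request the first page, read its Link header, and keep following next_page_url until it returns None.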

When to Use It?

Use Cases

Drug target identification requires searching for proteins associated with specific diseases and pathways using UniProt's disease annotations. Protein engineering projects need sequence data and structural information from UniProt to design variants and predict properties. Comparative genomics studies compare protein sequences across species to identify orthologs and understand evolutionary relationships. Systems biology modeling integrates protein interaction data and functional annotations from UniProt into network analysis tools. Additionally, UniProt data supports proteomics workflows, variant effect prediction, and the annotation of novel sequences from sequencing projects.

Important Notes

The skill streamlines access to UniProt data, but reliable results depend on correct configuration, staying within API usage limits, and knowing whether an entry is reviewed (UniProtKB/Swiss-Prot) or unreviewed (UniProtKB/TrEMBL). Keep UniProt's update cycle in mind as well, since annotations can change between releases.

Requirements

Python 3.7 or higher runtime environment
Internet access for API connectivity
Installation of the uniprot skill package and dependencies
(Optional) API key for advanced or high-throughput access, if UniProt's authentication policies require one; at the time of writing, the public REST API does not

Usage Recommendations

Prioritize UniProtKB/Swiss-Prot entries for critical applications requiring curated data
Use batch queries and pagination to handle large datasets efficiently
Regularly check for updates to the skill and UniProt API to ensure compatibility
Validate organism and identifier formats to avoid mismatches in search results
Review API documentation for changes in endpoints or data structure
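
Identifier validation can be done before any request is sent. The sketch below uses the accession pattern published in UniProt's help pages; the function names are hypothetical. It checks only the format, so a well-formed accession may still not exist in the database.

```python
import re

# Accession pattern as published in UniProt's documentation.
ACCESSION_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def is_valid_accession(text):
    """Return True if text has the shape of a UniProtKB accession."""
    return ACCESSION_RE.fullmatch(text) is not None


def split_valid(identifiers):
    """Partition identifiers into (valid, invalid) lists, preserving order."""
    valid = [item for item in identifiers if is_valid_accession(item)]
    invalid = [item for item in identifiers if not is_valid_accession(item)]
    return valid, invalid


valid, invalid = split_valid(["P12345", "A0A024R161", "p12345", "P1234"])
print(valid)
print(invalid)
```

The check is deliberately case-sensitive; normalizing input with strip() and upper() first is a reasonable choice if identifiers come from user input.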

Limitations

Does not provide direct access to raw protein structure files; use PDB for 3D data
Some annotations, especially in UniProtKB/TrEMBL, are computational predictions and may lack manual review
API rate limits may restrict very high-frequency or large-scale queries
Annotations may lag behind the latest experimental findings due to curation and release cycles
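
One common way to cope with rate limits is to retry failed requests with exponential backoff. The sketch below is a generic pattern and not part of the skill: call_with_backoff is a hypothetical helper, and the failing fetch function only simulates a rate-limited server so the example runs offline.

```python
import time


def call_with_backoff(fetch, retries=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch() and retry with exponentially growing delays if it raises.

    OSError is caught because urllib's URLError and HTTPError derive from it.
    """
    for attempt in range(retries + 1):
        try:
            return fetch()
        except OSError:
            if attempt == retries:
                raise  # Give up after the final attempt.
            sleep(base_delay * 2 ** attempt)


# Simulated request that fails twice before succeeding.
attempts = {"count": 0}


def flaky_fetch():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise OSError("simulated HTTP 429")
    return "ok"


delays = []
# Record the delays instead of sleeping so the demo finishes immediately.
result = call_with_backoff(flaky_fetch, sleep=delays.append)
print(result, delays)
```

Retrying on every OSError is crude; a production version would likely retry only on rate-limit or server errors and honor any Retry-After header the server sends.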
