Pysam
Specialized Pysam automation and integration for high-throughput genomic sequence analysis
PySAM is a community skill for reading and manipulating genomic alignment data using the pysam Python library, covering BAM/SAM file access, variant calling data retrieval, pileup generation, region queries, and index management for bioinformatics analysis.
What Is This?
Overview
PySAM provides tools for working with high-throughput sequencing data stored in standard bioinformatics file formats. It covers BAM/SAM file access that reads aligned sequence reads with mapping quality and CIGAR information, variant calling data retrieval that parses VCF and BCF files for genomic variants, pileup generation that stacks aligned reads at genomic positions for coverage and allele frequency calculation, region queries that efficiently extract reads and variants from specific genomic intervals using index files, and index management that creates and validates BAI and tabix indexes for random access. The skill enables bioinformaticians to process sequencing data efficiently in Python.
Who Should Use This
This skill serves bioinformaticians processing next-generation sequencing alignments, genomics researchers extracting read information from specific genomic regions, and pipeline developers building variant analysis workflows in Python.
Why Use It?
Problems It Solves
BAM files are compressed binary formats that cannot be read directly without specialized parsing libraries. Extracting reads from specific genomic regions requires index-based random access rather than scanning the entire file. Calculating coverage and allele frequencies at specific positions needs pileup operations across aligned reads. Parsing VCF variant files with complex INFO and FORMAT fields requires format-aware tools.
Core Highlights
Alignment reader parses BAM and SAM files with full access to read properties. Region fetcher extracts reads from specific genomic intervals using BAI indexes. Pileup generator stacks reads at positions for coverage analysis. Variant parser reads VCF and BCF files with structured access to variant fields.
How to Use It?
Basic Usage
import pysam
bam = pysam\
.AlignmentFile(
'aligned.bam',
'rb')
print(
f'References: '
f'{bam.nreferences}')
print(
f'Mapped: '
f'{bam.mapped}')
for read in bam.fetch(
'chr1',
1000000,
1001000
):
print(
f'{read.query_name}'
f' | MAPQ: '
f'{read'
f'.mapping_quality}'
f' | CIGAR: '
f'{read'
f'.cigarstring}')
for col in (
bam.pileup(
'chr1',
1000500,
1000501)
):
print(
f'Pos: {col.pos}'
f' Depth: '
f'{col.nsegments}')
bam.close()Real-World Examples
import pysam
import numpy as np
class CoverageAnalyzer:
def __init__(
self,
bam_path: str
):
self.bam = (
pysam.AlignmentFile(
bam_path, 'rb'))
def region_coverage(
self,
chrom: str,
start: int,
end: int
) -> dict:
depths = []
for col in (
self.bam.pileup(
chrom,
start, end,
truncate=True)
):
depths.append(
col.nsegments)
if not depths:
return {
'mean': 0,
'median': 0,
'min': 0,
'max': 0}
arr = np.array(
depths)
return {
'mean': float(
arr.mean()),
'median': float(
np.median(arr)),
'min': int(
arr.min()),
'max': int(
arr.max())}
def count_reads(
self,
chrom: str,
start: int,
end: int,
min_mapq: int = 20
) -> int:
count = 0
for read in (
self.bam.fetch(
chrom,
start, end)
):
if (read
.mapping_quality
>= min_mapq):
count += 1
return count
def close(self):
self.bam.close()Advanced Tips
Use the fetch method with until_eof=True to iterate all reads including unmapped ones when calculating global alignment statistics. Filter reads by flag values to exclude duplicates, secondary alignments, and supplementary alignments from analysis. Use count_coverage for fast base-level coverage arrays without iterating individual pileup columns.
When to Use It?
Use Cases
Calculate sequencing coverage depth across target regions for quality control reporting. Extract aligned reads overlapping specific genes or variants for detailed inspection. Parse VCF files to filter variants by quality score and allele frequency thresholds.
Related Topics
pysam, BAM, SAM, VCF, genomics, bioinformatics, sequencing analysis, and htslib.
Important Notes
Requirements
pysam Python package with htslib C library bindings. Indexed BAM files with corresponding BAI index for region-based queries. Reference genome FASTA file for operations that require sequence context.
Usage Recommendations
Do: ensure BAM files are sorted and indexed before attempting region queries since pysam requires index files for random access. Filter reads by mapping quality and alignment flags to exclude low-quality data from analysis. Use context managers or explicitly close file handles to prevent resource leaks with large BAM files.
Don't: iterate all reads in a BAM file when only specific regions are needed since region queries with fetch are orders of magnitude faster. Modify alignment records without understanding how changes affect downstream tools. Assume read coordinates are zero-based or one-based without checking since pysam uses zero-based half-open intervals.
Limitations
pysam requires compiled C extensions that can be difficult to install on some platforms. Large BAM files with billions of reads consume significant memory during full-file iteration. Some advanced htslib features are not exposed through the Python interface.
More Skills You Might Like
Explore similar skills to enhance your workflow
Text Optimizer
Enhance and optimize text content using intelligent automation and integration tools
Linkhut Automation
Automate Linkhut operations through Composio's Linkhut toolkit via Rube
Close Automation
Automate Close CRM tasks via Rube MCP (Composio): create leads, manage calls/SMS, handle tasks, and track notes. Always search tools first for current
Obsidian Bases
Automate and integrate Obsidian Bases for streamlined knowledge management workflows
Google Maps Automation
Geocode addresses, search places, get directions, compute distance
Ignisign Automation
Automate Ignisign operations through Composio's Ignisign toolkit via