Pysam

Specialized Pysam automation and integration for high-throughput genomic sequence analysis

PySAM is a community skill for reading and manipulating genomic alignment data using the pysam Python library, covering BAM/SAM file access, variant calling data retrieval, pileup generation, region queries, and index management for bioinformatics analysis.

What Is This?

Overview

PySAM provides tools for working with high-throughput sequencing data stored in standard bioinformatics file formats. It covers BAM/SAM file access that reads aligned sequence reads with mapping quality and CIGAR information, variant calling data retrieval that parses VCF and BCF files for genomic variants, pileup generation that stacks aligned reads at genomic positions for coverage and allele frequency calculation, region queries that efficiently extract reads and variants from specific genomic intervals using index files, and index management that creates and validates BAI and tabix indexes for random access. The skill enables bioinformaticians to process sequencing data efficiently in Python.

Who Should Use This

This skill serves bioinformaticians processing next-generation sequencing alignments, genomics researchers extracting read information from specific genomic regions, and pipeline developers building variant analysis workflows in Python.

Why Use It?

Problems It Solves

BAM files are compressed binary formats that cannot be read directly without specialized parsing libraries. Extracting reads from specific genomic regions requires index-based random access rather than scanning the entire file. Calculating coverage and allele frequencies at specific positions needs pileup operations across aligned reads. Parsing VCF variant files with complex INFO and FORMAT fields requires format-aware tools.

Core Highlights

Alignment reader parses BAM and SAM files with full access to read properties. Region fetcher extracts reads from specific genomic intervals using BAI indexes. Pileup generator stacks reads at positions for coverage analysis. Variant parser reads VCF and BCF files with structured access to variant fields.

How to Use It?

Basic Usage

import pysam

bam = pysam\
  .AlignmentFile(
    'aligned.bam',
    'rb')

print(
  f'References: '
  f'{bam.nreferences}')
print(
  f'Mapped: '
  f'{bam.mapped}')

for read in bam.fetch(
  'chr1',
  1000000,
  1001000
):
  print(
    f'{read.query_name}'
    f' | MAPQ: '
    f'{read'
    f'.mapping_quality}'
    f' | CIGAR: '
    f'{read'
    f'.cigarstring}')

for col in (
  bam.pileup(
    'chr1',
    1000500,
    1000501)
):
  print(
    f'Pos: {col.pos}'
    f' Depth: '
    f'{col.nsegments}')

bam.close()

Real-World Examples

import pysam
import numpy as np

class CoverageAnalyzer:
  def __init__(
    self,
    bam_path: str
  ):
    self.bam = (
      pysam.AlignmentFile(
        bam_path, 'rb'))

  def region_coverage(
    self,
    chrom: str,
    start: int,
    end: int
  ) -> dict:
    depths = []
    for col in (
      self.bam.pileup(
        chrom,
        start, end,
        truncate=True)
    ):
      depths.append(
        col.nsegments)
    if not depths:
      return {
        'mean': 0,
        'median': 0,
        'min': 0,
        'max': 0}
    arr = np.array(
      depths)
    return {
      'mean': float(
        arr.mean()),
      'median': float(
        np.median(arr)),
      'min': int(
        arr.min()),
      'max': int(
        arr.max())}

  def count_reads(
    self,
    chrom: str,
    start: int,
    end: int,
    min_mapq: int = 20
  ) -> int:
    count = 0
    for read in (
      self.bam.fetch(
        chrom,
        start, end)
    ):
      if (read
        .mapping_quality
        >= min_mapq):
        count += 1
    return count

  def close(self):
    self.bam.close()

Advanced Tips

Use the fetch method with until_eof=True to iterate all reads including unmapped ones when calculating global alignment statistics. Filter reads by flag values to exclude duplicates, secondary alignments, and supplementary alignments from analysis. Use count_coverage for fast base-level coverage arrays without iterating individual pileup columns.

When to Use It?

Use Cases

Calculate sequencing coverage depth across target regions for quality control reporting. Extract aligned reads overlapping specific genes or variants for detailed inspection. Parse VCF files to filter variants by quality score and allele frequency thresholds.

Related Topics

pysam, BAM, SAM, VCF, genomics, bioinformatics, sequencing analysis, and htslib.

Important Notes

Requirements

pysam Python package with htslib C library bindings. Indexed BAM files with corresponding BAI index for region-based queries. Reference genome FASTA file for operations that require sequence context.

Usage Recommendations

Do: ensure BAM files are sorted and indexed before attempting region queries since pysam requires index files for random access. Filter reads by mapping quality and alignment flags to exclude low-quality data from analysis. Use context managers or explicitly close file handles to prevent resource leaks with large BAM files.

Don't: iterate all reads in a BAM file when only specific regions are needed since region queries with fetch are orders of magnitude faster. Modify alignment records without understanding how changes affect downstream tools. Assume read coordinates are zero-based or one-based without checking since pysam uses zero-based half-open intervals.

Limitations

pysam requires compiled C extensions that can be difficult to install on some platforms. Large BAM files with billions of reads consume significant memory during full-file iteration. Some advanced htslib features are not exposed through the Python interface.