deepTools

Automate and integrate deepTools into genomic data analysis pipelines

deepTools is a community skill for analyzing high-throughput sequencing data using the deepTools suite, covering BAM file processing, bigWig signal generation, heatmap visualization, correlation analysis, and quality control for genomics workflows.

What Is This?

Overview

This skill provides patterns for processing and visualizing next-generation sequencing data from ChIP-seq, ATAC-seq, and RNA-seq experiments. It covers BAM file quality assessment (fragment size distribution and GC bias analysis), signal normalization and bigWig generation for genome browser visualization, reference-point and scale-regions heatmaps around genomic features, sample correlation matrices for comparing replicates and conditions, and genome-wide signal quantification with multiBamSummary. The skill helps bioinformaticians build reproducible analysis pipelines for epigenomic and transcriptomic data.

Who Should Use This

This skill serves bioinformaticians analyzing ChIP-seq and ATAC-seq datasets, genomics researchers visualizing enrichment patterns around genes and regulatory elements, and core facility staff building quality control pipelines for sequencing experiments.

Why Use It?

Problems It Solves

Visualizing sequencing signal around genomic features requires binning and normalization that manual scripts implement inconsistently. Comparing signal between samples needs normalization methods that account for sequencing depth and library complexity. Quality control of BAM files requires multiple metrics that are tedious to compute individually. Generating publication-ready heatmaps from raw alignment data involves many intermediate processing steps.

Core Highlights

bamCoverage generates normalized bigWig files from BAM alignments, with RPKM, CPM, BPM, and RPGC normalization options. computeMatrix builds signal matrices around reference points or scaled regions for heatmap plotting. plotHeatmap creates publication-quality enrichment heatmaps with optional clustering. plotCorrelation computes and visualizes sample similarity matrices from multiBamSummary output.

How to Use It?

Basic Usage

import subprocess
import os

def bam_to_bigwig(bam_path: str,
                  output_path: str,
                  normalize: str = "RPKM",
                  bin_size: int = 10) -> str:
    cmd = ["bamCoverage",
           "-b", bam_path,
           "-o", output_path,
           "--normalizeUsing", normalize,
           "--binSize", str(bin_size),
           "-p", "4"]
    subprocess.run(cmd, check=True)
    return output_path

def compute_matrix(bigwigs: list[str],
                   bed_file: str,
                   output: str,
                   mode: str = "reference-point",
                   upstream: int = 3000,
                   downstream: int = 3000) -> str:
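    cmd = ["computeMatrix", mode,
           "-S"] + bigwigs + [
           "-R", bed_file,
           "-o", output,
           # -a: bases after the region or reference point (downstream)
           "-a", str(downstream),
           # -b: bases before the region or reference point (upstream)
           "-b", str(upstream),
           "-p", "4"]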
    subprocess.run(cmd, check=True)
    return output

bam_to_bigwig("sample.bam", "sample.bw")
compute_matrix(["sample.bw"], "genes.bed",
               "matrix.gz")

Real-World Examples

import os
import subprocess

class ChIPseqPipeline:
    def __init__(self, output_dir: str):
        self.output_dir = output_dir
        os.makedirs(output_dir, exist_ok=True)

    def generate_heatmap(
            self, bigwigs: list[str],
            regions: str,
            labels: list[str]) -> str:
        matrix = os.path.join(
            self.output_dir, "matrix.gz")
        cmd = ["computeMatrix", "reference-point",
               "-S"] + bigwigs + [
               "-R", regions,
               "-o", matrix,
               "-a", "3000", "-b", "3000"]
        subprocess.run(cmd, check=True)
        plot = os.path.join(
            self.output_dir, "heatmap.png")
        cmd = ["plotHeatmap",
               "-m", matrix,
               "-o", plot,
               "--samplesLabel"] + labels + [
               "--colorMap", "RdBu_r"]
        subprocess.run(cmd, check=True)
        return plot

    def sample_correlation(
            self, bam_files: list[str],
            labels: list[str]) -> str:
        summary = os.path.join(
            self.output_dir, "summary.npz")
        cmd = ["multiBamSummary", "bins",
               "-b"] + bam_files + [
               "-o", summary,
               "--binSize", "10000"]
        subprocess.run(cmd, check=True)
        plot = os.path.join(
            self.output_dir, "correlation.png")
        cmd = ["plotCorrelation",
               "-in", summary,
               "-o", plot,
               "--corMethod", "pearson",
               # plotCorrelation requires --whatToPlot (heatmap or scatterplot)
               "--whatToPlot", "heatmap",
               "--labels"] + labels
        subprocess.run(cmd, check=True)
        return plot

Advanced Tips

Set --effectiveGenomeSize when using RPGC normalization in bamCoverage so that coverage is scaled to the mappable portion of the genome. Pass multiple region files to computeMatrix to compare signal across different genomic feature classes. Run plotFingerprint before downstream analysis to assess ChIP enrichment quality and identify failed experiments.
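The three tips above can be sketched as small command builders. This is a minimal sketch, not a definitive implementation: the function names are hypothetical, the effective genome size must be looked up for the assembly in use, and the builders only return argument lists so they can be inspected before being passed to subprocess.run.

```python
def rpgc_coverage_cmd(bam_path: str,
                      output_path: str,
                      effective_genome_size: int,
                      bin_size: int = 10) -> list[str]:
    """Build a bamCoverage call that uses RPGC (1x coverage) normalization."""
    return ["bamCoverage",
            "-b", bam_path,
            "-o", output_path,
            "--normalizeUsing", "RPGC",
            # Assumption: the caller supplies the value for their assembly
            "--effectiveGenomeSize", str(effective_genome_size),
            "--binSize", str(bin_size)]


def multi_region_matrix_cmd(bigwigs: list[str],
                            region_files: list[str],
                            output: str,
                            flank: int = 3000) -> list[str]:
    """Build a computeMatrix call with one region group per BED file."""
    return ["computeMatrix", "reference-point",
            "-S", *bigwigs,
            "-R", *region_files,
            "-o", output,
            "-b", str(flank),
            "-a", str(flank)]


def fingerprint_cmd(bam_files: list[str],
                    labels: list[str],
                    plot_path: str) -> list[str]:
    """Build a plotFingerprint call for ChIP and input samples."""
    return ["plotFingerprint",
            "-b", *bam_files,
            "--labels", *labels,
            "-o", plot_path]
```

Each returned list can be run with subprocess.run(cmd, check=True), in the same way as the functions in Basic Usage.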

When to Use It?

Use Cases

Build a ChIP-seq quality control pipeline that assesses enrichment and generates correlation plots between replicates. Create a promoter signal visualization that displays histone modification patterns around transcription start sites. Implement a workflow that compares ChIP-seq signal between conditions as a first step toward differential binding analysis.
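For the condition comparison use case, one possible starting point is a log2 ratio track from bamCompare. This is a sketch under assumptions: the function name and default bin size are hypothetical, and a ratio track shows where signal differs but does not replace a statistical differential binding test.

```python
def log2_ratio_cmd(treatment_bam: str,
                   control_bam: str,
                   output_bw: str,
                   bin_size: int = 50) -> list[str]:
    """Build a bamCompare call that writes a log2(treatment/control) bigWig."""
    return ["bamCompare",
            "-b1", treatment_bam,
            "-b2", control_bam,
            "-o", output_bw,
            "--operation", "log2",
            "--binSize", str(bin_size)]
```

The resulting bigWig can be passed to computeMatrix like any other signal track.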

Related Topics

ChIP-seq analysis, ATAC-seq processing, epigenomics, genome browser visualization, and next-generation sequencing quality control.

Important Notes

Requirements

deepTools installed via pip or conda. Sorted and indexed BAM files as input. BED files defining genomic regions of interest for heatmap generation.
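These requirements can be checked before a pipeline starts. The sketch below is an assumption about how such a preflight check might look: the helper names and the list of tools are hypothetical, and the index file names cover common conventions rather than every layout.

```python
import os
import shutil

# Hypothetical default: the deepTools commands used elsewhere in this skill
REQUIRED_TOOLS = ("bamCoverage", "computeMatrix", "plotHeatmap",
                  "multiBamSummary", "plotCorrelation")


def missing_tools(tools=REQUIRED_TOOLS) -> list[str]:
    """Return the commands that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def has_bam_index(bam_path: str) -> bool:
    """Check for an index file next to the BAM file.

    Looks for sample.bam.bai, sample.bai, and sample.bam.csi.
    """
    stem, _ = os.path.splitext(bam_path)
    candidates = (bam_path + ".bai", stem + ".bai", bam_path + ".csi")
    return any(os.path.exists(path) for path in candidates)
```

A pipeline can stop early with a clear message if missing_tools() returns a non-empty list or has_bam_index() returns False for any input.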

Usage Recommendations

Do: normalize bigWig files with a method appropriate for the experiment type, for example RPKM for ChIP-seq and CPM for ATAC-seq, and apply the same method to every sample being compared. Index BAM files before running deepTools commands. Use consistent bin sizes across samples for fair signal comparison.

Don't: compare bigWig files generated with different normalization methods, use reference-point mode for regions with highly variable lengths where scale-regions mode is appropriate, or skip quality control steps like plotFingerprint that identify problematic samples early.
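For regions of variable length, such as gene bodies, a scale-regions call might look like the sketch below. The function name and the default body and flank lengths are hypothetical choices for illustration.

```python
def scale_regions_cmd(bigwigs: list[str],
                      bed_file: str,
                      output: str,
                      body_length: int = 5000,
                      flank: int = 3000) -> list[str]:
    """Build a computeMatrix call that scales every region to one length."""
    return ["computeMatrix", "scale-regions",
            "-S", *bigwigs,
            "-R", bed_file,
            "-o", output,
            # Every region is stretched or shrunk to this many bases
            "--regionBodyLength", str(body_length),
            "-b", str(flank),
            "-a", str(flank)]
```

Because every region is rescaled to the same body length, rows in the resulting heatmap line up at both the start and end of each feature.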

Limitations

deepTools operates on aligned reads and does not perform read mapping. Heatmap rendering time scales with the number of regions and samples. RPGC normalization requires the effective genome size for the reference assembly.