Pydeseq2

Advanced Pydeseq2 automation and integration for differential gene expression analysis

Source: K-Dense-AI/claude-scientific-skills

PyDESeq2 is a community skill for performing differential gene expression analysis using the PyDESeq2 Python library, covering count normalization, dispersion estimation, statistical testing, results filtering, and visualization for RNA-seq data analysis.

What Is This?

Overview

PyDESeq2 provides tools for identifying genes that are differentially expressed between experimental conditions from RNA-seq count data. It covers count normalization that adjusts for library size differences between samples using median-of-ratios scaling, dispersion estimation that models gene-level variance using negative binomial distributions with shrinkage, statistical testing that performs Wald tests to identify significantly changed genes, results filtering that applies log fold change and adjusted p-value cutoffs, and visualization that generates volcano plots and MA plots for result interpretation. The skill enables bioinformaticians to run differential expression analysis entirely in Python.

Who Should Use This

This skill serves bioinformaticians analyzing RNA-seq experiments for differentially expressed genes, computational biologists building analysis pipelines in Python, and researchers comparing gene expression between treatment and control conditions.

Why Use It?

Problems It Solves

The original DESeq2 implementation requires R making it difficult to integrate into Python-based analysis pipelines. Raw RNA-seq counts need normalization for library size and composition bias before comparison. Genes with low expression counts exhibit high variance requiring statistical models that account for overdispersion. Multiple testing across thousands of genes inflates false positive rates without proper correction.

Core Highlights

Normalizer adjusts count data for library size using median-of-ratios method. Dispersion estimator models gene variance with negative binomial shrinkage. Wald tester identifies differentially expressed genes with multiple testing correction. Result plotter generates volcano and MA plots for interpretation.

How to Use It?

Basic Usage

import pandas as pd
from pydeseq2.dds import (
  DeseqDataSet)
from pydeseq2.ds import (
  DeseqStats)

counts = pd.read_csv(
  'counts.csv',
  index_col=0)
metadata = pd.DataFrame({
  'condition': [
    'control'] * 3 +
    ['treated'] * 3},
  index=counts.columns)

dds = DeseqDataSet(
  counts=counts.T,
  metadata=metadata,
  design_factors=
    'condition')

dds.deseq2()

stat = DeseqStats(
  dds,
  contrast=[
    'condition',
    'treated',
    'control'])
stat.summary()

results = (
  stat.results_df)
sig = results[
  (results['padj']
   < 0.05) &
  (results['log2FoldChange']
   .abs() > 1)]
print(
  f'Significant: '
  f'{len(sig)} genes')

Real-World Examples

import pandas as pd
from pydeseq2.dds import (
  DeseqDataSet)
from pydeseq2.ds import (
  DeseqStats)

class DEAnalysis:
  def __init__(
    self,
    counts: pd.DataFrame,
    metadata:
      pd.DataFrame,
    factor: str
  ):
    self.dds = (
      DeseqDataSet(
        counts=counts,
        metadata=
          metadata,
        design_factors=
          factor))
    self.factor = factor
    self.dds.deseq2()

  def contrast(
    self,
    test: str,
    ref: str,
    padj: float = 0.05,
    lfc: float = 1.0
  ) -> pd.DataFrame:
    stat = DeseqStats(
      self.dds,
      contrast=[
        self.factor,
        test, ref])
    stat.summary()
    df = stat.results_df
    sig = df[
      (df['padj']
       < padj) &
      (df[
        'log2FoldChange']
       .abs() > lfc)]
    return sig

Advanced Tips

Pre-filter genes with very low counts across all samples before running the analysis to reduce multiple testing burden and speed up computation. Use log fold change shrinkage for ranking genes when the goal is to identify the most biologically meaningful changes rather than just statistically significant ones. Compare results with different contrast specifications to verify that the reference level does not affect biological conclusions.

When to Use It?

Use Cases

Identify genes differentially expressed between drug-treated and control samples in an RNA-seq experiment. Compare expression across multiple conditions by running pairwise contrasts from a single fitted model. Build a Python pipeline that combines count quantification with differential expression analysis.

Important Notes

Requirements

PyDESeq2 Python package with pandas and NumPy dependencies. Gene count matrix with samples as columns and genes as rows. Sample metadata table linking sample names to experimental conditions.

Usage Recommendations

Do: filter out genes with fewer than 10 total counts across samples before analysis. Use the Benjamini-Hochberg adjusted p-values for significance thresholds to control the false discovery rate. Inspect dispersion plots to verify that the shrinkage estimation converged properly.

Don't: apply differential expression analysis to normalized values since PyDESeq2 expects raw integer counts as input. Use fold change alone without statistical testing since highly variable low-count genes produce large fold changes by chance. Compare results directly between separate PyDESeq2 runs with different sample sets.

Limitations

PyDESeq2 is designed for bulk RNA-seq and does not support single-cell count data analysis. The negative binomial model assumes biological replicates which limits analysis of unreplicated designs. Large datasets with many samples increase computation time for dispersion estimation.

More Skills You Might Like

Explore similar skills to enhance your workflow