Pydeseq2
Advanced Pydeseq2 automation and integration for differential gene expression analysis
PyDESeq2 is a community skill for performing differential gene expression analysis using the PyDESeq2 Python library, covering count normalization, dispersion estimation, statistical testing, results filtering, and visualization for RNA-seq data analysis.
What Is This?
Overview
PyDESeq2 provides tools for identifying genes that are differentially expressed between experimental conditions from RNA-seq count data. It covers count normalization that adjusts for library size differences between samples using median-of-ratios scaling, dispersion estimation that models gene-level variance using negative binomial distributions with shrinkage, statistical testing that performs Wald tests to identify significantly changed genes, results filtering that applies log fold change and adjusted p-value cutoffs, and visualization that generates volcano plots and MA plots for result interpretation. The skill enables bioinformaticians to run differential expression analysis entirely in Python.
Who Should Use This
This skill serves bioinformaticians analyzing RNA-seq experiments for differentially expressed genes, computational biologists building analysis pipelines in Python, and researchers comparing gene expression between treatment and control conditions.
Why Use It?
Problems It Solves
The original DESeq2 implementation requires R making it difficult to integrate into Python-based analysis pipelines. Raw RNA-seq counts need normalization for library size and composition bias before comparison. Genes with low expression counts exhibit high variance requiring statistical models that account for overdispersion. Multiple testing across thousands of genes inflates false positive rates without proper correction.
Core Highlights
Normalizer adjusts count data for library size using median-of-ratios method. Dispersion estimator models gene variance with negative binomial shrinkage. Wald tester identifies differentially expressed genes with multiple testing correction. Result plotter generates volcano and MA plots for interpretation.
How to Use It?
Basic Usage
import pandas as pd
from pydeseq2.dds import (
DeseqDataSet)
from pydeseq2.ds import (
DeseqStats)
counts = pd.read_csv(
'counts.csv',
index_col=0)
metadata = pd.DataFrame({
'condition': [
'control'] * 3 +
['treated'] * 3},
index=counts.columns)
dds = DeseqDataSet(
counts=counts.T,
metadata=metadata,
design_factors=
'condition')
dds.deseq2()
stat = DeseqStats(
dds,
contrast=[
'condition',
'treated',
'control'])
stat.summary()
results = (
stat.results_df)
sig = results[
(results['padj']
< 0.05) &
(results['log2FoldChange']
.abs() > 1)]
print(
f'Significant: '
f'{len(sig)} genes')Real-World Examples
import pandas as pd
from pydeseq2.dds import (
DeseqDataSet)
from pydeseq2.ds import (
DeseqStats)
class DEAnalysis:
def __init__(
self,
counts: pd.DataFrame,
metadata:
pd.DataFrame,
factor: str
):
self.dds = (
DeseqDataSet(
counts=counts,
metadata=
metadata,
design_factors=
factor))
self.factor = factor
self.dds.deseq2()
def contrast(
self,
test: str,
ref: str,
padj: float = 0.05,
lfc: float = 1.0
) -> pd.DataFrame:
stat = DeseqStats(
self.dds,
contrast=[
self.factor,
test, ref])
stat.summary()
df = stat.results_df
sig = df[
(df['padj']
< padj) &
(df[
'log2FoldChange']
.abs() > lfc)]
return sigAdvanced Tips
Pre-filter genes with very low counts across all samples before running the analysis to reduce multiple testing burden and speed up computation. Use log fold change shrinkage for ranking genes when the goal is to identify the most biologically meaningful changes rather than just statistically significant ones. Compare results with different contrast specifications to verify that the reference level does not affect biological conclusions.
When to Use It?
Use Cases
Identify genes differentially expressed between drug-treated and control samples in an RNA-seq experiment. Compare expression across multiple conditions by running pairwise contrasts from a single fitted model. Build a Python pipeline that combines count quantification with differential expression analysis.
Related Topics
PyDESeq2, RNA-seq, differential expression, bioinformatics, gene expression, statistical testing, and transcriptomics.
Important Notes
Requirements
PyDESeq2 Python package with pandas and NumPy dependencies. Gene count matrix with samples as columns and genes as rows. Sample metadata table linking sample names to experimental conditions.
Usage Recommendations
Do: filter out genes with fewer than 10 total counts across samples before analysis. Use the Benjamini-Hochberg adjusted p-values for significance thresholds to control the false discovery rate. Inspect dispersion plots to verify that the shrinkage estimation converged properly.
Don't: apply differential expression analysis to normalized values since PyDESeq2 expects raw integer counts as input. Use fold change alone without statistical testing since highly variable low-count genes produce large fold changes by chance. Compare results directly between separate PyDESeq2 runs with different sample sets.
Limitations
PyDESeq2 is designed for bulk RNA-seq and does not support single-cell count data analysis. The negative binomial model assumes biological replicates which limits analysis of unreplicated designs. Large datasets with many samples increase computation time for dispersion estimation.
More Skills You Might Like
Explore similar skills to enhance your workflow
Scientific Critical Thinking
Scientific Critical Thinking automation and integration
Sentencepiece
SentencePiece tokenization automation and integration for NLP pipelines
Using Agent Skills
When a task arrives, identify the development phase and apply the corresponding skill:
Qiskit
Advanced Qiskit automation and integration for quantum computing and circuit development
Aero Workflow Automation
Automate Aero Workflow tasks via Rube MCP (Composio)
Ambient Weather Automation
Automate Ambient Weather tasks via Rube MCP (Composio)