Scanpy

Automate and integrate Scanpy for scalable single-cell gene expression analysis

What Is This?

Overview

Scanpy provides tools for analyzing single-cell RNA sequencing (scRNA-seq) data through a complete computational pipeline. Preprocessing filters cells and genes, normalizes expression counts, and identifies highly variable genes. Dimensionality reduction applies PCA and UMAP to visualize cell populations in low-dimensional space. Clustering groups cells by expression similarity using graph-based community detection. Differential expression identifies marker genes that distinguish cell clusters and conditions. Trajectory analysis infers developmental paths and pseudotime ordering across cell states. The skill helps researchers analyze single-cell data from diverse tissue types and experimental conditions.

Who Should Use This

This skill serves bioinformatics researchers processing scRNA-seq experiments, cell biologists studying tissue composition and cell types, and computational biologists building single-cell analysis pipelines.

Why Use It?

Problems It Solves

Single-cell datasets contain thousands of cells and genes requiring efficient computational methods for analysis. Raw count data needs quality filtering, normalization, and feature selection before meaningful biological analysis. Identifying cell types from expression profiles requires dimensionality reduction and unsupervised clustering. Detecting genes that characterize specific cell populations requires statistical testing across clusters. Without a structured pipeline, inconsistent preprocessing choices can introduce technical artifacts that obscure genuine biological variation.

Core Highlights

The preprocessor filters cells and genes, normalizes counts, and selects highly variable genes from count matrices. The embedding generator reduces dimensions with PCA and UMAP for visualization. The cell clusterer groups cells using graph-based community detection. The marker finder identifies differentially expressed genes per cluster.

How to Use It?

Basic Usage

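import scanpy as sc

# Load a 10x Genomics filtered feature-barcode matrix (HDF5)
adata = sc.read_10x_h5('filtered_matrix.h5')
adata.var_names_make_unique()

# Basic quality filtering: drop near-empty cells and rarely detected genes
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)

# Normalize each cell to 10,000 total counts, then log-transform
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

# Keep the 2,000 most variable genes; .copy() turns the view into
# a standalone object so later steps can write results to it
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
adata = adata[:, adata.var.highly_variable].copy()

# PCA, then a neighborhood graph built on the first 30 components
sc.tl.pca(adata)
sc.pp.neighbors(adata, n_pcs=30)

# Leiden clustering (needs the leidenalg package) and UMAP embedding
sc.tl.leiden(adata, resolution=0.5)
sc.tl.umap(adata)
sc.pl.umap(adata, color='leiden', save='_clusters.png')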
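
To make the normalization step concrete, the following sketch reproduces by hand, in plain NumPy, the arithmetic that normalize_total with target_sum=1e4 followed by log1p is understood to perform on a dense matrix. The toy counts are invented for illustration, and the sketch ignores sparse storage and Scanpy's optional parameters.

```python
import numpy as np

# Toy count matrix (hypothetical values): 3 cells x 4 genes
counts = np.array([
    [10, 0, 5, 5],
    [100, 20, 40, 40],
    [1, 1, 0, 2],
], dtype=float)

target_sum = 1e4

# Scale every cell (row) so that its counts add up to target_sum
per_cell_total = counts.sum(axis=1, keepdims=True)
normalized = counts / per_cell_total * target_sum

# Natural-log transform with a pseudocount of 1: log(1 + x)
log_normalized = np.log1p(normalized)

# Cells 0 and 1 differ 10x in sequencing depth, yet gene 0 ends up with
# the same value in both because it makes up half of each cell's counts
print(log_normalized[:, 0])
```

After this scaling, differences between cells reflect relative composition rather than how deeply each cell was sequenced, which is why raw counts are not compared directly.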

Real-World Examples

import pandas as pd
import scanpy as sc


class ScRNAPipeline:
    """Chainable wrapper around a basic Scanpy workflow."""

    def __init__(self, data_path: str):
        self.adata = sc.read_10x_h5(data_path)
        self.adata.var_names_make_unique()

    def preprocess(self, min_genes: int = 200, n_top: int = 2000):
        sc.pp.filter_cells(self.adata, min_genes=min_genes)
        sc.pp.filter_genes(self.adata, min_cells=3)
        sc.pp.normalize_total(self.adata, target_sum=1e4)
        sc.pp.log1p(self.adata)
        # Flags genes in adata.var['highly_variable'] without subsetting;
        # sc.tl.pca uses the flagged genes by default when the flag exists
        sc.pp.highly_variable_genes(self.adata, n_top_genes=n_top)
        return self

    def cluster(self, resolution: float = 0.5):
        sc.tl.pca(self.adata)
        sc.pp.neighbors(self.adata)
        sc.tl.leiden(self.adata, resolution=resolution)
        sc.tl.umap(self.adata)
        return self

    def markers(self, n_genes: int = 5) -> pd.DataFrame:
        """Return the top n_genes ranked genes for every cluster."""
        sc.tl.rank_genes_groups(self.adata, 'leiden')
        df = sc.get.rank_genes_groups_df(self.adata, group=None)
        # Rows are ordered by cluster, then by rank, so taking head() per
        # group keeps the top genes of each cluster, not only the first one
        return df.groupby('group', observed=True).head(n_genes)


pipe = ScRNAPipeline('matrix.h5')
pipe.preprocess().cluster()
markers = pipe.markers()
print(markers)

Advanced Tips

Adjust the Leiden resolution parameter to control clustering granularity based on the expected cell type diversity. For example, immune cell datasets with many subtypes may benefit from higher resolution values around 1.0, while simpler tissues may cluster well at 0.3. Use rank_genes_groups with method='wilcoxon' for robust marker gene identification across clusters. Save processed AnnData objects in h5ad format for efficient storage and reloading. Preserve the original expression values before they are overwritten: copy raw counts into a layer such as adata.layers['counts'] before normalization, and, if you subset to highly variable genes, assign the log-normalized full matrix to adata.raw first. Note that rank_genes_groups reads adata.raw by default when it is set and expects log-transformed values, so unnormalized counts should not be placed there.

When to Use It?

Use Cases

Process a 10x Genomics scRNA-seq dataset through filtering, normalization, and clustering. Identify marker genes for each cell cluster to annotate cell types. Visualize cell populations on UMAP embeddings colored by cluster identity.

Related Topics

Scanpy, single-cell RNA-seq, scRNA-seq, bioinformatics, cell clustering, gene expression, and AnnData.

Important Notes

Requirements

The Scanpy Python package, which uses AnnData for data representation and storage. Leiden clustering additionally requires the leidenalg package. A single-cell count matrix in a supported input format such as 10x Genomics H5, CSV, or loom. Sufficient system memory to hold large cell-by-gene matrices from modern high-throughput experiments with thousands of cells.

Usage Recommendations

Do: inspect quality metrics before filtering to set appropriate thresholds for each dataset. Use multiple resolution values when clustering to explore different levels of cell type granularity. Validate marker genes against known cell type signatures from curated databases such as CellMarker or PanglaoDB to support confident cell type annotation.

Don't: apply the same filtering thresholds to every dataset, since tissue types have different quality distributions; skip normalization, since raw counts are not directly comparable across cells; or over-cluster with high resolution values that split biologically coherent populations.

Limitations

Memory requirements scale with the number of cells and genes, and very large experiments may require backed-mode or out-of-core processing. Clustering results depend on preprocessing choices and resolution parameters. Marker gene analysis identifies statistical associations and requires experimental validation to confirm the functional relevance of discovered expression patterns.