scvi-tools

Automate and integrate scvi-tools for deep learning-based single-cell data analysis

scvi-tools is a Python library for probabilistic analysis of single-cell omics data using deep generative models. It provides variational-inference-based models for dimensionality reduction, batch effect correction, differential expression analysis, cell type annotation, and multi-modal data integration, enabling scalable analysis of large single-cell datasets.

What Is This?

Overview

scvi-tools provides a unified framework for applying deep learning-based probabilistic models to single-cell genomics data. It addresses dimensionality reduction through variational autoencoders that learn latent representations of cells, batch effect removal that harmonizes data from different experiments, and differential expression testing based on Bayesian statistics. It also supports automated cell type classification through transfer learning and multi-modal integration that combines RNA, protein, and chromatin accessibility measurements from the same cells. The library is built on PyTorch and integrates directly with the AnnData and Scanpy ecosystem, making it straightforward to incorporate into existing single-cell analysis pipelines.

Who Should Use This

This skill serves bioinformaticians analyzing single-cell RNA sequencing experiments, computational biologists integrating datasets from multiple studies, genomics researchers performing differential expression analysis, and data scientists working with high-dimensional biological datasets. Researchers building large-scale cell atlases or performing cross-study comparisons will find particular value in its batch correction and transfer learning capabilities.

Why Use It?

Problems It Solves

Single-cell datasets contain thousands of genes measured across hundreds of thousands to millions of cells, creating extremely high-dimensional data that traditional statistical methods struggle to analyze efficiently and accurately. Batch effects from different sequencing runs confound biological signals. Standard normalization approaches make assumptions that may not hold for count-based sequencing data. Processing large datasets with conventional tools hits memory and computation limits, making GPU-accelerated deep learning approaches a practical necessity for modern atlas-scale studies.

Core Highlights

The library uses GPU-accelerated variational inference for fast model training on large datasets. Its probabilistic framework properly accounts for the count nature of sequencing data. Pre-trained models enable transfer learning for cell type annotation without retraining. The modular architecture lets researchers combine different model components for custom analysis pipelines.

How to Use It?

Basic Usage

import scvi
import scanpy as sc

adata = sc.read_h5ad("pbmc_dataset.h5ad")

# Preserve raw counts for scVI; normalization of .X is only for HVG selection and plotting
adata.layers["counts"] = adata.X.copy()
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000, subset=True)

# Register the raw counts layer and the batch covariate, then train the model
scvi.model.SCVI.setup_anndata(adata, layer="counts", batch_key="batch")
model = scvi.model.SCVI(adata, n_latent=30, n_layers=2)
model.train(max_epochs=200, early_stopping=True)

# Use the batch-corrected latent space in place of PCA for downstream steps
latent = model.get_latent_representation()
adata.obsm["X_scVI"] = latent
sc.pp.neighbors(adata, use_rep="X_scVI")
sc.tl.umap(adata)
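
Beyond the latent representation, the trained model can decode denoised, batch-corrected expression values with get_normalized_expression; a minimal sketch, where the library_size scaling is an illustrative choice:

# Denoised, batch-corrected expression scaled to a common library size
adata.layers["scvi_normalized"] = model.get_normalized_expression(library_size=1e4)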

Real-World Examples

# Bayesian differential expression between two annotated groups of cells
de_results = model.differential_expression(
    groupby="cell_type",
    group1="CD4 T cells",
    group2="CD8 T cells"
)

# Keep genes with strong evidence (high Bayes factor) and a meaningful effect size
significant = de_results[
    (de_results["bayes_factor"] > 3) &
    (abs(de_results["lfc_mean"]) > 0.5)
].sort_values("bayes_factor", ascending=False)

print(f"Significant DE genes: {len(significant)}")
print(significant[["lfc_mean", "bayes_factor"]].head(10))

# TOTALVI jointly models RNA counts and surface protein (CITE-seq) measurements
scvi.model.TOTALVI.setup_anndata(
    adata, layer="counts",
    protein_expression_obsm_key="protein_expression",
    batch_key="batch"
)
totalvi_model = scvi.model.TOTALVI(adata)
totalvi_model.train(max_epochs=200)

# Per-cell, per-protein probability that the measurement is true signal
# rather than ambient/background antibody binding (a single matrix)
protein_foreground = totalvi_model.get_protein_foreground_probability()
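
The joint RNA-protein latent space can feed the same downstream clustering and visualization steps as the scVI embedding; a minimal sketch:

# Cluster and visualize on the joint RNA + protein latent space
adata.obsm["X_totalVI"] = totalvi_model.get_latent_representation()
sc.pp.neighbors(adata, use_rep="X_totalVI")
sc.tl.umap(adata)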

Advanced Tips

Use early stopping with a validation set to prevent overfitting, especially with smaller datasets. Save trained models for reproducibility and transfer learning using model.save(), which preserves all model parameters and training configuration. Leverage the scArches extension for mapping new query datasets onto existing reference atlases without retraining the full model. When working with very large datasets, consider using the train_size parameter to control the proportion of cells used for validation, and monitor training loss curves to confirm convergence before downstream analysis.
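
The save-and-map workflow described above might look like the following sketch using load_query_data; the query dataset adata_query, the directory name, and the training settings are illustrative assumptions:

# Save the trained reference model (weights plus data-setup metadata)
model.save("scvi_reference/", overwrite=True)

# scArches-style surgery: adapt the reference model to a new query dataset
query_model = scvi.model.SCVI.load_query_data(adata_query, "scvi_reference/")
query_model.train(max_epochs=100, plan_kwargs={"weight_decay": 0.0})

# Embed the query cells into the reference latent space
adata_query.obsm["X_scVI"] = query_model.get_latent_representation()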

When to Use It?

Use Cases

Use scvi-tools when integrating single-cell datasets from multiple experiments that have batch effects, when performing differential expression analysis that properly models count data, when building cell type reference atlases for automated annotation, or when analyzing multi-modal single-cell data combining gene expression with protein measurements.

Related Topics

Scanpy for preprocessing and visualization, AnnData for data storage, PyTorch for model customization, single-cell genomics analysis workflows, and Bayesian statistical methods all complement scvi-tools usage.

Important Notes

Requirements

Python 3.9 or later with PyTorch installed. AnnData-formatted datasets with raw count matrices. A GPU is strongly recommended for training on datasets with more than 100,000 cells to achieve reasonable training times, but CPU training is supported for smaller datasets.
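
A quick environment check might look like this minimal sketch (the package installs from PyPI as scvi-tools; the torch call only reports whether a CUDA GPU is visible):

# Install first, e.g.: pip install scvi-tools
import scvi
import torch

print(scvi.__version__)           # confirm the installation
print(torch.cuda.is_available())  # True if a CUDA GPU is available for training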

Usage Recommendations

Do: use raw count data as input rather than normalized data, since scvi-tools models the count distribution directly; include batch information during model setup to enable batch correction; and validate results using established marker genes for known cell types. A quick raw-count sanity check is sketched below.
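
A minimal sketch of that sanity check, assuming raw counts are stored in adata.layers["counts"] as in the examples above:

import numpy as np
from scipy.sparse import issparse

# Raw counts should be non-negative integers; normalized or log-transformed
# data will fail this check
X = adata.layers["counts"]
vals = X.data if issparse(X) else np.asarray(X)
assert vals.min() >= 0 and np.allclose(vals, np.round(vals)), \
    "counts layer does not look like raw integer counts"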

Don't: normalize or log-transform counts before passing them to scVI, as the model expects raw integers; don't select highly variable genes from a single batch, as this biases gene selection; and don't skip hyperparameter tuning for production analyses.

Limitations

Model training time scales with dataset size and may require hours for datasets exceeding one million cells. The probabilistic framework assumes specific data distributions that may not fit all assay types perfectly. Transfer learning works best when query and reference datasets use similar sequencing technologies.