Anndata
Automate AnnData processing and integrate large-scale genomic data analysis into your research workflows
Anndata is a community skill for working with annotated data matrices in single-cell genomics, covering AnnData object creation, data manipulation, metadata management, file I/O, and integration with analysis frameworks for computational biology workflows.
What Is This?
Overview
Anndata provides patterns for managing annotated data matrices commonly used in single-cell RNA sequencing analysis. It covers AnnData object construction from expression matrices with observation and variable annotations, data subsetting and filtering operations, layer management for storing multiple representations of the same data, unstructured annotation storage for analysis results, and file serialization in H5AD format. The skill enables bioinformaticians to build reproducible single-cell analysis pipelines with properly structured data objects that remain consistent across analysis steps.
Who Should Use This
This skill serves computational biologists analyzing single-cell genomics datasets, bioinformatics developers building analysis pipelines with Scanpy, and researchers managing large annotated expression matrices. It is particularly relevant for teams working with multi-sample or multi-condition experiments where consistent data organization is critical.
Why Use It?
Problems It Solves
Gene expression matrices lack standardized structure for attaching cell and gene metadata. Storing multiple data representations like raw counts and normalized values requires separate objects without a unified container. Analysis results scattered across separate files make reproducibility difficult. Large datasets exceed memory when loaded entirely into dense matrix formats.
Core Highlights
The AnnData object stores expression data alongside cell and gene annotations in a single structure. Layers hold multiple matrix representations such as raw, normalized, and scaled data. Sparse matrix support enables memory-efficient handling of large datasets. H5AD file format provides persistent storage with fast partial loading, making it practical to share complete analysis states between collaborators or pipeline stages.
How to Use It?
Basic Usage
import anndata as ad
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Simulate a sparse count matrix of 500 cells by 2000 genes
n_cells, n_genes = 500, 2000
counts = csr_matrix(np.random.poisson(0.5, (n_cells, n_genes)))

# Build the AnnData object with cell (obs) and gene (var) annotations
adata = ad.AnnData(
    X=counts,
    obs=pd.DataFrame(
        {
            "cell_type": np.random.choice(["T-cell", "B-cell", "Monocyte"], n_cells),
            "sample": np.random.choice(["S1", "S2"], n_cells),
        },
        index=[f"cell_{i}" for i in range(n_cells)],
    ),
    var=pd.DataFrame(
        {"gene_name": [f"Gene_{i}" for i in range(n_genes)]},
        index=[f"ENSG{i:05d}" for i in range(n_genes)],
    ),
)

# Preserve the raw counts in a layer before any transformation of X
adata.layers["raw"] = adata.X.copy()

print(f"Shape: {adata.shape}")
print(f"Cell types: {adata.obs['cell_type'].unique()}")
Real-World Examples
import anndata as ad
import numpy as np

# adata is the object constructed in Basic Usage above

# Subset cells by metadata
t_cells = adata[adata.obs["cell_type"] == "T-cell"]
print(f"T-cells: {t_cells.shape[0]}")

# Filter out lowly expressed genes
gene_means = np.array(adata.X.mean(axis=0)).flatten()
expressed = gene_means > 0.1
adata_filtered = adata[:, expressed]
print(f"Genes after filter: {adata_filtered.shape[1]}")

# Store embeddings (placeholder values here) and analysis parameters
adata.obsm["X_pca"] = np.random.randn(adata.shape[0], 50)
adata.obsm["X_umap"] = np.random.randn(adata.shape[0], 2)
adata.uns["analysis_params"] = {"n_pcs": 50, "n_neighbors": 15}

# Round-trip through the H5AD file format
adata.write("dataset.h5ad")
adata_loaded = ad.read_h5ad("dataset.h5ad")
print(f"Loaded: {adata_loaded.shape}")
Advanced Tips
Use backed mode to read large H5AD files without loading the full matrix into memory, which is essential when working with datasets that have millions of cells. Pass backed="r" to read_h5ad to enable this mode and access only the slices needed for a given operation. Store raw counts in a layer before normalization to preserve the original data for downstream analyses that require it. Concatenate multiple AnnData objects with batch keys to merge datasets from different experiments while tracking their origin. When concatenating, set join="outer" to retain all genes across datasets and fill missing values with zeros.
When to Use It?
Use Cases
Build a single-cell analysis pipeline that processes expression data through normalization, dimensionality reduction, and clustering. Create a data integration workflow that merges multiple scRNA-seq datasets with batch correction. Implement a gene expression visualization tool that reads AnnData objects and generates plots.
Related Topics
Scanpy analysis framework, single-cell RNA sequencing, sparse matrix operations, H5AD file format, and bioinformatics data management.
Important Notes
Requirements
Python with the anndata package installed. NumPy and pandas for matrix operations, metadata management, and data frame indexing. SciPy for sparse matrix support when handling large datasets efficiently.
Usage Recommendations
Do: use sparse matrices for expression data to reduce memory usage; store raw counts in a separate layer before applying normalization; and add descriptive metadata to obs and var for reproducible analysis tracking.
Don't: load large datasets in dense format when a sparse representation is available; modify the X matrix without preserving the original in a layer; or skip writing analysis parameters to uns, which makes results difficult to reproduce.
Limitations
Very large datasets may still exceed available memory even with sparse matrices. H5AD files do not support concurrent write access from multiple processes, requiring coordination in parallel pipelines. Complex nested metadata structures may not serialize cleanly to the H5AD format, requiring flattening before saving.