Anndata
Automate AnnData processing and integrate large-scale genomic data analysis into your research workflows
Anndata is a community skill for working with annotated data matrices in single-cell genomics, covering AnnData object creation, data manipulation, metadata management, file I/O, and integration with analysis frameworks for computational biology workflows.
What Is This?
Overview
Anndata provides patterns for managing annotated data matrices commonly used in single-cell RNA sequencing analysis. It covers AnnData object construction from expression matrices with observation and variable annotations, data subsetting and filtering operations, layer management for storing multiple representations of the same data, unstructured annotation storage for analysis results, and file serialization in H5AD format. The skill enables bioinformaticians to build reproducible single-cell analysis pipelines with properly structured data objects that remain consistent across analysis steps.
Who Should Use This
This skill serves computational biologists analyzing single-cell genomics datasets, bioinformatics developers building analysis pipelines with Scanpy, and researchers managing large annotated expression matrices. It is particularly relevant for teams working with multi-sample or multi-condition experiments where consistent data organization is critical.
Why Use It?
Problems It Solves
Gene expression matrices lack standardized structure for attaching cell and gene metadata. Storing multiple data representations like raw counts and normalized values requires separate objects without a unified container. Analysis results scattered across separate files make reproducibility difficult. Large datasets exceed memory when loaded entirely into dense matrix formats.
Core Highlights
The AnnData object stores expression data alongside cell and gene annotations in a single structure. Layers hold multiple matrix representations such as raw, normalized, and scaled data. Sparse matrix support enables memory-efficient handling of large datasets. H5AD file format provides persistent storage with fast partial loading, making it practical to share complete analysis states between collaborators or pipeline stages.
How to Use It?
Basic Usage
import anndata as ad
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Simulate a sparse count matrix of 500 cells by 2000 genes
n_cells, n_genes = 500, 2000
counts = csr_matrix(np.random.poisson(0.5, (n_cells, n_genes)))

# Build the AnnData object with cell (obs) and gene (var) annotations
adata = ad.AnnData(
    X=counts,
    obs=pd.DataFrame(
        {
            "cell_type": np.random.choice(["T-cell", "B-cell", "Monocyte"], n_cells),
            "sample": np.random.choice(["S1", "S2"], n_cells),
        },
        index=[f"cell_{i}" for i in range(n_cells)],
    ),
    var=pd.DataFrame(
        {"gene_name": [f"Gene_{i}" for i in range(n_genes)]},
        index=[f"ENSG{i:05d}" for i in range(n_genes)],
    ),
)

# Preserve the raw counts in a layer before any transformation of X
adata.layers["raw"] = adata.X.copy()

print(f"Shape: {adata.shape}")
print(f"Cell types: {adata.obs['cell_type'].unique()}")
Real-World Examples
import anndata as ad
import numpy as np

# adata is the object constructed in Basic Usage above

# Subset cells by metadata
t_cells = adata[adata.obs["cell_type"] == "T-cell"]
print(f"T-cells: {t_cells.shape[0]}")

# Filter out lowly expressed genes
gene_means = np.array(adata.X.mean(axis=0)).flatten()
expressed = gene_means > 0.1
adata_filtered = adata[:, expressed]
print(f"Genes after filter: {adata_filtered.shape[1]}")

# Store embeddings (placeholder values here) and analysis parameters
adata.obsm["X_pca"] = np.random.randn(adata.shape[0], 50)
adata.obsm["X_umap"] = np.random.randn(adata.shape[0], 2)
adata.uns["analysis_params"] = {"n_pcs": 50, "n_neighbors": 15}

# Round-trip through the H5AD file format
adata.write("dataset.h5ad")
adata_loaded = ad.read_h5ad("dataset.h5ad")
print(f"Loaded: {adata_loaded.shape}")
Advanced Tips
Use backed mode to read large H5AD files without loading the full matrix into memory, which is essential when working with datasets that have millions of cells. Pass backed="r" to read_h5ad to enable this mode and access only the slices needed for a given operation. Store raw counts in a layer before normalization to preserve the original data for downstream analyses that require it. Concatenate multiple AnnData objects with batch keys to merge datasets from different experiments while tracking their origin. When concatenating, set join="outer" to retain all genes across datasets and fill missing values with zeros.
When to Use It?
Use Cases
Build a single-cell analysis pipeline that processes expression data through normalization, dimensionality reduction, and clustering. Create a data integration workflow that merges multiple scRNA-seq datasets with batch correction. Implement a gene expression visualization tool that reads AnnData objects and generates plots.
Related Topics
Scanpy analysis framework, single-cell RNA sequencing, sparse matrix operations, H5AD file format, and bioinformatics data management.
Important Notes
Requirements
Python with the anndata package installed. NumPy and pandas for matrix operations, metadata management, and data frame indexing. SciPy for sparse matrix support when handling large datasets efficiently.
Usage Recommendations
Do: use sparse matrices for expression data to reduce memory usage; store raw counts in a separate layer before applying normalization; and add descriptive metadata to obs and var for reproducible analysis tracking.
Don't: load large datasets in dense format when a sparse representation is available; modify the X matrix without preserving the original in a layer; or skip writing analysis parameters to uns, which makes results difficult to reproduce.
Limitations
Very large datasets may still exceed available memory even with sparse matrices. H5AD files do not support concurrent write access from multiple processes, requiring coordination in parallel pipelines. Complex nested metadata structures may not serialize cleanly to the H5AD format, requiring flattening before saving.