Cellxgene Census

Efficiently access and integrate single-cell RNA sequencing data using the Cellxgene Census API for research

Cellxgene Census is a community skill for accessing the CZ CELLxGENE Discover Census, a large-scale single-cell RNA sequencing data repository, covering data queries, cell metadata filtering, gene expression retrieval, and integration with analysis frameworks.

What Is This?

Overview

Cellxgene Census provides patterns for programmatically accessing the CELLxGENE Discover Census, which aggregates millions of single-cell observations from published studies. It covers opening census connections with version pinning, querying cell metadata filtered by organism, tissue, and cell type, retrieving gene expression matrices as sparse arrays or AnnData objects, reading dataset-level metadata to track data provenance, and streaming large query results in a memory-efficient way. The skill enables researchers to build analysis pipelines on top of one of the largest curated single-cell data collections.

Who Should Use This

This skill serves computational biologists performing large-scale single-cell meta-analyses, researchers building reference atlases from aggregated public data, and developers creating tools that query single-cell databases.

Why Use It?

Problems It Solves

Downloading and processing individual single-cell datasets from multiple studies is time-consuming, and the results are often inconsistent. Querying specific cell types across organisms and tissues requires standardized metadata that individual datasets often lack. Loading millions of cells at once can exceed available RAM unless results are streamed. Ensuring reproducibility across analyses requires pinning to specific data versions.

Core Highlights

The Census API provides standardized access to millions of cells with curated metadata. Cell filtering queries select specific organisms, tissues, and cell types without downloading full datasets. Expression retrieval returns sparse matrices or AnnData objects compatible with Scanpy. Version pinning ensures reproducibility by locking queries to a specific census release.

How to Use It?

Basic Usage

import cellxgene_census

# Pin the release so the query returns the same cells on every run.
# The context manager closes the census even if the query raises.
with cellxgene_census.open_soma(
        census_version="2024-07-01") as census:
    human = census["census_data"]["homo_sapiens"]

    # Read only the requested obs (cell metadata) columns for brain neurons.
    obs_df = human.obs.read(
        value_filter=(
            "tissue_general == 'brain' and "
            "cell_type == 'neuron'"),
        column_names=["soma_joinid", "cell_type",
                      "tissue", "dataset_id"]
    ).concat().to_pandas()

print(f"Neurons found: {len(obs_df)}")
print(f"Tissues: {obs_df['tissue'].nunique()}")
print(f"Datasets: {obs_df['dataset_id'].nunique()}")

Real-World Examples

import cellxgene_census

def get_cell_counts(organism: str,
                    tissue: str,
                    version: str = "2024-07-01"
                    ) -> dict:
    """Count cells per cell type for one tissue."""
    # The context manager closes the census even if the query raises.
    with cellxgene_census.open_soma(
            census_version=version) as census:
        exp = census["census_data"][organism]
        obs_df = exp.obs.read(
            value_filter=f"tissue_general == '{tissue}'",
            column_names=["cell_type"]
        ).concat().to_pandas()
    return obs_df["cell_type"].value_counts().to_dict()

def get_expression_anndata(
        organism: str,
        cell_filter: str,
        gene_filter: str = "",
        version: str = "2024-07-01"):
    """Return the matching cells and genes as an AnnData object."""
    # Open the census here so it is closed once the AnnData is built;
    # the original version opened it inline and never closed it.
    with cellxgene_census.open_soma(
            census_version=version) as census:
        adata = cellxgene_census.get_anndata(
            census=census,
            organism=organism,
            obs_value_filter=cell_filter,
            # An empty gene filter means "all genes".
            var_value_filter=gene_filter or None)
    return adata

counts = get_cell_counts(
    "homo_sapiens", "lung")
print(f"Cell types: {len(counts)}")

Advanced Tips

Use column_names in obs.read to request only the metadata fields you need, reducing memory usage for large queries. Pin census versions in your analysis scripts to ensure results are reproducible across runs. Filter genes with var_value_filter to retrieve expression data for specific genes of interest rather than the full transcriptome.
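
The tips above can be combined in a small helper. The following is a minimal sketch, not part of the Census API: build_value_filter and read_obs are hypothetical names introduced here for illustration. It assumes that value_filter accepts `==` and `in [...]` clauses joined with `and`, and it rejects values containing quote characters rather than guessing at escaping rules. cellxgene_census is imported inside read_obs so that the filter builder can be used on its own.

```python
def _quote(value: str) -> str:
    """Wrap a value in single quotes, refusing embedded quote characters."""
    if "'" in value or '"' in value:
        raise ValueError(f"unsupported quote character in {value!r}")
    return f"'{value}'"


def build_value_filter(**conditions) -> str:
    """Build a filter string from column=value pairs (hypothetical helper).

    A string value becomes an equality test; a list or tuple becomes
    an `in [...]` test. Clauses are joined with `and`.
    """
    clauses = []
    for column, value in conditions.items():
        if isinstance(value, str):
            clauses.append(f"{column} == {_quote(value)}")
        else:
            items = ", ".join(_quote(v) for v in value)
            clauses.append(f"{column} in [{items}]")
    return " and ".join(clauses)


def read_obs(organism: str, columns: list,
             version: str = "2024-07-01", **conditions):
    """Read only the requested obs columns for the matching cells."""
    import cellxgene_census  # imported lazily; needs network access to run

    with cellxgene_census.open_soma(census_version=version) as census:
        exp = census["census_data"][organism]
        return exp.obs.read(
            # No conditions means no filter at all.
            value_filter=build_value_filter(**conditions) or None,
            column_names=columns,
        ).concat().to_pandas()
```

For example, build_value_filter(tissue_general="lung", cell_type=["T cell", "B cell"]) produces a single filter string that read_obs passes on together with the column list.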

When to Use It?

Use Cases

Build a reference atlas by aggregating cells from multiple tissues and studies into a unified analysis. Create a cell type composition tool that compares tissue cellularity across published datasets. Extract expression signatures for specific cell types to use as markers in new experiments.
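
As one possible starting point for the composition use case, the sketch below turns per-cell-type counts, such as the dictionary returned by get_cell_counts in the examples above, into fractions and compares two tissues. The function names composition and compare_composition are hypothetical and chosen here for illustration only.

```python
def composition(counts: dict) -> dict:
    """Convert raw cell counts per cell type into fractions of the total."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cell_type: n / total for cell_type, n in counts.items()}


def compare_composition(counts_a: dict, counts_b: dict) -> dict:
    """Return the fraction in A minus the fraction in B for every cell type.

    Cell types missing from one side are treated as a fraction of 0.0.
    """
    frac_a = composition(counts_a)
    frac_b = composition(counts_b)
    cell_types = sorted(set(frac_a) | set(frac_b))
    return {ct: frac_a.get(ct, 0.0) - frac_b.get(ct, 0.0)
            for ct in cell_types}
```

Passing the results of two get_cell_counts calls for different tissues gives a signed difference per cell type. Raw fractions ignore differences in how each study sampled its tissue, so treat the output as exploratory.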

Related Topics

Single-cell RNA sequencing, CELLxGENE platform, TileDB-SOMA storage, Scanpy analysis framework, and cell atlas construction.

Important Notes

Requirements

Python with the cellxgene-census package installed. Network access for streaming data from the census cloud storage. Sufficient memory for the query results you request. The tiledbsoma and pyarrow packages are required dependencies for streaming access.

Usage Recommendations

Do: pin census versions for reproducible analyses. Use value_filter to query only the cells you need rather than loading entire organisms. Close census connections when finished to release resources. Convert results to AnnData objects for seamless integration with Scanpy analysis workflows.

Don't: attempt to load the entire census into memory, which exceeds available RAM on most machines. Don't run analyses without version pinning, which may produce different results after census updates. Don't ignore the dataset_id column, which tracks the study each cell originated from.
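
One way to avoid holding a large result in memory is to process the chunks returned by obs.read one at a time instead of calling concat. The sketch below assumes that the reader yields Arrow tables whose columns expose to_pylist; count_column_values and stream_cell_type_counts are hypothetical helper names, not Census functions.

```python
from collections import Counter
from typing import Iterable


def count_column_values(chunks: Iterable, column: str) -> Counter:
    """Tally the values of one column across a stream of table chunks.

    Each chunk only needs to support chunk[column].to_pylist(), which
    Arrow tables do. Only one chunk is held in memory at a time.
    """
    counts = Counter()
    for chunk in chunks:
        counts.update(chunk[column].to_pylist())
    return counts


def stream_cell_type_counts(organism: str, value_filter: str,
                            version: str = "2024-07-01") -> Counter:
    """Count cell types for a query without concatenating the result."""
    import cellxgene_census  # imported lazily; needs network access to run

    with cellxgene_census.open_soma(census_version=version) as census:
        obs = census["census_data"][organism].obs
        reader = obs.read(value_filter=value_filter,
                          column_names=["cell_type"])
        # Consume the reader while the census is still open.
        return count_column_values(reader, "cell_type")
```

Peak memory then scales with the chunk size and the number of distinct values rather than with the number of matching cells.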

Limitations

Census data reflects the curation state at each release and may lag behind the latest publications. Very large queries may require machines with substantial RAM or streaming processing approaches. Cell type annotations are standardized but may not match all analysis conventions. Query performance depends on network bandwidth when streaming large expression matrices from cloud storage. Some rare cell types may have limited representation depending on the studies included in each census release.
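
Because each release is a fixed snapshot, it can help to check which releases exist before pinning one. The sketch below uses cellxgene_census.get_census_version_directory and assumes it returns a dictionary keyed by release name, with aliases such as "stable" and "latest" alongside dated names. dated_releases and list_census_releases are hypothetical helpers introduced for illustration.

```python
import re

# Release names such as "2024-07-01"; aliases like "stable" do not match.
_DATED_NAME = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def dated_releases(directory: dict) -> list:
    """Return the dated release names from a version directory, oldest first."""
    return sorted(name for name in directory if _DATED_NAME.match(name))


def list_census_releases() -> list:
    """Fetch the release directory and keep only dated release names."""
    import cellxgene_census  # imported lazily; needs network access to run

    return dated_releases(cellxgene_census.get_census_version_directory())
```

Pinning one of the dated names, rather than an alias, keeps a script tied to the same snapshot even after new releases are published.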