Imaging Data Commons

Query, select, and download public cancer imaging datasets from the NCI Imaging Data Commons cloud platform

Imaging Data Commons is a community skill for accessing the NCI Imaging Data Commons platform, covering DICOM data queries, cohort selection, image retrieval, metadata analysis, and integration with medical imaging research workflows.

What Is This?

Overview

This skill provides patterns for accessing cancer imaging data from the NCI Imaging Data Commons (IDC) cloud platform. It covers BigQuery-based metadata queries for exploring available imaging collections, cohort selection by modality, body part, and diagnosis criteria, DICOM image retrieval from cloud storage buckets, series-level metadata analysis for understanding imaging parameters, and integration with analysis tools for building radiology research pipelines. With these patterns, medical imaging researchers can discover and access large-scale cancer imaging datasets for AI model training and clinical research.

Who Should Use This

This skill serves medical imaging researchers building AI models from public cancer imaging data, radiologists studying imaging biomarkers across large patient cohorts, and developers creating tools that consume DICOM data from cloud repositories.

Why Use It?

Problems It Solves

Cancer imaging datasets are distributed across multiple repositories with different access methods. Discovering which imaging data exists for specific cancer types and modalities requires searching multiple catalogs. Downloading DICOM files from cloud storage needs efficient transfer handling for large imaging datasets. Filtering imaging series by technical parameters requires structured metadata queries.

Core Highlights

BigQuery interface queries IDC metadata tables for cohort discovery and selection. Collection browser lists available imaging datasets with modality and body part information. DICOM retriever downloads selected series from Google Cloud Storage. Metadata analyzer extracts imaging parameters for quality assessment and filtering.

How to Use It?

Basic Usage

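from google.cloud import bigquery

# Public BigQuery table with one row per DICOM instance
# in the current IDC release.
DICOM_ALL = "bigquery-public-data.idc_current.dicom_all"

class IDCClient:
    def __init__(self, project_id: str):
        # project_id is the Google Cloud project billed for queries.
        self.bq = bigquery.Client(
            project=project_id)

    def _run(self, query: str,
             params: list | None = None) -> list[dict]:
        # Run a query and return plain dicts, one per result row.
        config = bigquery.QueryJobConfig(
            query_parameters=params or [])
        rows = self.bq.query(
            query, job_config=config).result()
        return [dict(row.items()) for row in rows]

    def list_collections(self) -> list[dict]:
        # The 20 collections with the most patients.
        query = f"""
            SELECT collection_id,
                   COUNT(DISTINCT PatientID) AS patients,
                   COUNT(DISTINCT SeriesInstanceUID) AS series
            FROM `{DICOM_ALL}`
            GROUP BY collection_id
            ORDER BY patients DESC
            LIMIT 20
        """
        return self._run(query)

    def find_series(self, modality: str,
                     body_part: str = "",
                     limit: int = 100
                     ) -> list[dict]:
        # Bind caller input as query parameters instead of
        # formatting it into the SQL string.
        where = "WHERE Modality = @modality"
        params = [bigquery.ScalarQueryParameter(
            "modality", "STRING", modality)]
        if body_part:
            where += " AND BodyPartExamined = @body_part"
            params.append(bigquery.ScalarQueryParameter(
                "body_part", "STRING", body_part))
        # Group the per-instance rows so each series appears once,
        # and count its instances (slices) as instance_count.
        query = f"""
            SELECT SeriesInstanceUID,
                   ANY_VALUE(collection_id) AS collection_id,
                   ANY_VALUE(PatientID) AS PatientID,
                   ANY_VALUE(Modality) AS Modality,
                   ANY_VALUE(BodyPartExamined) AS BodyPartExamined,
                   COUNT(*) AS instance_count
            FROM `{DICOM_ALL}`
            {where}
            GROUP BY SeriesInstanceUID
            LIMIT {int(limit)}
        """
        return self._run(query, params)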

Real-World Examples

import subprocess

# Assumes the IDCClient class from Basic Usage is defined
# in the same module.

class CohortBuilder:
    def __init__(self, client: IDCClient):
        self.client = client

    def build_cohort(
            self, modality: str,
            body_part: str,
            min_slices: int = 50
            ) -> list[dict]:
        series = self.client.find_series(
            modality, body_part, limit=500)
        # Relies on find_series returning an instance_count for
        # each series; rows without one count as 0 and are dropped.
        return [s for s in series
                if s.get("instance_count", 0)
                >= min_slices]

    def cohort_summary(
            self, cohort: list[dict]) -> dict:
        collections = set(
            s["collection_id"] for s in cohort)
        patients = set(
            s["PatientID"] for s in cohort)
        return {"series_count": len(cohort),
                "patient_count": len(patients),
                "collection_count": len(collections)}

    def download_series(
            self, series_uid: str,
            output_dir: str) -> str:
        # The idc command ships with the idc-index package.
        # Argument names can differ between versions, so confirm
        # them with `idc download --help` before relying on this.
        cmd = ["idc", "download", series_uid,
               "--download-dir", output_dir]
        subprocess.run(cmd, check=True)
        return output_dir

# "my-gcp-project" is a placeholder for your own billing project ID.
idc = IDCClient("my-gcp-project")
builder = CohortBuilder(idc)
cohort = builder.build_cohort("CT", "CHEST")
summary = builder.cohort_summary(cohort)
print(f"Cohort: {summary}")
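
A cohort-wide summary can hide the fact that one collection dominates the cohort. As a minimal sketch, the helper below breaks the counts down per collection. It assumes each row carries the collection_id and PatientID keys that find_series returns. The function name summarize_by_collection and the sample rows are made up for illustration and are not part of IDC or any library.

```python
from collections import defaultdict


def summarize_by_collection(cohort: list[dict]) -> dict[str, dict[str, int]]:
    """Count series and distinct patients for each collection_id."""
    series_counts: dict[str, int] = defaultdict(int)
    patients: dict[str, set] = defaultdict(set)
    for row in cohort:
        cid = row["collection_id"]
        series_counts[cid] += 1
        patients[cid].add(row["PatientID"])
    # Sort by collection name so the output order is stable.
    return {
        cid: {"series": series_counts[cid], "patients": len(patients[cid])}
        for cid in sorted(series_counts)
    }


# Made-up rows standing in for find_series() output.
sample = [
    {"collection_id": "collection_a", "PatientID": "P1", "SeriesInstanceUID": "1.2.3"},
    {"collection_id": "collection_a", "PatientID": "P1", "SeriesInstanceUID": "1.2.4"},
    {"collection_id": "collection_b", "PatientID": "P2", "SeriesInstanceUID": "1.2.5"},
]
print(summarize_by_collection(sample))
```

A per-collection view like this makes it easier to decide whether to rebalance the cohort before training.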

Advanced Tips

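Limit BigQuery costs when exploring large metadata tables: run a dry run first to see how many bytes a query would scan, and set a maximum-bytes-billed cap so an unexpectedly large query fails instead of being charged. Filter on the DICOM SeriesDescription attribute to select specific imaging protocols within a modality. Combine IDC queries with clinical annotations from TCIA for multimodal research datasets.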
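
One way to apply cost controls is sketched below, assuming client is a bigquery.Client from the google-cloud-bigquery package. The dry_run and maximum_bytes_billed options are settings on QueryJobConfig. The helper names estimate_scan_bytes, run_capped, and human_bytes are hypothetical and chosen only for this example.

```python
def estimate_scan_bytes(client, sql: str) -> int:
    """Return the bytes a query would scan, without running it (dry run)."""
    # Imported lazily so the pure helper below works without the package.
    from google.cloud import bigquery

    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=config)
    return job.total_bytes_processed


def run_capped(client, sql: str, max_bytes: int):
    """Run a query that fails instead of billing more than max_bytes."""
    from google.cloud import bigquery

    config = bigquery.QueryJobConfig(maximum_bytes_billed=max_bytes)
    return client.query(sql, job_config=config).result()


def human_bytes(n: int) -> str:
    """Format a byte count using binary units, for logging estimates."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    size = float(n)
    for unit in units:
        if size < 1024 or unit == units[-1]:
            return f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} {units[-1]}"  # not reached; keeps type checkers happy
```

A typical flow would call estimate_scan_bytes first, log human_bytes of the result, and only then call run_capped with a limit slightly above the estimate.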

When to Use It?

Use Cases

Build a cohort selection pipeline that assembles CT imaging datasets for training lung nodule detection models. Create a metadata explorer that summarizes available imaging data by cancer type and modality. Implement a bulk download tool that retrieves DICOM series for offline analysis.
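
For the bulk download use case, it can help to plan the work before any transfer starts. The sketch below removes duplicate series UIDs and splits the rest into fixed-size batches, so a failed batch can be retried without restarting the whole job. It assumes cohort rows carry a SeriesInstanceUID key. The function names and the default batch size of 50 are illustrative choices, not IDC conventions.

```python
from typing import Iterable, Iterator


def unique_in_order(uids: Iterable[str]) -> list[str]:
    """Drop duplicate UIDs while keeping first-seen order."""
    return list(dict.fromkeys(uids))


def batches(uids: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most batch_size UIDs."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(uids), batch_size):
        yield uids[start:start + batch_size]


def plan_downloads(cohort: list[dict], batch_size: int = 50) -> list[list[str]]:
    """Turn cohort rows into de-duplicated batches of series UIDs."""
    uids = unique_in_order(row["SeriesInstanceUID"] for row in cohort)
    return list(batches(uids, batch_size))
```

Each batch can then be handed to whatever download routine is in use, with completed batches recorded so that a rerun skips them.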

Related Topics

Medical imaging, DICOM format, cancer research data, NCI data commons, and radiology AI development.

Important Notes

Requirements

Google Cloud account with BigQuery access for metadata queries. Python with the google-cloud-bigquery package installed. The idc CLI tool, which is installed with the idc-index Python package, for DICOM file downloads.

Usage Recommendations

Do: use BigQuery metadata exploration before downloading images to understand dataset scope and relevance. Apply DICOM metadata filters to select series with appropriate imaging parameters. Cite the IDC and source collections in publications.

Don't: download entire collections without filtering, which consumes excessive storage and bandwidth. Assume all series in a collection use identical imaging protocols. Ignore patient privacy considerations when working with medical imaging data.
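
The metadata filtering recommended above can also run client-side on rows that have already been fetched. The sketch below is one possible filter, assuming each row is a dict that includes the DICOM attributes SliceThickness and SeriesDescription (the queries shown earlier would need to select them). The function name matches_protocol and the thresholds used with it are hypothetical.

```python
def matches_protocol(row: dict,
                     max_slice_thickness_mm: float,
                     keywords: tuple[str, ...] = ()) -> bool:
    """Keep a series only if it is thin enough and its description matches."""
    try:
        thickness_mm = float(row.get("SliceThickness"))
    except (TypeError, ValueError):
        # Missing or non-numeric thickness: exclude rather than guess.
        return False
    if thickness_mm > max_slice_thickness_mm:
        return False
    description = (row.get("SeriesDescription") or "").lower()
    # Every keyword must appear; an empty keyword tuple matches anything.
    return all(k.lower() in description for k in keywords)


def filter_series(series: list[dict],
                  max_slice_thickness_mm: float,
                  keywords: tuple[str, ...] = ()) -> list[dict]:
    """Apply matches_protocol to every row and keep the ones that pass."""
    return [s for s in series
            if matches_protocol(s, max_slice_thickness_mm, keywords)]
```

Excluding rows with missing values is a deliberately conservative choice. A looser pipeline might keep them and flag them for manual review instead.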

Limitations

BigQuery queries incur Google Cloud costs for data scanned. Image download speeds depend on cloud storage egress bandwidth. Some collections have restricted access that requires data use agreements beyond basic authentication.