Imaging Data Commons
Access and manage medical imaging datasets through automated Imaging Data Commons cloud integration
Imaging Data Commons is a community skill for accessing the NCI Imaging Data Commons platform, covering DICOM data queries, cohort selection, image retrieval, metadata analysis, and integration with medical imaging research workflows.
What Is This?
Overview
Imaging Data Commons provides patterns for accessing cancer imaging data from the NCI Imaging Data Commons (IDC) cloud platform. It covers BigQuery-based metadata queries for exploring available imaging collections, cohort selection by modality, body part, and diagnosis criteria, DICOM image retrieval from cloud storage buckets, series-level metadata analysis for understanding imaging parameters, and integration with analysis tools for building radiology research pipelines. The skill enables medical imaging researchers to discover and access large-scale cancer imaging datasets for AI model training and clinical research.
Who Should Use This
This skill serves medical imaging researchers building AI models from public cancer imaging data, radiologists studying imaging biomarkers across large patient cohorts, and developers creating tools that consume DICOM data from cloud repositories.
Why Use It?
Problems It Solves
Cancer imaging datasets are distributed across multiple repositories with different access methods. Discovering which imaging data exists for specific cancer types and modalities requires searching multiple catalogs. Downloading DICOM files from cloud storage needs efficient transfer handling for large imaging datasets. Filtering imaging series by technical parameters requires structured metadata queries.
Core Highlights
A BigQuery interface queries IDC metadata tables for cohort discovery and selection. A collection browser lists available imaging datasets with modality and body part information. A DICOM retriever downloads selected series from Google Cloud Storage. A metadata analyzer extracts imaging parameters for quality assessment and filtering.
How to Use It?
Basic Usage
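from google.cloud import bigquery

# Public IDC metadata table; it holds one row per DICOM instance.
DICOM_ALL = "bigquery-public-data.idc_current.dicom_all"


class IDCClient:
    def __init__(self, project_id: str):
        # project_id is the Google Cloud project billed for the queries.
        self.bq = bigquery.Client(project=project_id)

    def list_collections(self) -> list[dict]:
        # The 20 largest collections by patient count.
        query = f"""
            SELECT collection_id,
                   COUNT(DISTINCT PatientID) AS patients,
                   COUNT(DISTINCT SeriesInstanceUID) AS series
            FROM `{DICOM_ALL}`
            GROUP BY collection_id
            ORDER BY patients DESC
            LIMIT 20
        """
        result = self.bq.query(query).to_dataframe()
        return result.to_dict(orient="records")

    def find_series(self, modality: str,
                    body_part: str = "",
                    limit: int = 100) -> list[dict]:
        # Bind user input as query parameters instead of formatting it
        # into the SQL string, which would allow SQL injection.
        where = "WHERE Modality = @modality"
        params = [bigquery.ScalarQueryParameter(
            "modality", "STRING", modality)]
        if body_part:
            where += " AND BodyPartExamined = @body_part"
            params.append(bigquery.ScalarQueryParameter(
                "body_part", "STRING", body_part))
        # Group by series so each series appears once, and count its
        # instances so callers can filter on instance_count.
        query = f"""
            SELECT SeriesInstanceUID,
                   ANY_VALUE(collection_id) AS collection_id,
                   ANY_VALUE(PatientID) AS PatientID,
                   ANY_VALUE(Modality) AS Modality,
                   ANY_VALUE(BodyPartExamined) AS BodyPartExamined,
                   COUNT(*) AS instance_count
            FROM `{DICOM_ALL}`
            {where}
            GROUP BY SeriesInstanceUID
            LIMIT {int(limit)}
        """
        job_config = bigquery.QueryJobConfig(query_parameters=params)
        result = self.bq.query(
            query, job_config=job_config).to_dataframe()
        return result.to_dict(orient="records")
Real-World Examples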
import subprocess
from pathlib import Path


class CohortBuilder:
    def __init__(self, client: IDCClient):
        self.client = client

    def build_cohort(self, modality: str,
                     body_part: str,
                     min_slices: int = 50) -> list[dict]:
        series = self.client.find_series(
            modality, body_part, limit=500)
        # Keep series with at least min_slices instances; a record
        # without an instance_count field counts as 0 and is dropped.
        return [s for s in series
                if s.get("instance_count", 0) >= min_slices]

    def cohort_summary(self, cohort: list[dict]) -> dict:
        collections = {s["collection_id"] for s in cohort}
        patients = {s["PatientID"] for s in cohort}
        return {"series_count": len(cohort),
                "patient_count": len(patients),
                "collection_count": len(collections)}

    def download_series(self, series_uid: str,
                        output_dir: str) -> str:
        Path(output_dir).mkdir(parents=True, exist_ok=True)
        # Command and flag names follow the idc-index CLI; confirm them
        # with `idc download-from-selection --help` for your version.
        cmd = ["idc", "download-from-selection",
               "--series-instance-uid", series_uid,
               "--download-dir", output_dir]
        subprocess.run(cmd, check=True)
        return output_dir


# Placeholder project ID; replace with your own Google Cloud project.
idc = IDCClient("your-gcp-project-id")
builder = CohortBuilder(idc)
cohort = builder.build_cohort("CT", "CHEST")
summary = builder.cohort_summary(cohort)
print(f"Cohort: {summary}")
Advanced Tips
Use BigQuery cost controls, such as a dry run to estimate the bytes a query will scan or a maximum-bytes-billed cap on the job, to limit costs when exploring large metadata tables. Filter on the DICOM SeriesDescription attribute to select specific imaging protocols within a modality. Combine IDC queries with clinical annotations from TCIA to build multimodal research datasets.
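As a rough sketch of the cost-control tip, the helper below dry-runs a query to estimate the bytes it would scan, then runs it with a billing cap. The names bytes_cap and run_capped_query and the 1 GiB default are illustrative assumptions, not part of the skill; dry_run and maximum_bytes_billed are standard QueryJobConfig options.

```python
GIB = 1024 ** 3


def bytes_cap(max_gib: float) -> int:
    """Convert a GiB budget into the byte count BigQuery expects."""
    return int(max_gib * GIB)


def run_capped_query(sql: str, project_id: str,
                     max_gib: float = 1.0) -> list[dict]:
    """Hypothetical helper: estimate a query's scan size, then run it capped."""
    # Imported lazily so bytes_cap can be used without the package installed.
    from google.cloud import bigquery

    client = bigquery.Client(project=project_id)

    # A dry run reports the bytes the query would scan without running it.
    dry_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    estimate = client.query(sql, job_config=dry_config).total_bytes_processed
    if estimate > bytes_cap(max_gib):
        raise ValueError(
            f"Query would scan {estimate / GIB:.2f} GiB, "
            f"above the {max_gib} GiB budget")

    # maximum_bytes_billed makes BigQuery fail the job rather than bill
    # beyond the cap, as a second line of defence.
    run_config = bigquery.QueryJobConfig(
        maximum_bytes_billed=bytes_cap(max_gib))
    rows = client.query(sql, job_config=run_config).result()
    return [dict(row.items()) for row in rows]
```

Selecting only the columns you need also helps, because BigQuery bills by the columns a query reads rather than by the number of rows returned.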
When to Use It?
Use Cases
Build a cohort selection pipeline that assembles CT imaging datasets for training lung nodule detection models. Create a metadata explorer that summarizes available imaging data by cancer type and modality. Implement a bulk download tool that retrieves DICOM series for offline analysis.
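The metadata explorer use case can be sketched in plain Python over series records shaped like the dictionaries returned by find_series. The function name summarize_by and the sample records are made up for illustration; real records would come from a BigQuery query.

```python
from collections import defaultdict


def summarize_by(records: list[dict], key: str) -> dict:
    """Count distinct series and patients for each value of `key`."""
    groups = defaultdict(lambda: {"series": set(), "patients": set()})
    for record in records:
        # Missing or empty values are grouped under "UNKNOWN".
        group = groups[record.get(key) or "UNKNOWN"]
        group["series"].add(record["SeriesInstanceUID"])
        group["patients"].add(record["PatientID"])
    return {
        value: {"series_count": len(group["series"]),
                "patient_count": len(group["patients"])}
        for value, group in groups.items()
    }


# Made-up records for illustration only.
sample = [
    {"SeriesInstanceUID": "1.2.3", "PatientID": "P1", "Modality": "CT"},
    {"SeriesInstanceUID": "1.2.4", "PatientID": "P1", "Modality": "CT"},
    {"SeriesInstanceUID": "1.2.5", "PatientID": "P2", "Modality": "MR"},
]
print(summarize_by(sample, "Modality"))
```

The same function works for any field present in the records, for example "collection_id" or "BodyPartExamined".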
Related Topics
Medical imaging, DICOM format, cancer research data, NCI data commons, and radiology AI development.
Important Notes
Requirements
Google Cloud account with BigQuery access for metadata queries. Python with the google-cloud-bigquery package installed, plus pandas for the to_dataframe() calls. The idc CLI tool (installed with the idc-index package) for DICOM file downloads.
Usage Recommendations
Do: use BigQuery metadata exploration before downloading images to understand dataset scope and relevance. Apply DICOM metadata filters to select series with appropriate imaging parameters. Cite the IDC and source collections in publications.
Don't: download entire collections without filtering, which consumes excessive storage and bandwidth. Assume all series in a collection use identical imaging protocols. Ignore patient privacy considerations when working with medical imaging data.
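One way to avoid oversized downloads is to check series sizes against a storage budget first. The sketch below assumes that dicom_all exposes a per-instance instance_size column in bytes; verify this against the current IDC schema. The names fetch_series_sizes and select_within_budget are hypothetical.

```python
GIB = 1024 ** 3

# Assumption: `instance_size` holds each instance's size in bytes.
SIZE_SQL = """
SELECT SeriesInstanceUID, SUM(instance_size) AS series_bytes
FROM `bigquery-public-data.idc_current.dicom_all`
WHERE SeriesInstanceUID IN UNNEST(@uids)
GROUP BY SeriesInstanceUID
"""


def fetch_series_sizes(project_id: str,
                       series_uids: list[str]) -> dict[str, int]:
    """Look up the total size in bytes of each requested series."""
    # Imported lazily so the budget logic below works offline.
    from google.cloud import bigquery

    client = bigquery.Client(project=project_id)
    config = bigquery.QueryJobConfig(query_parameters=[
        bigquery.ArrayQueryParameter("uids", "STRING", series_uids)])
    rows = client.query(SIZE_SQL, job_config=config).result()
    return {row["SeriesInstanceUID"]: int(row["series_bytes"])
            for row in rows}


def select_within_budget(sizes: dict[str, int],
                         max_gib: float) -> list[str]:
    """Greedily keep series, smallest first, until the budget is used up."""
    budget = int(max_gib * GIB)
    chosen, used = [], 0
    for uid, size in sorted(sizes.items(), key=lambda item: item[1]):
        if used + size > budget:
            break
        chosen.append(uid)
        used += size
    return chosen
```

Smallest-first selection maximizes the number of series that fit in the budget; a research pipeline might prefer a different order, such as sampling evenly across patients.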
Limitations
BigQuery queries incur Google Cloud costs for data scanned. Image download speeds depend on cloud storage egress bandwidth. Some collections have restricted access that requires data use agreements beyond basic authentication.