DNAnexus Integration
Automate and integrate DNAnexus for scalable genomic data processing pipelines
DNAnexus Integration is a community skill for building bioinformatics workflows on the DNAnexus cloud platform, covering job execution, data management, app development, project organization, and pipeline orchestration for genomics research.
What Is This?
Overview
DNAnexus Integration provides patterns for programmatically interacting with the DNAnexus platform for genomics data analysis. It covers authentication and project management through the platform API, data upload and download operations for sequencing files and results, app and workflow execution with parameter configuration, job monitoring and status tracking for long-running analyses, and workflow chaining that connects analysis steps into reproducible pipelines. The skill enables bioinformaticians to automate genomics workflows on a secure cloud platform designed for regulated research environments, including those subject to HIPAA or GxP compliance requirements.
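For the authentication step specifically, a minimal sketch looks like the following; the DX_API_TOKEN variable name is illustrative (any environment variable, or a dx login session, works), and the project ID is a placeholder.

import os
import dxpy

# Illustrative env var name; avoids embedding the token in the script
dxpy.set_security_context({
    "auth_token_type": "Bearer",
    "auth_token": os.environ["DX_API_TOKEN"]})

# Select a project and confirm access by describing it
dxpy.set_workspace_id("project-XXXX")
print(dxpy.DXProject("project-XXXX").describe()["name"])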
Who Should Use This
This skill serves bioinformaticians automating genomics pipelines on the DNAnexus platform, clinical genomics teams running validated analysis workflows in regulated environments, and developers building custom apps for the DNAnexus ecosystem.
Why Use It?
Problems It Solves
Running genomics analyses manually through the DNAnexus web interface does not scale to large sample batches. Tracking job status and collecting outputs across many concurrent analyses requires automation. Uploading and organizing large sequencing datasets needs efficient transfer with metadata management. Chaining analysis steps into reproducible pipelines demands programmatic workflow construction, particularly when coordinating tools such as BWA, GATK, and downstream variant annotation stages.
Core Highlights
Python SDK wraps the DNAnexus API for authentication, project access, and data operations. Job runner launches analysis apps with configured parameters and monitors execution status. File manager handles upload, download, and organization of genomics data. Workflow builder chains analysis stages into executable pipelines.
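As a sketch of the workflow-builder pattern, dxpy can assemble app stages into a reusable workflow; the app names, the sorted_bam output field, and all file IDs below are placeholders rather than guaranteed names.

import dxpy

# Create an empty workflow in the current project
workflow = dxpy.new_dxworkflow(title="align_and_call")

# Chain alignment into variant calling by referencing a stage output
align = workflow.add_stage(
    dxpy.DXApp(name="bwa_mem"),
    stage_input={"reference": dxpy.dxlink("file-REF")})
workflow.add_stage(
    dxpy.DXApp(name="gatk_haplotypecaller"),
    stage_input={"bam": {"$dnanexus_link": {
        "stage": align, "outputField": "sorted_bam"}}})

# Run the assembled workflow, supplying per-run inputs by stage index
analysis = workflow.run({"0.reads": dxpy.dxlink("file-A")})
print(f"Analysis: {analysis.get_id()}")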
How to Use It?
Basic Usage
import dxpy

# Select the project to work in
dxpy.set_workspace_id("project-XXXX")

# Find FASTQ files in the raw-data folder by glob pattern
files = list(dxpy.find_data_objects(
    classname="file",
    folder="/raw_data",
    name="*.fastq.gz",
    name_mode="glob"))
print(f"Files found: {len(files)}")

# Upload a local file and wait until it is closed on the platform
uploaded = dxpy.upload_local_file(
    "sample.fastq.gz",
    folder="/raw_data",
    wait_on_close=True)
print(f"Uploaded: {uploaded.get_id()}")

# Launch the alignment app on the first file and wait for completion
job = dxpy.DXApp(name="bwa_mem").run({
    "reads": dxpy.dxlink(files[0]["id"]),
    "reference": dxpy.dxlink("file-YYYY")})
print(f"Job: {job.get_id()}")
job.wait_on_done()
output = job.describe()["output"]
print(f"Output: {output}")

Real-World Examples
import dxpy

class BatchRunner:
    """Submits one app run per sample and collects the results."""

    def __init__(self, project_id: str, app_name: str):
        dxpy.set_workspace_id(project_id)
        self.app = dxpy.DXApp(name=app_name)

    def submit_batch(self, samples: list[dict]) -> list[str]:
        # Launch one job per sample and record its ID
        job_ids = []
        for sample in samples:
            job = self.app.run(sample)
            job_ids.append(job.get_id())
            print(f"Submitted: {job.get_id()}")
        return job_ids

    def collect_results(self, job_ids: list[str]) -> list[dict]:
        # Block on each job, then gather its final state and outputs
        results = []
        for jid in job_ids:
            job = dxpy.DXJob(jid)
            job.wait_on_done()
            desc = job.describe()
            results.append({
                "job": jid,
                "state": desc["state"],
                "output": desc.get("output", {})})
        return results

runner = BatchRunner("project-XXXX", "bwa_mem")
samples = [
    {"reads": dxpy.dxlink("file-A"),
     "reference": dxpy.dxlink("file-REF")},
    {"reads": dxpy.dxlink("file-B"),
     "reference": dxpy.dxlink("file-REF")}]
jobs = runner.submit_batch(samples)
results = runner.collect_results(jobs)

Advanced Tips
Use instance type specifications in job inputs to optimize compute cost for each analysis step, selecting memory-optimized instances for variant calling and CPU-optimized instances for alignment. Tag jobs with metadata to enable filtering and tracking across large batch submissions. Use dxpy.bindings for streaming file access when processing outputs without full download. Group related files with properties and tags for efficient batch queries across large projects.
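A brief sketch of the first three tips, assuming a memory-optimized instance type name from the platform catalog and placeholder app and file IDs:

import dxpy

# Pin an instance type and tag the job for later batch filtering
job = dxpy.DXApp(name="gatk_haplotypecaller").run(
    {"bam": dxpy.dxlink("file-BAM")},
    instance_type="mem3_ssd1_v2_x8",    # memory-optimized for variant calling
    tags=["batch-2024-06", "cohort-A"])

# Stream the start of a remote file without a full download
with dxpy.open_dxfile("file-VCF") as fh:
    header = fh.read(4096)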
When to Use It?
Use Cases
Build a batch alignment pipeline that processes hundreds of sequencing samples in parallel on DNAnexus. Create a data management tool that organizes project files with metadata tags and folder structures. Implement a job monitoring dashboard that tracks analysis progress across active projects.
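For the monitoring use case, a minimal sketch using dxpy's search helpers (the project ID is a placeholder):

import dxpy

# Tally job states across a project for a quick progress summary
counts: dict[str, int] = {}
for result in dxpy.find_jobs(project="project-XXXX", describe=True):
    state = result["describe"]["state"]
    counts[state] = counts.get(state, 0) + 1
print(counts)  # e.g. {'done': 120, 'running': 8, 'failed': 2}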
Related Topics
Cloud genomics platforms, bioinformatics pipeline automation, sequencing data management, clinical genomics, and reproducible research workflows.
Important Notes
Requirements
Python with the dxpy package installed. A DNAnexus account with project access and API tokens. Network access to the DNAnexus platform API. Familiarity with the DNAnexus data model for projects, files, and executions.
Usage Recommendations
Do: use project folders and tags to organize data systematically, monitor job costs and instance usage to optimize cloud spending, and set appropriate timeout values for long-running genomics analyses.
Don't: hard-code API tokens in scripts (use environment variables or login sessions instead), submit large batches without checking available compute quotas, or ignore job failure states that indicate input data or configuration issues.
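For the timeout recommendation, a minimal sketch, assuming DXJob.wait_on_done accepts a timeout in seconds and signals both job failures and expired waits via DXJobFailureError (the job ID is a placeholder):

import dxpy
from dxpy.exceptions import DXJobFailureError

job = dxpy.DXJob("job-ZZZZ")
try:
    # Bound the wait so a stuck analysis surfaces instead of hanging forever
    job.wait_on_done(timeout=6 * 3600)
except DXJobFailureError as exc:
    print(f"Job did not finish cleanly: {exc}")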
Limitations
Platform API rate limits restrict the frequency of status polling for large batch submissions. Data transfer speeds depend on network bandwidth between local storage and the DNAnexus platform. Custom app development requires familiarity with the DNAnexus app specification and Docker container packaging. Data egress costs apply when downloading results from the platform to local storage.
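To stay under the polling rate limits, status checks can be swept at a fixed interval rather than issued continuously per job; a minimal sketch:

import time
import dxpy

def poll_until_done(job_ids: list[str], interval: float = 60.0) -> dict[str, str]:
    # One describe() sweep per interval keeps API call volume predictable
    states: dict[str, str] = {}
    pending = set(job_ids)
    while pending:
        for jid in list(pending):
            states[jid] = dxpy.DXJob(jid).describe()["state"]
            if states[jid] in ("done", "failed", "terminated"):
                pending.discard(jid)
        if pending:
            time.sleep(interval)
    return states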