DNAnexus Integration

Automate and integrate DNAnexus for scalable genomic data processing pipelines

DNAnexus Integration is a community skill for building bioinformatics workflows on the DNAnexus cloud platform, covering job execution, data management, app development, project organization, and pipeline orchestration for genomics research.

What Is This?

Overview

DNAnexus Integration provides patterns for programmatically interacting with the DNAnexus platform for genomics data analysis. It covers authentication and project management through the platform API, data upload and download operations for sequencing files and results, app and workflow execution with parameter configuration, job monitoring and status tracking for long-running analyses, and workflow chaining that connects analysis steps into reproducible pipelines. The skill enables bioinformaticians to automate genomics workflows on a secure cloud platform designed for regulated research environments, including those subject to HIPAA or GxP compliance requirements.

Who Should Use This

This skill serves bioinformaticians automating genomics pipelines on the DNAnexus platform, clinical genomics teams running validated analysis workflows in regulated environments, and developers building custom apps for the DNAnexus ecosystem.

Why Use It?

Problems It Solves

Running genomics analyses manually through the DNAnexus web interface does not scale to large sample batches. Tracking job status and collecting outputs across many concurrent analyses requires automation. Uploading and organizing large sequencing datasets needs efficient transfer with metadata management. Chaining analysis steps into reproducible pipelines demands programmatic workflow construction, particularly when coordinating tools such as BWA, GATK, and downstream variant annotation stages.

Core Highlights

Python SDK wraps the DNAnexus API for authentication, project access, and data operations. Job runner launches analysis apps with configured parameters and monitors execution status. File manager handles upload, download, and organization of genomics data. Workflow builder chains analysis stages into executable pipelines.

How to Use It?

Basic Usage

import dxpy

# Select the project that all subsequent calls operate on
dxpy.set_workspace_id("project-XXXX")

# Find FASTQ files in the raw data folder by glob pattern
files = list(dxpy.find_data_objects(
    classname="file",
    folder="/raw_data",
    name="*.fastq.gz",
    name_mode="glob"))
print(f"Files found: {len(files)}")

# Upload a local file, blocking until the platform closes it
uploaded = dxpy.upload_local_file(
    "sample.fastq.gz",
    folder="/raw_data",
    wait_on_close=True)
print(f"Uploaded: {uploaded.get_id()}")

# Launch the alignment app against the first file found
job = dxpy.DXApp(name="bwa_mem").run({
    "reads": dxpy.dxlink(files[0]["id"]),
    "reference": dxpy.dxlink("file-YYYY")})
print(f"Job: {job.get_id()}")

# Block until the job finishes, then read its outputs
job.wait_on_done()
output = job.describe()["output"]
print(f"Output: {output}")
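Outputs in a job's describe dict are dxlink values, so pulling results back locally means resolving each link to a file ID and calling dxpy.download_dxfile. A minimal sketch, assuming dxpy and valid platform credentials (the field-named local filenames are an illustrative choice, not a platform convention):

```python
import os

def dxlink_to_id(link):
    """Resolve a dxlink value (plain ID string or
    {"$dnanexus_link": ...} dict) to a file ID."""
    if isinstance(link, str):
        return link
    inner = link["$dnanexus_link"]
    # The extended dxlink form nests the ID under "id"
    return inner["id"] if isinstance(inner, dict) else inner

def download_outputs(output, dest="."):
    """Download every file-valued output field to dest."""
    import dxpy  # requires dxpy and platform credentials
    for field, link in output.items():
        dxpy.download_dxfile(dxlink_to_id(link),
                             os.path.join(dest, field))
```

The link-resolution helper is pure, so it can be reused when post-processing describe output from any execution.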

Real-World Examples

import dxpy

class BatchRunner:
    """Submit one app across many samples and gather results."""

    def __init__(self, project_id: str,
                 app_name: str):
        dxpy.set_workspace_id(project_id)
        self.app = dxpy.DXApp(name=app_name)

    def submit_batch(self, samples: list[dict]
                     ) -> list[str]:
        """Launch one job per sample dict; return the job IDs."""
        job_ids = []
        for sample in samples:
            job = self.app.run(sample)
            job_ids.append(job.get_id())
            print(f"Submitted: {job.get_id()}")
        return job_ids

    def collect_results(self, job_ids: list[str]
                        ) -> list[dict]:
        """Block on each job and record its state and outputs."""
        results = []
        for jid in job_ids:
            job = dxpy.DXJob(jid)
            job.wait_on_done()
            desc = job.describe()
            results.append({
                "job": jid,
                "state": desc["state"],
                "output": desc.get("output", {})})
        return results

runner = BatchRunner("project-XXXX", "bwa_mem")
samples = [
    {"reads": dxpy.dxlink("file-A"),
     "reference": dxpy.dxlink("file-REF")},
    {"reads": dxpy.dxlink("file-B"),
     "reference": dxpy.dxlink("file-REF")}]
jobs = runner.submit_batch(samples)
results = runner.collect_results(jobs)

Advanced Tips

Use instance type specifications in job inputs to optimize compute cost for each analysis step, selecting memory-optimized instances for variant calling and CPU-optimized instances for alignment. Tag jobs with metadata to enable filtering and tracking across large batch submissions. Use dxpy.bindings for streaming file access when processing outputs without full download. Group related files with properties and tags for efficient batch queries across large projects.
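The instance-type and tagging advice above can be centralized in a per-step settings table passed through run()'s instance_type and tags keywords. A sketch; the step names, tag values, and instance type names are illustrative and should match what your region and organization actually offer:

```python
# Per-step compute settings; instance type names are illustrative.
STEP_SETTINGS = {
    "alignment": {"instance_type": "mem1_ssd1_v2_x16",
                  "tags": ["batch-2024", "alignment"]},
    "variant_calling": {"instance_type": "mem3_ssd1_v2_x8",
                        "tags": ["batch-2024", "calling"]},
}

def settings_for(step):
    """Look up compute settings, failing loudly on unknown steps."""
    if step not in STEP_SETTINGS:
        raise ValueError(f"unknown pipeline step: {step}")
    return STEP_SETTINGS[step]

def run_step(app_name, inputs, step):
    """Launch an app with the settings for one pipeline step."""
    import dxpy  # requires dxpy and platform credentials
    settings = settings_for(step)
    return dxpy.DXApp(name=app_name).run(
        inputs,
        instance_type=settings["instance_type"],
        tags=settings["tags"])
```

Keeping the table separate from the launch code makes cost tuning a one-file change and lets the tags drive the batch filtering described above.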

When to Use It?

Use Cases

Build a batch alignment pipeline that processes hundreds of sequencing samples in parallel on DNAnexus. Create a data management tool that organizes project files with metadata tags and folder structures. Implement a job monitoring dashboard that tracks analysis progress across active projects.
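For the monitoring-dashboard use case, dxpy.find_jobs can feed a simple state tally; the tallying itself is pure, so it is split out here. A sketch assuming dxpy and platform credentials (the project ID is a placeholder):

```python
from collections import Counter

def summarize_states(describes):
    """Tally job states from a list of describe dicts."""
    return Counter(d["state"] for d in describes)

def project_job_summary(project_id):
    """Fetch jobs in a project and tally their states."""
    import dxpy  # requires dxpy and platform credentials
    describes = [j["describe"] for j in dxpy.find_jobs(
        project=project_id, describe=True)]
    return summarize_states(describes)
```

Requesting describe=True in the find call avoids a separate describe round trip per job, which matters when a dashboard refreshes across many active projects.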

Related Topics

Cloud genomics platforms, bioinformatics pipeline automation, sequencing data management, clinical genomics, and reproducible research workflows.

Important Notes

Requirements

Python with the dxpy package installed. A DNAnexus account with project access and API tokens. Network access to the DNAnexus platform API. Familiarity with the DNAnexus data model for projects, files, and executions.

Usage Recommendations

Do: use project folders and tags to organize data systematically. Monitor job costs and instance usage to optimize cloud spending. Set appropriate timeout values for long-running genomics analyses.

Don't: hard-code API tokens in scripts; use environment variables or login sessions instead. Submit large batches without checking available compute quotas. Ignore job failure states that indicate input data or configuration issues.
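One way to keep tokens out of scripts is to read them from the environment and hand them to dxpy.set_security_context. A minimal sketch; the DX_API_TOKEN variable name is an assumption for illustration (dx login normally manages credentials itself):

```python
import os

def security_context_from_env(var="DX_API_TOKEN"):
    """Build a dxpy security context from an environment variable
    rather than a token embedded in the script."""
    token = os.environ.get(var)
    if not token:
        raise RuntimeError(f"set {var} before running this pipeline")
    return {"auth_token_type": "Bearer", "auth_token": token}

def authenticate(var="DX_API_TOKEN"):
    """Point dxpy at the token from the environment."""
    import dxpy  # requires dxpy installed
    dxpy.set_security_context(security_context_from_env(var))
```

Failing fast when the variable is unset gives a clearer error than an authentication failure deep inside a batch submission.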

Limitations

Platform API rate limits restrict the frequency of status polling for large batch submissions. Data transfer speeds depend on network bandwidth between local storage and the DNAnexus platform. Custom app development requires familiarity with the DNAnexus app specification and Docker container packaging. Data egress costs apply when downloading results from the platform to local storage.
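To stay under the polling rate limits noted above, batch status sweeps can use a growing delay instead of a tight loop. A sketch assuming dxpy and platform credentials; the delay schedule is a plain geometric backoff, separated out so it can be tuned and tested on its own:

```python
import time

def backoff_delays(base=5.0, cap=120.0):
    """Yield geometric polling delays: base, 2*base, ... capped."""
    delay = base
    while True:
        yield delay
        delay = min(cap, delay * 2)

def wait_for_jobs(job_ids, base=5.0, cap=120.0):
    """Poll a batch with increasing delays between sweeps."""
    import dxpy  # requires dxpy and platform credentials
    pending = set(job_ids)
    for delay in backoff_delays(base, cap):
        states = {jid: dxpy.DXJob(jid).describe()["state"]
                  for jid in pending}
        # Drop jobs that reached a terminal state
        pending = {j for j, s in states.items()
                   if s not in ("done", "failed", "terminated")}
        if not pending:
            return
        time.sleep(delay)
```

One describe sweep per delay step keeps request volume roughly constant per batch, regardless of how long the slowest job runs.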