Lamindb

Manage biological data and computational workflows with automated LaminDB database integration

Lamindb is a community skill for managing biological research data with LaminDB. It covers data ingestion, artifact tracking, dataset versioning, metadata annotation, and lineage querying for computational biology workflows.

What Is This?

Overview

Lamindb provides tools for organizing and tracking biological research data using the LaminDB framework. Data ingestion registers files, arrays, and dataframes as tracked artifacts with automatic format detection and checksum validation. Artifact tracking maintains provenance records linking data to source experiments, processing scripts, and pipeline runs. Dataset versioning manages data iterations with immutable snapshots and version history. Metadata annotation tags artifacts with biological ontology terms from CellTypist, Gene Ontology, and other standardized vocabularies. Lineage querying traces data transformations from raw inputs through processed outputs. The skill enables researchers to build reproducible data management workflows for biological research.

Who Should Use This

This skill serves bioinformatics researchers managing multi-omics datasets, computational biology teams building data pipelines, and research organizations establishing data governance for biological data assets. It is particularly valuable for teams working across multiple projects who need consistent, queryable records of how datasets were produced and transformed.

Why Use It?

Problems It Solves

Research data files scattered across storage systems lack provenance tracking, making reproduction impossible. Dataset versions managed through filename conventions become ambiguous over time. Biological metadata uses inconsistent terminology when ontology-based annotation is not enforced. Data lineage from raw sequencing reads to processed results requires manual documentation that quickly goes stale.

Core Highlights

Artifact registrar ingests data files with automatic format detection and integrity checking. Provenance tracker links artifacts to source experiments and processing runs. Version manager creates immutable snapshots with queryable version history. Ontology annotator tags data with standardized biological terms, enabling consistent cross-dataset search and comparison.

How to Use It?

Basic Usage

import lamindb as ln


class DataManager:
    def __init__(self):
        # Initialize a local LaminDB instance with file storage in ./data
        ln.setup.init(storage="./data")

    def register_file(
        self, path: str, description: str, key: str | None = None
    ) -> ln.Artifact:
        # Register a file as a tracked artifact; LaminDB hashes the
        # content and detects the format from the file suffix
        artifact = ln.Artifact(path, description=description, key=key)
        artifact.save()
        return artifact

    def annotate(self, artifact: ln.Artifact, labels: dict):
        # Attach universal labels to a saved artifact for later querying
        for field, values in labels.items():
            for value in values:
                label = ln.ULabel(name=value)
                label.save()
                artifact.ulabels.add(label)

    def query(self, description: str | None = None):
        # Django-style filters support lookups like description__contains
        qs = ln.Artifact.filter()
        if description:
            qs = qs.filter(description__contains=description)
        return list(qs.all())

Real-World Examples

class PipelineTracker:
    def __init__(self, pipeline_name: str):
        # A Transform records the pipeline itself; a Run records one
        # execution of it. The transform must be saved before a run
        # can reference it.
        transform = ln.Transform(name=pipeline_name, type="pipeline")
        transform.save()
        self.run = ln.Run(transform=transform)
        self.run.save()

    def add_input(self, artifact: ln.Artifact):
        # Link an existing artifact as an input of this run
        self.run.input_artifacts.add(artifact)

    def save_output(self, path: str, description: str) -> ln.Artifact:
        # Passing run=self.run records this artifact as a pipeline output
        output = ln.Artifact(path, description=description, run=self.run)
        output.save()
        return output

    def get_lineage(self, artifact: ln.Artifact) -> dict:
        # Walk one step back: the run that produced this artifact
        # and that run's input artifacts
        run = artifact.run
        inputs = list(run.input_artifacts.all()) if run else []
        return {
            "artifact": artifact.description,
            "run": run.transform.name if run else None,
            "inputs": [a.description for a in inputs],
        }
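The get_lineage pattern above generalizes to multi-step pipelines: each artifact points back to the run that produced it, and each run points to its input artifacts, forming a walkable graph. A plain-Python sketch of that recursive walk, with illustrative dictionaries standing in for LaminDB's registries (none of these names are LaminDB API):

```python
def trace_lineage(artifact: str, produced_by: dict, inputs_of: dict) -> dict:
    """Recursively walk from an artifact back to its raw inputs.
    produced_by maps artifact -> run; inputs_of maps run -> [artifacts]."""
    run = produced_by.get(artifact)
    if run is None:
        # A raw input: no run produced it, so the walk stops here
        return {"artifact": artifact, "run": None, "inputs": []}
    return {
        "artifact": artifact,
        "run": run,
        "inputs": [trace_lineage(a, produced_by, inputs_of) for a in inputs_of[run]],
    }
```

Applied to a two-step pipeline (reads -> counts -> normalized matrix), the result is a nested dictionary that bottoms out at artifacts with no producing run.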

Advanced Tips

Use LaminDB with cloud storage backends like S3 for scalable artifact storage while maintaining local metadata queries. Register AnnData objects directly for single-cell datasets to preserve observation and variable annotations. Build collection objects that group related artifacts into queryable datasets with shared metadata. When working with large cohorts, apply consistent naming conventions for transform objects so pipeline lineage remains navigable across hundreds of runs.
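The naming-convention tip can be made concrete with a small helper that builds predictable transform keys from pipeline name, step, and version. The scheme below is one possible convention, not something LaminDB prescribes:

```python
import re

def transform_key(pipeline: str, step: str, version: str) -> str:
    """Build a consistent, sortable key like 'rna-seq/02-align/v1.2'."""
    def slug(s: str) -> str:
        # Lowercase and collapse any non-alphanumeric runs to hyphens
        return re.sub(r"[^a-z0-9]+", "-", s.lower()).strip("-")
    return f"{slug(pipeline)}/{slug(step)}/v{version}"
```

Numbering the step segment (`02-align`) keeps transforms sorted in pipeline order, which makes lineage views navigable even across hundreds of runs.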

When to Use It?

Use Cases

Track input and output artifacts through a bioinformatics processing pipeline with full lineage. Annotate single-cell datasets with cell type ontology terms for searchable metadata. Version a curated reference dataset with immutable snapshots for reproducible analyses.
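The immutable-snapshot idea from the versioning use case can be sketched as append-only records: saving a new version never overwrites a prior one, and every snapshot stays retrievable by its version string. This is a conceptual model of the behavior, not LaminDB's actual storage layout:

```python
class VersionedDataset:
    """Append-only version history: saving never mutates a prior snapshot."""

    def __init__(self, name: str):
        self.name = name
        self._versions: list[tuple[str, bytes]] = []

    def save(self, content: bytes) -> str:
        # Each save appends a new (version, content) snapshot
        version = str(len(self._versions) + 1)
        self._versions.append((version, content))
        return version

    def get(self, version: str) -> bytes:
        # Old snapshots remain retrievable after newer saves
        return dict(self._versions)[version]

    @property
    def latest(self) -> str:
        return self._versions[-1][0]
```

Because old snapshots are never rewritten, an analysis pinned to a specific version string remains reproducible after the reference dataset is updated.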

Related Topics

Data management, biological databases, provenance tracking, artifact versioning, metadata annotation, data lineage, and research reproducibility.

Important Notes

Requirements

The LaminDB Python package installed, with a storage backend configured. A database backend for metadata storage, such as SQLite or PostgreSQL. A storage system for artifact files, either local or cloud-based.

Usage Recommendations

Do: register all pipeline inputs and outputs as artifacts to maintain complete data lineage. Use standardized ontology terms for annotation to enable consistent cross-dataset queries. Save artifacts with descriptive keys and documentation.

Don't: modify registered artifacts in place, since this breaks provenance integrity. Don't skip artifact registration for intermediate files that are needed for lineage reconstruction. Don't use free-text metadata instead of ontology labels where standardized terms are available.

Limitations

LaminDB is designed for biological data and may lack features for other data domains. Large artifact storage requires appropriate cloud infrastructure configuration. Ontology annotation depends on vocabulary coverage, which may not include all specialized terms.