Lamindb
Manage biological data and computational workflows with automated LaminDB database integration
Lamindb is a community skill for managing biological research data with LaminDB, covering data ingestion, artifact tracking, dataset versioning, metadata annotation, and lineage querying for computational biology data management.
What Is This?
Overview
Lamindb provides tools for organizing and tracking biological research data using the LaminDB framework. Data ingestion registers files, arrays, and dataframes as tracked artifacts with automatic format detection and checksum validation. Artifact tracking maintains provenance records linking data to source experiments, processing scripts, and pipeline runs. Dataset versioning manages data iterations with immutable snapshots and version history. Metadata annotation tags artifacts with biological ontology terms from CellTypist, the Gene Ontology, and other standardized vocabularies. Lineage querying traces data transformations from raw inputs through processed outputs. Together, these let researchers build reproducible data management workflows for biological research.
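LaminDB performs format detection and checksum validation automatically at ingestion time. As a stand-alone sketch of what that registration step conceptually does (plain Python, no LaminDB required; the `register` helper and its return fields are illustrative names, not LaminDB API):

```python
import hashlib
from pathlib import Path

# illustrative mapping from file suffix to a coarse format label
FORMATS = {
    ".csv": "dataframe",
    ".parquet": "dataframe",
    ".h5ad": "anndata",
    ".fastq": "sequencing-reads",
}


def register(path: str) -> dict:
    """Sketch of artifact registration: detect the format from the
    file suffix and fingerprint the content with a SHA-256 checksum."""
    p = Path(path)
    digest = hashlib.sha256(p.read_bytes()).hexdigest()
    return {
        "key": p.name,
        "format": FORMATS.get(p.suffix, "unknown"),
        "sha256": digest,
        "size": p.stat().st_size,
    }
```

The checksum is what makes later integrity checks and deduplication possible: re-registering identical content yields the same fingerprint.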
Who Should Use This
This skill serves bioinformatics researchers managing multi-omics datasets, computational biology teams building data pipelines, and research organizations establishing data governance for biological data assets. It is particularly valuable for teams working across multiple projects who need consistent, queryable records of how datasets were produced and transformed.
Why Use It?
Problems It Solves
Research data files scattered across storage systems lack provenance tracking, making reproduction effectively impossible. Dataset versions are managed through filename conventions that become ambiguous over time. Biological metadata uses inconsistent terminology when ontology-based annotation is not enforced. Data lineage from raw sequencing reads to processed results requires manual documentation that quickly becomes stale.
Core Highlights
Artifact registrar ingests data files with automatic format detection and integrity checking. Provenance tracker links artifacts to source experiments and processing runs. Version manager creates immutable snapshots with queryable version history. Ontology annotator tags data with standardized biological terms, enabling consistent cross-dataset search and comparison.
How to Use It?
Basic Usage
```python
import lamindb as ln


class DataManager:
    def __init__(self):
        # initialize (and register) a local LaminDB instance
        # with artifact storage under ./data
        ln.setup.init(storage="./data")

    def register_file(
        self, path: str, description: str, key: str | None = None
    ) -> ln.Artifact:
        # wrap the file as a tracked artifact; LaminDB computes the
        # checksum and detects the format from the file suffix
        artifact = ln.Artifact(path, description=description, key=key)
        artifact.save()
        return artifact

    def annotate(self, artifact: ln.Artifact, labels: dict):
        # attach universal labels; each value becomes a ULabel record
        for field, values in labels.items():
            for val in values:
                label = ln.ULabel(name=val)
                label.save()
                artifact.ulabels.add(label)

    def query(self, description: str | None = None):
        # Django-style queryset filtering on registry fields
        qs = ln.Artifact.filter()
        if description:
            qs = qs.filter(description__contains=description)
        return list(qs.all())
```

Real-World Examples
```python
class PipelineTracker:
    def __init__(self, pipeline_name: str):
        # a Transform models the pipeline; a Run is one execution of it
        transform = ln.Transform(name=pipeline_name, type="pipeline")
        transform.save()  # save before a Run can reference it
        self.run = ln.Run(transform=transform)
        self.run.save()

    def add_input(self, artifact: ln.Artifact):
        # link an existing artifact as an input of this run
        self.run.input_artifacts.add(artifact)

    def save_output(self, path: str, description: str) -> ln.Artifact:
        # outputs carry a reference to the run that produced them
        output = ln.Artifact(path, description=description, run=self.run)
        output.save()
        return output

    def get_lineage(self, artifact: ln.Artifact) -> dict:
        # walk one step back: the producing run and its inputs
        run = artifact.run
        inputs = list(run.input_artifacts.all()) if run else []
        return {
            "artifact": artifact.description,
            "run": run.transform.name if run else None,
            "inputs": [a.description for a in inputs],
        }
```

Advanced Tips
Use LaminDB with cloud storage backends like S3 for scalable artifact storage while maintaining local metadata queries. Register AnnData objects directly for single-cell datasets to preserve observation and variable annotations. Build collection objects that group related artifacts into queryable datasets with shared metadata. When working with large cohorts, apply consistent naming conventions for transform objects so pipeline lineage remains navigable across hundreds of runs.
When to Use It?
Use Cases
Track input and output artifacts through a bioinformatics processing pipeline with full lineage. Annotate single-cell datasets with cell type ontology terms for searchable metadata. Version a curated reference dataset with immutable snapshots for reproducible analyses.
Related Topics
Data management, biological databases, provenance tracking, artifact versioning, metadata annotation, data lineage, and research reproducibility.
Important Notes
Requirements
LaminDB Python package installed with a storage backend configured. A database backend for metadata storage, such as SQLite or PostgreSQL. A storage system for artifact files, either local or cloud-based.
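A minimal local setup, assuming the `lamin` CLI that ships with the lamindb package; the storage path is a placeholder:

```shell
pip install lamindb
# create a local instance with SQLite metadata and file storage in ./data
lamin init --storage ./data
```

For PostgreSQL metadata or S3 storage, point the init flags at those backends instead of local paths.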
Usage Recommendations
Do: register all pipeline inputs and outputs as artifacts to maintain complete data lineage. Use standardized ontology terms for annotation to enable consistent cross-dataset queries. Save artifacts with descriptive keys and documentation.
Don't: modify registered artifacts in place, since this breaks provenance integrity; skip artifact registration for intermediate files that are needed for lineage reconstruction; or use free-text metadata where standardized ontology terms are available.
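The last recommendation, preferring ontology labels over free text, can be enforced with a simple validation gate before saving. The sketch below uses a small hard-coded vocabulary in place of a real ontology source (in practice this would come from the Cell Ontology, Gene Ontology, or similar):

```python
# tiny stand-in for a controlled vocabulary of cell-type terms
CELL_TYPE_TERMS = {"T cell", "B cell", "natural killer cell", "monocyte"}


def validate_labels(labels: list[str], vocabulary: set[str]) -> list[str]:
    """Return the labels NOT found in the controlled vocabulary,
    so callers can reject free-text annotations before saving."""
    return [lab for lab in labels if lab not in vocabulary]
```

An empty return value means every label is a standardized term and the annotation is safe to persist.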
Limitations
LaminDB is designed for biological data and may lack features for other data domains. Large artifact storage requires appropriate cloud infrastructure configuration. Ontology annotation depends on vocabulary coverage which may not include all specialized terms.