GeniML

GeniML: machine learning workflows for genomic interval data

GeniML is a community skill for applying machine learning to genomic interval data using the GeniML library, covering region set embeddings, similarity analysis, classification, clustering, and integration with BED file workflows for epigenomics research.

What Is This?

Overview

GeniML provides patterns for building machine learning models that operate on genomic region sets such as ChIP-seq peaks and ATAC-seq accessible regions. It covers region set embedding that converts BED files into fixed-dimensional vectors for ML consumption, region set similarity computation using Jaccard index and embedding distance metrics, classification models that predict biological labels from region set features, clustering workflows that group similar region sets by their genomic distribution patterns, and integration with standard genomic file formats including BED, narrowPeak, and broadPeak. The skill enables researchers to apply ML techniques to collections of genomic intervals for comparative epigenomics analysis across large experimental cohorts.

Who Should Use This

This skill serves bioinformaticians analyzing collections of genomic region sets from epigenomic experiments, researchers comparing regulatory landscapes across cell types and conditions, and ML engineers building predictive models from genomic interval data.

Why Use It?

Problems It Solves

Genomic region sets exist as variable-length lists of intervals that standard ML algorithms cannot consume directly. Comparing thousands of region sets pairwise with overlap statistics is computationally expensive and scales poorly. Feature engineering from raw BED files requires domain knowledge about genome structure and annotation. Clustering and classifying region sets needs fixed-dimensional representations that preserve biological similarity.
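To make the fixed-dimensional idea concrete, here is a minimal sketch of the simplest such representation: a binary vector over a shared list of reference regions (often called a universe). This is not Region2Vec and not the GeniML API. The function name `to_binary_vector` and the toy coordinates are hypothetical, and the sketch assumes the universe regions do not overlap one another.

```python
from bisect import bisect_left

# (chrom, start, end), 0-based half-open coordinates as in BED.
Interval = tuple[str, int, int]


def to_binary_vector(region_set: list[Interval],
                     universe: list[Interval]) -> list[int]:
    """Mark each universe region that overlaps any region in the set.

    Assumes universe regions do not overlap one another.
    """
    # Group universe regions by chromosome, remembering each column index.
    by_chrom: dict[str, list[tuple[int, int, int]]] = {}
    for col, (chrom, start, end) in enumerate(universe):
        by_chrom.setdefault(chrom, []).append((start, end, col))
    for regions in by_chrom.values():
        regions.sort()
    starts = {c: [r[0] for r in regs] for c, regs in by_chrom.items()}

    vector = [0] * len(universe)
    for chrom, start, end in region_set:
        regions = by_chrom.get(chrom, [])
        # Universe regions starting before `end` are the only candidates.
        i = bisect_left(starts.get(chrom, []), end) - 1
        # Walk left while the candidate still ends after `start`.
        while i >= 0 and regions[i][1] > start:
            vector[regions[i][2]] = 1
            i -= 1
    return vector


# Toy data: two region sets of different sizes map to same-length vectors.
universe = [("chr1", 0, 100), ("chr1", 200, 300),
            ("chr1", 500, 600), ("chr2", 0, 50)]
small_set = [("chr1", 250, 520)]
large_set = [("chr1", 10, 20), ("chr1", 90, 210), ("chr2", 40, 60)]
vec_small = to_binary_vector(small_set, universe)
vec_large = to_binary_vector(large_set, universe)
print(len(vec_small), len(vec_large))
```

Both vectors have one entry per universe region regardless of how many intervals the input contains, which is the property that standard ML algorithms need. Learned embeddings replace this sparse vector with a dense one.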

Core Highlights

Region2Vec embeds BED files into dense vectors using learned genomic context. Similarity functions compute overlap and embedding-based distances between region sets. Classification pipelines predict cell type or tissue labels from region set embeddings. Clustering algorithms group region sets by their genomic distribution patterns.
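As a point of reference for the overlap-based side of similarity, the following is a minimal pure-Python sketch of a base-pair Jaccard index between two region sets. It does not use GeniML's own similarity functions. The helper names (`merge`, `covered_bp`, `intersection_bp`, `jaccard`) are hypothetical, and other definitions of Jaccard (for example, counting overlapping regions rather than base pairs) are equally valid.

```python
def merge(intervals):
    """Sort intervals and merge overlapping or touching ones per chromosome."""
    merged = {}
    for chrom, start, end in sorted(intervals):
        runs = merged.setdefault(chrom, [])
        if runs and start <= runs[-1][1]:
            runs[-1][1] = max(runs[-1][1], end)
        else:
            runs.append([start, end])
    return merged


def covered_bp(merged):
    """Total number of base pairs covered by merged intervals."""
    return sum(end - start for runs in merged.values() for start, end in runs)


def intersection_bp(a, b):
    """Base pairs shared by two merged interval maps (two-pointer sweep)."""
    total = 0
    for chrom in a.keys() & b.keys():
        ra, rb = a[chrom], b[chrom]
        i = j = 0
        while i < len(ra) and j < len(rb):
            lo = max(ra[i][0], rb[j][0])
            hi = min(ra[i][1], rb[j][1])
            if hi > lo:
                total += hi - lo
            # Advance whichever interval ends first.
            if ra[i][1] < rb[j][1]:
                i += 1
            else:
                j += 1
    return total


def jaccard(set_a, set_b):
    """Base-pair Jaccard index: |A intersect B| / |A union B|."""
    a, b = merge(set_a), merge(set_b)
    inter = intersection_bp(a, b)
    union = covered_bp(a) + covered_bp(b) - inter
    return inter / union if union else 0.0


# Toy intervals as (chrom, start, end), 0-based half-open.
peaks_a = [("chr1", 0, 100), ("chr1", 50, 150)]
peaks_b = [("chr1", 100, 200)]
print(jaccard(peaks_a, peaks_b))
```

Each comparison requires sorting and sweeping both sets, which is why all-against-all overlap statistics become expensive for thousands of region sets.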

How to Use It?

Basic Usage

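import numpy as np
from geniml.io import RegionSet
from geniml.region2vec import Region2VecExModel

# Load two region sets from BED files.
rs1 = RegionSet("peaks_celltype_a.bed")
rs2 = RegionSet("peaks_celltype_b.bed")
print(f"Regions in set 1: {len(rs1)}")
print(f"Regions in set 2: {len(rs2)}")

# Load a pretrained Region2Vec model (path is a placeholder).
model = Region2VecExModel(
    "path/to/region2vec_model")

def embed_set(region_set) -> np.ndarray:
    # encode() may return one vector per region;
    # if so, mean-pool into one set-level vector.
    emb = np.asarray(model.encode(region_set))
    return emb.mean(axis=0) if emb.ndim == 2 else emb

emb1 = embed_set(rs1)
emb2 = embed_set(rs2)
print(f"Embedding dim: {emb1.shape}")

# Cosine similarity between the two set-level vectors.
cosine_sim = np.dot(emb1, emb2) / (
    np.linalg.norm(emb1)
    * np.linalg.norm(emb2))
print(f"Similarity: {cosine_sim:.4f}")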

Real-World Examples

import numpy as np
from geniml.io import RegionSet
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

class RegionSetAnalyzer:
    def __init__(self, model):
        # Keep one loaded model and reuse it for every file.
        self.model = model

    def embed_collection(
            self, bed_files: list[str]
            ) -> np.ndarray:
        embeddings = []
        for path in bed_files:
            rs = RegionSet(path)
            emb = np.asarray(self.model.encode(rs))
            # If encode() returns one vector per region,
            # mean-pool into one vector per region set.
            if emb.ndim == 2:
                emb = emb.mean(axis=0)
            embeddings.append(emb)
        # One row per BED file.
        return np.vstack(embeddings)

    def cluster(self, embeddings: np.ndarray,
                n_clusters: int = 5) -> dict:
        # n_init is set explicitly so results do not
        # depend on the scikit-learn version default.
        km = KMeans(n_clusters=n_clusters,
                    n_init=10,
                    random_state=42)
        labels = km.fit_predict(embeddings)
        score = silhouette_score(
            embeddings, labels)
        return {"labels": labels.tolist(),
                "n_clusters": n_clusters,
                "silhouette": round(float(score), 4)}

    def pairwise_similarity(
            self, embeddings: np.ndarray
            ) -> np.ndarray:
        # Cosine similarity matrix; the clip guards
        # against division by zero for empty vectors.
        norms = np.linalg.norm(
            embeddings, axis=1, keepdims=True)
        normalized = embeddings / np.clip(
            norms, 1e-12, None)
        return normalized @ normalized.T

# `model` is the Region2VecExModel loaded in Basic Usage.
analyzer = RegionSetAnalyzer(model)
embs = analyzer.embed_collection(
    ["peaks_a.bed", "peaks_b.bed",
     "peaks_c.bed"])
clusters = analyzer.cluster(embs, n_clusters=2)
print(f"Silhouette: {clusters['silhouette']}")

Advanced Tips

Use a universe file that defines the complete set of candidate genomic regions, and use the same universe for every experiment so that embeddings are comparable. Cosine similarity is already scale-invariant, but L2-normalize embeddings before dot-product or Euclidean comparisons (including k-means) so that vector magnitude does not dominate the result. Evaluate clustering across a range of cluster counts with silhouette scores, then check the best-scoring grouping against known biology, since the highest silhouette is not necessarily the most biologically meaningful. When embedding large file collections, load the model once and reuse it to avoid repeated model-loading overhead.
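The normalization and cluster-count advice can be sketched as follows. This is a minimal illustration on synthetic vectors, not on real Region2Vec output. The helper names `l2_normalize` and `sweep_cluster_counts` and the toy data are hypothetical, and scikit-learn is assumed to be installed.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def l2_normalize(embeddings: np.ndarray) -> np.ndarray:
    """Scale each row to unit length; the clip avoids division by zero."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.clip(norms, 1e-12, None)


def sweep_cluster_counts(embeddings: np.ndarray, k_values) -> dict[int, float]:
    """Fit k-means for each k and record the silhouette score."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=42).fit_predict(embeddings)
        scores[k] = float(silhouette_score(embeddings, labels))
    return scores


# Synthetic stand-in for set-level embeddings: three well-separated groups.
rng = np.random.default_rng(0)
centers = np.array([[5, 0, 0, 0], [0, 5, 0, 0], [0, 0, 5, 0]])
toy = np.vstack([c + 0.1 * rng.standard_normal((20, 4)) for c in centers])

scores = sweep_cluster_counts(l2_normalize(toy), range(2, 7))
best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print(f"Best k by silhouette: {best_k}")
```

On real data the silhouette curve is usually flatter than in this toy case, so treat the best-scoring k as a candidate to inspect rather than a final answer.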

When to Use It?

Use Cases

Build a cell type classifier that predicts tissue identity from ATAC-seq peak sets using region embeddings. Create a similarity search tool that finds the most similar reference epigenomes for a query sample. Implement a clustering pipeline that groups ChIP-seq experiments by transcription factor binding patterns.
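The similarity search use case can be sketched with plain NumPy, assuming set-level embeddings have already been computed. The function name `top_k_similar`, the reference names, and the toy vectors below are hypothetical placeholders rather than part of GeniML.

```python
import numpy as np


def top_k_similar(query: np.ndarray, references: np.ndarray,
                  names: list[str], k: int = 3) -> list[tuple[str, float]]:
    """Return the k reference names with the highest cosine similarity."""
    ref_norms = np.linalg.norm(references, axis=1, keepdims=True)
    refs = references / np.clip(ref_norms, 1e-12, None)
    q = query / max(float(np.linalg.norm(query)), 1e-12)
    sims = refs @ q
    # Sort by descending similarity and keep the first k.
    order = np.argsort(-sims)[:k]
    return [(names[i], float(sims[i])) for i in order]


# Toy reference embeddings, one row per reference epigenome.
reference_names = ["ref_a", "ref_b", "ref_c"]
reference_embs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query_emb = np.array([1.0, 0.1])

hits = top_k_similar(query_emb, reference_embs, reference_names, k=2)
for name, sim in hits:
    print(f"{name}: {sim:.3f}")
```

For very large reference collections, an approximate nearest-neighbor index would likely replace the brute-force matrix product used here.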

Related Topics

Genomic machine learning, region set analysis, epigenomics, BED file processing, and computational genomics.

Important Notes

Requirements

Python with the geniml package installed, plus NumPy and scikit-learn for the examples above. Pretrained Region2Vec models for embedding generation. BED files containing genomic regions as input data.

Usage Recommendations

Do: use consistent genome assemblies across all region sets in an analysis. Validate embeddings with known biological relationships before downstream analysis. Apply appropriate clustering evaluation metrics to assess grouping quality.

Don't: mix region sets from different genome assemblies without liftover conversion. Assume embedding similarity directly implies biological function without further validation. Ignore the effect of region set size on embedding quality.
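One way to validate embeddings against known biological relationships is to check that region sets sharing a known label are, on average, more similar to each other than to sets with a different label. The sketch below assumes set-level embeddings and trusted labels are available. The function name `within_between_similarity` and the toy values are hypothetical.

```python
import numpy as np


def within_between_similarity(embeddings: np.ndarray,
                              labels: list[str]) -> tuple[float, float]:
    """Mean cosine similarity for same-label pairs and different-label pairs.

    Returns NaN for a group with no pairs (for example, if every label is unique).
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    normalized = embeddings / np.clip(norms, 1e-12, None)
    sims = normalized @ normalized.T

    label_arr = np.asarray(labels)
    same = label_arr[:, None] == label_arr[None, :]
    # Use each unordered pair once and skip the diagonal.
    upper = np.triu(np.ones_like(same, dtype=bool), k=1)
    within = sims[same & upper]
    between = sims[~same & upper]
    return float(within.mean()), float(between.mean())


# Toy embeddings with known cell-type labels.
toy_embs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
toy_labels = ["T-cell", "T-cell", "B-cell", "B-cell"]
within, between = within_between_similarity(toy_embs, toy_labels)
print(f"within={within:.3f} between={between:.3f}")
```

If the within-label mean is not clearly higher than the between-label mean, the embeddings may not capture the relationship of interest, and downstream clustering or classification results deserve extra scrutiny.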

Limitations

Embedding quality depends on the pretraining data and may not generalize to all genomic contexts. Region sets with very few intervals may produce unreliable embeddings. Embedding many region sets takes significant compute, and the pairwise similarity matrix grows quadratically with the number of sets.
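A simple guard against the small-region-set limitation is to count intervals before embedding and flag sets below a chosen cutoff. The sketch below is pure Python. The helper names and the `min_regions=1000` default are hypothetical, and an appropriate cutoff depends on the assay and model.

```python
from typing import Iterable


def count_bed_intervals(lines: Iterable[str]) -> int:
    """Count data lines in BED text, skipping blanks, comments, and headers."""
    count = 0
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith(("#", "track", "browser")):
            continue
        count += 1
    return count


def flag_small_sets(region_counts: dict[str, int],
                    min_regions: int = 1000) -> list[str]:
    """Return names of region sets with fewer than min_regions intervals."""
    return sorted(name for name, n in region_counts.items()
                  if n < min_regions)


# Toy BED content; in practice, pass an open file handle instead.
example_lines = ["track name=example", "# comment",
                 "chr1\t0\t10", "", "chr1\t20\t30\n"]
n_intervals = count_bed_intervals(example_lines)
flagged = flag_small_sets({"sample_a": n_intervals, "sample_b": 5000})
print(n_intervals, flagged)
```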