Datamol

Automate and integrate Datamol workflows for efficient molecular data processing and cheminformatics tasks

Datamol is a community skill for molecular data processing using the datamol library, covering molecular parsing, descriptor calculation, fingerprint generation, scaffold analysis, and chemical data pipeline construction for cheminformatics workflows.

What Is This?

Overview

Datamol provides patterns for manipulating molecular data in Python with a simplified API built on top of RDKit. It covers SMILES parsing and standardization for consistent molecular representations, molecular descriptor calculation for quantitative structure-activity modeling, fingerprint generation including Morgan, MACCS, and topological types, scaffold decomposition and analysis for chemical series identification, and batch processing utilities for high-throughput molecular data pipelines. The skill enables cheminformatics developers to build molecular processing workflows with less boilerplate than raw RDKit, reducing common tasks like standardization from several explicit steps to a single function call.

Who Should Use This

This skill serves cheminformatics developers building molecular data processing pipelines, medicinal chemists analyzing compound libraries programmatically, and machine learning engineers preparing molecular features for property prediction models. It is particularly valuable for teams working with large, heterogeneous compound datasets sourced from multiple vendors or public databases.

Why Use It?

Problems It Solves

RDKit provides comprehensive chemistry functionality but requires verbose code for common operations like standardization and fingerprint generation. Molecular data from different sources uses inconsistent SMILES representations that need normalization before any meaningful comparison. Calculating molecular descriptors and fingerprints for large compound collections requires efficient batch processing. Converting between molecular formats and integrating with pandas DataFrames requires repetitive adapter code.

Core Highlights

Single-function SMILES parsing and standardization handles sanitization and canonicalization. Fingerprint generation wraps multiple algorithms behind a unified interface. Descriptor calculators return a dictionary for a single molecule (compute_many_descriptors) or a pandas DataFrame for a batch of molecules (batch_compute_many_descriptors), ready for machine learning pipelines. Scaffold operations extract Murcko scaffolds and generic frameworks from molecules.

How to Use It?

Basic Usage

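import datamol as dm

# Parse a SMILES string (aspirin); dm.to_mol returns None if parsing fails.
smiles = "CC(=O)Oc1ccccc1C(=O)O"
mol = dm.to_mol(smiles)
if mol is None:
    raise ValueError(f"Could not parse SMILES: {smiles}")

# Standardize first so descriptors and fingerprints see a consistent structure.
std_mol = dm.standardize_mol(mol)
canon = dm.to_smiles(std_mol)
print(f"Canonical: {canon}")

# For a single molecule, compute_many_descriptors returns a dict.
desc = dm.descriptors.compute_many_descriptors(std_mol)
print(f"MW: {desc['mw']:.1f}")
print(f"LogP: {desc['clogp']:.2f}")
print(f"HBD: {desc['n_lipinski_hbd']}")

# Morgan fingerprints are registered as "ecfp" in datamol; see
# dm.list_supported_fingerprints(). The default length is typically 2048 bits.
fp = dm.to_fp(std_mol, fp_type="ecfp")
print(f"Fingerprint shape: {fp.shape}")

# Extract the Murcko scaffold as a molecule, then serialize it to SMILES.
scaffold = dm.to_smiles(dm.to_scaffold_murcko(std_mol))
print(f"Scaffold: {scaffold}")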

Real-World Examples

import datamol as dm
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


class MolecularPipeline:
    """Standardize SMILES, then compute fingerprints and Murcko scaffolds."""

    def __init__(self, fp_type: str = "ecfp", n_bits: int = 2048):
        # "ecfp" is datamol's name for Morgan fingerprints.
        self.fp_type = fp_type
        self.n_bits = n_bits

    def process_smiles(self, smiles_list: list[str]) -> dict:
        # dm.to_mol returns None for SMILES that cannot be parsed.
        mols = [dm.to_mol(s) for s in smiles_list]
        valid = [m for m in mols if m is not None]
        std = [dm.standardize_mol(m) for m in valid]
        # The size keyword is forwarded to RDKit: fpSize in recent datamol
        # releases, nBits in older ones. Adjust to the installed version.
        fps = np.array([
            dm.to_fp(m, fp_type=self.fp_type, fpSize=self.n_bits)
            for m in std
        ])
        scaffolds = [dm.to_smiles(dm.to_scaffold_murcko(m)) for m in std]
        return {
            "n_valid": len(valid),
            "n_invalid": len(smiles_list) - len(valid),
            "fingerprints": fps,
            "scaffolds": scaffolds,
        }

    def similarity_matrix(self, fps: np.ndarray) -> np.ndarray:
        # Pairwise cosine similarity between fingerprint rows.
        return cosine_similarity(fps)


pipeline = MolecularPipeline()
result = pipeline.process_smiles([
    "CC(=O)Oc1ccccc1C(=O)O",        # aspirin
    "c1ccccc1",                      # benzene
    "CC(C)Cc1ccc(cc1)C(C)C(=O)O",    # ibuprofen
])
print(f"Valid: {result['n_valid']}")
print(f"FP shape: {result['fingerprints'].shape}")
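
The similarity_matrix method above uses cosine similarity. For binary fingerprints, Tanimoto (Jaccard) similarity is a common alternative. The following is a minimal, library-agnostic sketch rather than a datamol API: the names tanimoto and tanimoto_matrix are illustrative, and fingerprints are assumed to be equal-length sequences of 0/1 values.

```python
from typing import Sequence


def tanimoto(a: Sequence[int], b: Sequence[int]) -> float:
    """Tanimoto similarity of two equal-length binary fingerprints."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for x, y in zip(a, b) if x and y)    # bits set in both
    either = sum(1 for x, y in zip(a, b) if x or y)   # bits set in at least one
    # Convention chosen here: two all-zero fingerprints count as identical.
    return both / either if either else 1.0


def tanimoto_matrix(fps: Sequence[Sequence[int]]) -> list[list[float]]:
    """Pairwise Tanimoto similarities as a nested list."""
    return [[tanimoto(a, b) for b in fps] for a in fps]


# Toy 4-bit fingerprints for illustration only.
fps = [[1, 1, 0, 0], [1, 0, 1, 0], [1, 1, 0, 0]]
matrix = tanimoto_matrix(fps)
print(matrix[0][1])  # one shared bit out of three set overall
```

The pure-Python loops keep the idea visible; for large libraries a vectorized implementation would likely be preferable.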

Advanced Tips

Use dm.parallelized to spread batch processing of large SMILES lists across multiple CPU cores; the benefit is most noticeable for libraries of tens of thousands of compounds or more. Combine Morgan (ECFP) fingerprints with computed descriptors as features for improved property prediction models. Parse with dm.to_mol first and drop entries that return None, so invalid SMILES are skipped before expensive operations run.
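
The tip about combining fingerprints with computed descriptors amounts to concatenating both into one feature row per molecule. A minimal sketch, assuming the fingerprint is a sequence of bits and the descriptors arrive as a dict keyed by name; the helper build_feature_row and the chosen keys are illustrative, not part of datamol.

```python
from typing import Mapping, Sequence

# Fixed key order so every molecule yields columns in the same positions.
DESCRIPTOR_KEYS = ("mw", "clogp")


def build_feature_row(
    fp: Sequence[int],
    descriptors: Mapping[str, float],
    keys: Sequence[str] = DESCRIPTOR_KEYS,
) -> list[float]:
    """Concatenate fingerprint bits and selected descriptors into one row."""
    missing = [k for k in keys if k not in descriptors]
    if missing:
        raise KeyError(f"missing descriptors: {missing}")
    return [float(bit) for bit in fp] + [float(descriptors[k]) for k in keys]


# Toy inputs for illustration only; not computed from a real molecule.
row = build_feature_row([1, 0, 1, 1], {"mw": 180.2, "clogp": 1.3, "tpsa": 63.6})
print(row)
```

Descriptors and fingerprint bits live on different scales, so some models may need the descriptor columns scaled before training.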

When to Use It?

Use Cases

Build a compound library preprocessing pipeline that standardizes SMILES and generates fingerprints for similarity searching. Create a scaffold analysis tool that groups compounds by chemical series for medicinal chemistry review. Implement a molecular feature extraction module for machine learning property prediction, such as ADMET modeling or solubility estimation.
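
The scaffold analysis use case reduces to grouping compounds by their scaffold SMILES. A minimal sketch, assuming the scaffold strings have already been computed upstream; the function name group_by_scaffold and the input pairs are illustrative.

```python
from collections import defaultdict
from typing import Iterable


def group_by_scaffold(pairs: Iterable[tuple[str, str]]) -> dict[str, list[str]]:
    """Map each scaffold SMILES to the compound SMILES that share it."""
    groups: dict[str, list[str]] = defaultdict(list)
    for compound, scaffold in pairs:
        groups[scaffold].append(compound)
    # Largest series first, which is usually what a reviewer wants to see.
    return dict(sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True))


# Hand-written (compound, scaffold) pairs for illustration; in practice the
# scaffold strings would come from the scaffold step of the pipeline.
pairs = [
    ("CC(=O)Oc1ccccc1C(=O)O", "c1ccccc1"),
    ("Cc1ccncc1", "c1ccncc1"),
    ("CC(C)Cc1ccc(cc1)C(C)C(=O)O", "c1ccccc1"),
]
series = group_by_scaffold(pairs)
for scaffold, members in series.items():
    print(scaffold, len(members))
```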

Related Topics

Cheminformatics, molecular fingerprints, RDKit, structure-activity relationships, and computational chemistry.

Important Notes

Requirements

Python with the datamol package installed; datamol depends on RDKit, which must be working as the underlying chemistry engine. NumPy for fingerprint array operations. Pandas for descriptor DataFrame handling in analysis workflows. scikit-learn is needed only for the cosine-similarity step in the pipeline example.

Usage Recommendations

Do: standardize molecules before fingerprint generation to ensure consistent representations. Use canonical SMILES as unique compound identifiers after standardization. Validate SMILES parsing results before processing to handle invalid input gracefully.

Don't: skip SMILES standardization, which can produce different fingerprints for equivalent molecules. Mix fingerprint types or parameters within a single similarity analysis. Process very large libraries on a single thread when dm.parallelized is available.
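
The recommendations on canonical SMILES identifiers and on validating parse results can be combined in a single pass. A minimal sketch, assuming a canonicalize callable that returns a canonical SMILES string or None for invalid input; the deduplicate helper and the lookup-table canonicalizer are hypothetical stand-ins for a real parse, standardize, and serialize step.

```python
from typing import Callable, Iterable, Optional


def deduplicate(
    smiles_list: Iterable[str],
    canonicalize: Callable[[str], Optional[str]],
) -> tuple[dict[str, str], list[str]]:
    """Return ({canonical: first input seen}, [inputs that failed])."""
    unique: dict[str, str] = {}
    invalid: list[str] = []
    for smi in smiles_list:
        canon = canonicalize(smi)
        if canon is None:
            invalid.append(smi)            # keep rejects for later inspection
        else:
            unique.setdefault(canon, smi)  # first occurrence wins
    return unique, invalid


# Toy canonicalizer backed by a lookup table, for illustration only.
# A real pipeline would parse, standardize, and re-serialize the molecule,
# returning None when parsing fails.
_TABLE = {"OCC": "CCO", "CCO": "CCO", "c1ccccc1": "c1ccccc1"}
unique, invalid = deduplicate(["OCC", "CCO", "c1ccccc1", "not_a_smiles"], _TABLE.get)
print(sorted(unique), invalid)
```

Keeping the rejected inputs in a separate list makes it easy to report them instead of dropping them silently.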

Limitations

Datamol wraps RDKit and inherits its limitations in handling certain complex chemistry like organometallics. Fingerprint similarity does not capture all aspects of molecular activity relevant to drug discovery. Descriptor calculations assume valid molecular structures and may produce misleading values for unusual chemistries. Performance on very large libraries may require parallelization and chunked processing strategies to achieve acceptable processing times.
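
Chunked processing, mentioned above, can be sketched with the standard library alone. The chunked helper below is illustrative and not part of datamol; it assumes the input is any iterable of SMILES strings and that each chunk is processed independently.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], size: int) -> Iterator[list[T]]:
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk


# Each chunk could be featurized and written to disk before the next is read,
# which keeps peak memory bounded by the chunk size.
smiles = ["CCO", "c1ccccc1", "CC(=O)O", "CCN", "CCC"]
for batch in chunked(smiles, size=2):
    print(len(batch), batch)
```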