Datamol
Automate and integrate Datamol workflows for efficient molecular data processing and cheminformatics tasks
Datamol is a community skill for molecular data processing using the datamol library, covering molecular parsing, descriptor calculation, fingerprint generation, scaffold analysis, and chemical data pipeline construction for cheminformatics workflows.
What Is This?
Overview
Datamol provides patterns for manipulating molecular data in Python with a simplified API built on top of RDKit. It covers SMILES parsing and standardization for consistent molecular representations, molecular descriptor calculation for quantitative structure-activity modeling, fingerprint generation including Morgan, MACCS, and topological types, scaffold decomposition and analysis for chemical series identification, and batch processing utilities for high-throughput molecular data pipelines. The skill enables cheminformatics developers to build molecular processing workflows with less boilerplate than raw RDKit, reducing common tasks like standardization from several explicit steps to a single function call.
Who Should Use This
This skill serves cheminformatics developers building molecular data processing pipelines, medicinal chemists analyzing compound libraries programmatically, and machine learning engineers preparing molecular features for property prediction models. It is particularly valuable for teams working with large, heterogeneous compound datasets sourced from multiple vendors or public databases.
Why Use It?
Problems It Solves
RDKit provides comprehensive chemistry functionality but requires verbose code for common operations like standardization and fingerprint generation. Molecular data from different sources uses inconsistent SMILES representations that need normalization before any meaningful comparison. Calculating molecular descriptors and fingerprints for large compound collections requires efficient batch processing. Converting between molecular formats and integrating with pandas DataFrames requires repetitive adapter code.
Core Highlights
Single-function SMILES parsing and standardization handles sanitization and canonicalization. Fingerprint generation wraps multiple algorithms behind a unified interface. Descriptor calculators return pandas DataFrames ready for machine learning pipelines. Scaffold operations extract Murcko scaffolds and generic frameworks from molecules.
How to Use It?
Basic Usage
import datamol as dm

# Aspirin as a worked example
smiles = "CC(=O)Oc1ccccc1C(=O)O"
mol = dm.to_mol(smiles)
std_mol = dm.standardize_mol(mol)
canon = dm.to_smiles(std_mol)
print(f"Canonical: {canon}")

desc = dm.descriptors.compute_many_descriptors(std_mol)
print(f"MW: {desc.get('mw', 0):.1f}")
print(f"LogP: {desc.get('clogp', 0):.2f}")
print(f"HBD: {desc.get('n_hbd', 0)}")

# "ecfp" is datamol's Morgan-style circular fingerprint
fp = dm.to_fp(std_mol, fp_type="ecfp")
print(f"Fingerprint shape: {fp.shape}")

# Murcko scaffold: ring systems plus linkers, side chains removed
scaffold = dm.to_smiles(dm.to_scaffold_murcko(std_mol))
print(f"Scaffold: {scaffold}")
Real-World Examples
import datamol as dm
import numpy as np

class MolecularPipeline:
    def __init__(self, fp_type: str = "ecfp"):
        # "ecfp" is datamol's Morgan-style circular fingerprint
        self.fp_type = fp_type

    def process_smiles(self, smiles_list: list[str]) -> dict:
        mols = [dm.to_mol(s) for s in smiles_list]
        valid = [(s, m) for s, m in zip(smiles_list, mols) if m is not None]
        std = [(s, dm.standardize_mol(m)) for s, m in valid]
        fps = np.array([dm.to_fp(m, fp_type=self.fp_type) for _, m in std])
        scaffolds = [dm.to_smiles(dm.to_scaffold_murcko(m)) for _, m in std]
        return {
            "n_valid": len(valid),
            "n_invalid": len(smiles_list) - len(valid),
            "fingerprints": fps,
            "scaffolds": scaffolds,
        }

    def similarity_matrix(self, fps: np.ndarray) -> np.ndarray:
        from sklearn.metrics.pairwise import cosine_similarity
        return cosine_similarity(fps)

pipeline = MolecularPipeline()
result = pipeline.process_smiles([
    "CC(=O)Oc1ccccc1C(=O)O",       # aspirin
    "c1ccccc1",                    # benzene
    "CC(C)Cc1ccc(cc1)C(C)C(=O)O",  # ibuprofen
])
print(f"Valid: {result['n_valid']}")
print(f"FP shape: {result['fingerprints'].shape}")
Advanced Tips
Use dm.parallelized for batch processing large SMILES lists across multiple CPU cores, which can reduce processing time significantly for libraries exceeding tens of thousands of compounds. Combine Morgan fingerprints with computed descriptors as features for improved property prediction models. Prefilter molecules with dm.to_mol before expensive operations to skip invalid entries early in the pipeline.
When to Use It?
Use Cases
Build a compound library preprocessing pipeline that standardizes SMILES and generates fingerprints for similarity searching. Create a scaffold analysis tool that groups compounds by chemical series for medicinal chemistry review. Implement a molecular feature extraction module for machine learning property prediction, such as ADMET modeling or solubility estimation.
Related Topics
Cheminformatics, molecular fingerprints, RDKit, structure-activity relationships, and computational chemistry.
Important Notes
Requirements
Python with the datamol package installed, which depends on RDKit. NumPy for fingerprint array operations. Pandas for descriptor DataFrame handling in analysis workflows. A working RDKit installation is required as the underlying chemistry engine.
Usage Recommendations
Do: standardize molecules before fingerprint generation to ensure consistent representations. Use canonical SMILES as unique compound identifiers after standardization. Validate SMILES parsing results before processing to handle invalid input gracefully.
Don't: skip SMILES standardization; skipping it can produce different fingerprints for equivalent molecules. Mix fingerprint types or parameters within a single similarity analysis. Process very large libraries on a single thread when dm.parallelized is available.
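The "same fingerprint type and parameters" rule matters because scores are only comparable within one representation. Datamol returns fingerprints as NumPy arrays, and for binary fingerprints the conventional metric is Tanimoto similarity; a minimal NumPy sketch (the function name is hypothetical):

```python
import numpy as np

def tanimoto_matrix(fps: np.ndarray) -> np.ndarray:
    """Pairwise Tanimoto similarity for an (n_mols, n_bits) 0/1 array."""
    bits = fps.astype(bool).astype(int)
    inter = bits @ bits.T                               # |A ∩ B|
    counts = bits.sum(axis=1)                           # |A|, |B|
    union = counts[:, None] + counts[None, :] - inter   # |A ∪ B|
    with np.errstate(divide="ignore", invalid="ignore"):
        sim = np.where(union > 0, inter / union, 0.0)
    return sim

# Toy 4-bit fingerprints; real ones would come from dm.to_fp.
fps = np.array([[1, 1, 0, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 0]])
sim = tanimoto_matrix(fps)
print(sim)
```

Rows 0 and 2 are identical (similarity 1.0), while rows 0 and 1 share one of three set bits (1/3); mixing fingerprint types would make such numbers meaningless.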
Limitations
Datamol wraps RDKit and inherits its limitations in handling certain complex chemistry like organometallics. Fingerprint similarity does not capture all aspects of molecular activity relevant to drug discovery. Descriptor calculations assume valid molecular structures and may produce misleading values for unusual chemistries. Performance on very large libraries may require parallelization and chunked processing strategies to achieve acceptable processing times.
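The chunked-processing strategy mentioned above can be as simple as a slicing generator; the helper name is an assumption, not a datamol API, and in practice each chunk would be fed to dm.to_fp or dm.parallelized.

```python
from typing import Iterator, Sequence

def chunked(seq: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    # Yield successive fixed-size slices; the last chunk may be shorter.
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

# Toy stand-in for a large SMILES library (simple alkanes).
smiles_library = ["C" * (1 + i % 5) for i in range(10)]
chunks = list(chunked(smiles_library, 4))
print([len(c) for c in chunks])  # → [4, 4, 2]
```

Keeping chunk sizes fixed bounds peak memory, since only one chunk's fingerprints need to be materialized at a time before being written out or stacked.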