RDKit
Streamline cheminformatics workflows by automating RDKit molecule processing and chemical data analysis
RDKit is a community skill for cheminformatics using the RDKit Python library, covering molecular representation, descriptor calculation, substructure search, chemical fingerprints, and reaction processing for computational chemistry and drug discovery.
What Is This?
Overview
RDKit is a comprehensive cheminformatics toolkit for working with chemical structures and molecular data. This skill covers five areas: molecular representation, which parses SMILES strings, MOL files, and SDF files into molecule objects with atom and bond properties; descriptor calculation, which computes molecular weight, LogP, hydrogen bond donor and acceptor counts, and hundreds of other descriptors; substructure search, which finds molecules containing specific chemical patterns using SMARTS queries; chemical fingerprints, which generates Morgan, topological, and MACCS fingerprints for similarity searching; and reaction processing, which applies chemical transformations using reaction SMARTS templates (RDKit's SMIRKS-like reaction syntax). Together these let researchers analyze and manipulate chemical data programmatically.
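Reaction processing is the one capability in this overview that the usage examples further down do not demonstrate, so here is a minimal sketch. It uses the amide-coupling template from RDKit's getting-started documentation; the choice of reactants (acetic acid and methylamine) is only for illustration.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Reaction SMARTS for amide formation: the acid carbon (map 1) is coupled
# to a nitrogen that carries at least one hydrogen (map 3).
rxn = AllChem.ReactionFromSmarts(
    '[C:1](=[O:2])-[OD1].[N!H0:3]>>[C:1](=[O:2])[N:3]'
)

acid = Chem.MolFromSmiles('CC(=O)O')  # acetic acid
amine = Chem.MolFromSmiles('NC')      # methylamine

# RunReactants returns a tuple of product tuples, one per template match.
product_sets = rxn.RunReactants((acid, amine))

product_smiles = set()
for products in product_sets:
    for product in products:
        Chem.SanitizeMol(product)  # products are returned unsanitized
        product_smiles.add(Chem.MolToSmiles(product))

print(product_smiles)
```

Collecting canonical SMILES in a set removes duplicate products when a template matches the same reactants in more than one equivalent way.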
Who Should Use This
This skill serves medicinal chemists analyzing structure-activity relationships, cheminformatics scientists building molecular property prediction models, and drug discovery teams processing chemical libraries for virtual screening.
Why Use It?
Problems It Solves
Chemical structures stored as text strings need parsing into molecular graphs before any property calculation or comparison. Computing molecular descriptors for thousands of compounds requires efficient batch processing with standardized implementations. Searching chemical libraries for molecules containing specific substructures needs pattern matching algorithms. Comparing molecular similarity requires fingerprint generation and distance metric computation.
Core Highlights
The molecule parser creates molecule objects from SMILES strings and structure files. The descriptor calculator computes hundreds of molecular properties in batch. The substructure matcher finds pattern matches using SMARTS queries. The fingerprint generator creates molecular fingerprints for similarity searching.
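As a rough illustration of the substructure matcher, the sketch below counts functional groups with SMARTS queries. The `FUNCTIONAL_GROUPS` table and the `count_groups` helper are hypothetical names made up for this example, and the SMARTS definitions are common conventions rather than an official RDKit set.

```python
from rdkit import Chem

# Illustrative SMARTS definitions (an assumption of this sketch).
FUNCTIONAL_GROUPS = {
    'carboxylic_acid': Chem.MolFromSmarts('[CX3](=O)[OX2H1]'),
    'benzene_ring': Chem.MolFromSmarts('c1ccccc1'),
}


def count_groups(smiles: str) -> dict[str, int]:
    """Count unique matches of each functional-group pattern in a molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f'Invalid SMILES: {smiles!r}')
    return {
        name: len(mol.GetSubstructMatches(pattern))
        for name, pattern in FUNCTIONAL_GROUPS.items()
    }


aspirin_counts = count_groups('CC(=O)Oc1ccccc1C(=O)O')
hexane_counts = count_groups('CCCCCC')
print(aspirin_counts)
print(hexane_counts)
```

`GetSubstructMatches` returns the matched atom indices as well, so the same loop can be extended to highlight or extract the matching atoms.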
How to Use It?
Basic Usage
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors

# Aspirin; MolFromSmiles returns None for invalid SMILES
mol = Chem.MolFromSmiles('CC(=O)Oc1ccccc1C(=O)O')

# Basic physicochemical descriptors
mw = Descriptors.MolWt(mol)
logp = Descriptors.MolLogP(mol)
hbd = Descriptors.NumHDonors(mol)
hba = Descriptors.NumHAcceptors(mol)
print(f'MW: {mw:.1f}')
print(f'LogP: {logp:.2f}')
print(f'HBD: {hbd} HBA: {hba}')

# 2048-bit Morgan fingerprint with radius 2
# (recent RDKit releases deprecate this call in favor of rdFingerprintGenerator)
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)
print(f'FP bits: {fp.GetNumOnBits()}')

# Substructure search with a SMARTS pattern for a benzene ring
pattern = Chem.MolFromSmarts('c1ccccc1')
has_ring = mol.HasSubstructMatch(pattern)
print(f'Has benzene: {has_ring}')
Real-World Examples
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, Descriptors


class MoleculeFilter:
    def __init__(self, smiles_list: list[str]):
        # Parse every SMILES and keep only those RDKit could read
        parsed = [(s, Chem.MolFromSmiles(s)) for s in smiles_list]
        self.mols = [(s, m) for s, m in parsed if m is not None]

    def lipinski(self) -> list[str]:
        """Return the SMILES that satisfy all four Rule of Five limits."""
        passed = []
        for smi, mol in self.mols:
            mw = Descriptors.MolWt(mol)
            logp = Descriptors.MolLogP(mol)
            hbd = Descriptors.NumHDonors(mol)
            hba = Descriptors.NumHAcceptors(mol)
            if mw <= 500 and logp <= 5 and hbd <= 5 and hba <= 10:
                passed.append(smi)
        return passed

    def similarity(self, query: str, threshold: float = 0.7) -> list[tuple[str, float]]:
        """Return (SMILES, Tanimoto) pairs at or above threshold, best first."""
        qmol = Chem.MolFromSmiles(query)
        if qmol is None:
            # An invalid query would otherwise fail inside the fingerprint call
            raise ValueError(f'Invalid query SMILES: {query!r}')
        qfp = AllChem.GetMorganFingerprintAsBitVect(qmol, 2, 2048)
        hits = []
        for smi, mol in self.mols:
            fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, 2048)
            sim = DataStructs.TanimotoSimilarity(qfp, fp)
            if sim >= threshold:
                hits.append((smi, sim))
        return sorted(hits, key=lambda x: -x[1])
Advanced Tips
Canonicalize SMILES strings before comparing or storing molecules so that the same structure always has the same representation, whatever the input source. Use Morgan fingerprints with radius 2 (comparable to ECFP4) as a general-purpose basis for similarity searching in most drug discovery applications. Pre-compute and cache fingerprints for library compounds to speed up repeated similarity searches.
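The first and last tips can be combined in a short sketch: deduplicate a library by canonical SMILES, fingerprint each unique structure once, then reuse the cache for every query. The helper names (`canonical_smiles`, `search`) and the tiny library are hypothetical, and the sketch assumes an RDKit release that ships `rdFingerprintGenerator`.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator


def canonical_smiles(smiles):
    """Return RDKit's canonical SMILES, or None if the input cannot be parsed."""
    mol = Chem.MolFromSmiles(smiles)
    return None if mol is None else Chem.MolToSmiles(mol)


# Three spellings of ethanol plus propane collapse to two unique structures.
raw = ['CCO', 'OCC', 'C(O)C', 'CCC']
unique = sorted({canonical_smiles(s) for s in raw})

# Build the fingerprint cache once, keyed by canonical SMILES.
generator = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
library = {smi: generator.GetFingerprint(Chem.MolFromSmiles(smi)) for smi in unique}


def search(query, threshold=0.0):
    """Score the query against every cached fingerprint in one bulk call."""
    qfp = generator.GetFingerprint(Chem.MolFromSmiles(query))
    names = list(library)
    sims = DataStructs.BulkTanimotoSimilarity(qfp, [library[n] for n in names])
    hits = [(n, s) for n, s in zip(names, sims) if s >= threshold]
    return sorted(hits, key=lambda x: -x[1])


results = search('OCC')
print(unique)
print(results)
```

For large libraries the cache could be persisted to disk instead of rebuilt per session; that part is left out here.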
When to Use It?
Use Cases
Filter a chemical library using Lipinski's Rule of Five to identify drug-like molecules for screening. Search for compounds similar to a lead molecule using Tanimoto similarity on Morgan fingerprints. Compute molecular descriptors for a dataset to build a QSAR prediction model.
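For the QSAR use case, a descriptor table might be assembled along these lines. The six descriptors, the `descriptor_matrix` and `zscore` helpers, and the sample SMILES are all assumptions made for illustration; a real model would choose its own feature set and scaling.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors

# Descriptor subset chosen for illustration only.
DESCRIPTORS = {
    'MolWt': Descriptors.MolWt,
    'MolLogP': Descriptors.MolLogP,
    'TPSA': Descriptors.TPSA,
    'NumHDonors': Descriptors.NumHDonors,
    'NumHAcceptors': Descriptors.NumHAcceptors,
    'NumRotatableBonds': Descriptors.NumRotatableBonds,
}


def descriptor_matrix(smiles_list):
    """Return the parseable SMILES and a (molecules x descriptors) array."""
    kept, rows = [], []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue  # skip entries RDKit cannot parse
        kept.append(smi)
        rows.append([fn(mol) for fn in DESCRIPTORS.values()])
    return kept, np.array(rows, dtype=float)


def zscore(X):
    """Scale each column to zero mean and unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns at zero instead of dividing by 0
    return (X - mean) / std


smiles = ['CCO', 'CC(=O)Oc1ccccc1C(=O)O', 'c1ccccc1', 'not_a_smiles']
kept, X = descriptor_matrix(smiles)
X_scaled = zscore(X)
print(kept)
print(X_scaled.round(2))
```

Scaling matters here because molecular weight spans hundreds of units while donor counts span single digits; without it, distance-based or regularized models would be dominated by the large-range columns.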
Related Topics
RDKit, cheminformatics, molecular descriptors, SMILES, fingerprints, drug discovery, and chemical informatics.
Important Notes
Requirements
RDKit Python package installed via conda or pip. Valid SMILES strings or MOL files for molecular input. NumPy for array operations on computed descriptors and fingerprints.
Usage Recommendations
Do: check MolFromSmiles return values for None since invalid SMILES produce null molecules. Canonicalize SMILES before storing to avoid duplicate representations of the same molecule. Use SMARTS patterns for flexible substructure queries that match chemical functional groups.
Don't: compare molecules by SMILES string equality, since different valid SMILES can represent the same structure. Don't feed raw descriptor values into ML models without normalization when the properties span very different ranges. Don't skip sanitization checks when reading molecules from untrusted sources, since unsanitized input can produce invalid molecular graphs.
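The None check and the sanitization check can be separated so that a rejected input says why it failed. The sketch below is one possible way to do that; `check_smiles` is a hypothetical helper, and the three status strings are labels invented for this example.

```python
from rdkit import Chem, RDLogger

RDLogger.DisableLog('rdApp.*')  # silence RDKit's parser messages for this demo


def check_smiles(smiles):
    """Classify an input as 'ok', 'syntax error', or 'sanitization failed'."""
    # Parse without sanitizing so syntax and chemistry problems are told apart.
    mol = Chem.MolFromSmiles(smiles, sanitize=False)
    if mol is None:
        return 'syntax error'
    # With catchErrors=True, SanitizeMol reports the failing step instead of raising.
    failed_step = Chem.SanitizeMol(mol, catchErrors=True)
    if failed_step != Chem.SanitizeFlags.SANITIZE_NONE:
        return 'sanitization failed'
    return 'ok'


statuses = {
    smi: check_smiles(smi)
    for smi in ['CCO', 'c1ccccc', 'C(C)(C)(C)(C)C']
}
print(statuses)
```

Here `c1ccccc` has an unclosed ring and cannot be parsed at all, while `C(C)(C)(C)(C)C` parses but describes a five-valent carbon that sanitization rejects.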
Limitations
RDKit installation through pip may lack some features available in the conda-forge build. Processing millions of molecules sequentially can be slow without parallelization. Some stereochemistry and tautomer handling requires explicit configuration for consistent results.
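One way to catch the stereochemistry issue early is to flag molecules whose stereocenters are not specified in the input. The sketch below assumes that an unassigned center is something you want reported; `stereo_report` is a hypothetical helper, and alanine is used only as a small example.

```python
from rdkit import Chem


def stereo_report(smiles):
    """Return all stereocenters and the atom indices of unassigned ones."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f'Invalid SMILES: {smiles!r}')
    # Unassigned centers are reported with the label '?'.
    centers = Chem.FindMolChiralCenters(mol, includeUnassigned=True)
    unassigned = [idx for idx, label in centers if label == '?']
    return centers, unassigned


# Alanine written without and with stereochemistry.
flat_centers, flat_unassigned = stereo_report('CC(N)C(=O)O')
chiral_centers, chiral_unassigned = stereo_report('C[C@H](N)C(=O)O')
print(flat_centers, flat_unassigned)
print(chiral_centers, chiral_unassigned)

# Enantiomers keep distinct canonical SMILES when stereo is specified.
enantiomer_a = Chem.MolToSmiles(Chem.MolFromSmiles('C[C@H](N)C(=O)O'))
enantiomer_b = Chem.MolToSmiles(Chem.MolFromSmiles('C[C@@H](N)C(=O)O'))
print(enantiomer_a != enantiomer_b)
```

Whether flagged molecules are dropped, enumerated, or kept as racemic is a project decision that this sketch does not make.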