DeepChem
Automate and integrate DeepChem for advanced deep learning in chemistry
DeepChem is a community skill for applying deep learning to chemistry and biology with the DeepChem library. It covers molecular featurization, model training, property prediction, dataset handling, and evaluation for scientific machine learning workflows.
What Is This?
Overview
DeepChem provides patterns for building machine learning models that operate on chemical and biological data. It covers molecular featurization that converts SMILES strings into graph, fingerprint, and descriptor representations; model architectures including graph neural networks, random forests, and multitask networks; MoleculeNet benchmark datasets for training and evaluating molecular property predictors; dataset management with splitting strategies tailored to chemical data; and hyperparameter tuning for scientific ML models. The skill enables researchers to build predictive models for drug discovery, materials science, and computational biology.
Who Should Use This
This skill serves computational chemists building property prediction models for drug candidates, materials scientists predicting material properties from molecular structure, and bioinformatics researchers applying deep learning to protein and genomic data. It is also well-suited for data scientists entering the cheminformatics domain who need a structured starting point with established benchmarks.
Why Use It?
Problems It Solves
Converting molecular structures into machine-learnable features requires specialized featurization that standard ML libraries lack. Chemical datasets have unique splitting requirements to avoid data leakage from similar molecules. Evaluating models on chemical prediction tasks needs domain-specific metrics like enrichment factors. Building graph neural networks for molecules requires custom graph construction from atomic connectivity.
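The leakage problem from similar molecules can be sketched without any chemistry stack. Below is a minimal group-wise splitter, assuming a scaffold key is already available for each molecule (DeepChem computes Murcko scaffolds via RDKit); the function name and the greedy largest-family-first assignment are illustrative, not DeepChem's exact algorithm.

```python
from collections import defaultdict


def scaffold_split(scaffolds, frac_train=0.8):
    """Group-wise split: every molecule sharing a scaffold lands
    entirely in train or entirely in test, so near-duplicate
    analogues never leak across the split boundary."""
    groups = defaultdict(list)
    for idx, scaf in enumerate(scaffolds):
        groups[scaf].append(idx)
    # Assign the largest scaffold families to train first.
    ordered = sorted(groups.values(), key=len, reverse=True)
    n_train = int(frac_train * len(scaffolds))
    train, test = [], []
    for members in ordered:
        if len(train) + len(members) <= n_train:
            train.extend(members)
        else:
            test.extend(members)
    return train, test
```

A random split of the same data would scatter each scaffold family across both sides, letting the model score well by memorizing near-duplicates.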
Core Highlights
Featurizers convert SMILES to graphs, fingerprints, Coulomb matrices, and other molecular representations. Model zoo includes graph convolutional, attentive FP, and multitask architectures ready for training. MoleculeNet loaders provide curated benchmark datasets with standard splits. Evaluation metrics cover ROC-AUC, enrichment, and regression accuracy for chemical predictions.
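Of the metrics listed, enrichment factor is the one standard ML libraries rarely ship. A pure-Python sketch of the usual definition follows; the function name and signature are illustrative, not DeepChem's API.

```python
def enrichment_factor(scores, labels, fraction=0.01):
    """EF@fraction: hit-rate of actives among the top-ranked
    fraction of the library, divided by the hit-rate of the
    whole library. EF > 1 means the model front-loads actives."""
    n = len(scores)
    n_top = max(1, int(n * fraction))
    # Rank compounds by predicted score, best first.
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    top_actives = sum(lbl for _, lbl in ranked[:n_top])
    total_actives = sum(labels)
    if total_actives == 0:
        return 0.0
    return (top_actives / n_top) / (total_actives / n)
```

For example, a model that puts both actives of a four-compound library in the top half achieves EF@50% = 2.0, the maximum for that active rate.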
How to Use It?
Basic Usage
```python
import deepchem as dc

# Load the Delaney solubility benchmark with graph features
# and a scaffold split.
tasks, datasets, transformers = dc.molnet.load_delaney(
    featurizer="GraphConv",
    splitter="scaffold")
train, valid, test = datasets
print(f"Tasks: {tasks}")
print(f"Train: {len(train)}")
print(f"Valid: {len(valid)}")
print(f"Test: {len(test)}")

# Train a graph convolutional regressor.
model = dc.models.GraphConvModel(
    n_tasks=len(tasks),
    mode="regression",
    batch_size=32,
    learning_rate=0.001)
model.fit(train, nb_epoch=50)

# Evaluate with Pearson R^2 on the held-out scaffold split.
metric = dc.metrics.Metric(dc.metrics.pearson_r2_score)
scores = model.evaluate(test, [metric])
print(f"Test R2: {scores}")
```

Real-World Examples
```python
import deepchem as dc
import numpy as np


def train_property_predictor(
        smiles: list[str],
        labels: list[float],
        featurizer: str = "ECFP",
        model_type: str = "rf") -> dict:
    # Choose a featurizer: ECFP fingerprints for classical models,
    # graph features for graph networks.
    if featurizer == "ECFP":
        feat = dc.feat.CircularFingerprint(size=2048)
    else:
        feat = dc.feat.ConvMolFeaturizer()
    dataset = dc.data.NumpyDataset(
        X=feat.featurize(smiles),
        y=np.array(labels))

    # Scaffold split keeps structurally similar molecules
    # on the same side of the train/test boundary.
    splitter = dc.splits.ScaffoldSplitter()
    train, test = splitter.train_test_split(dataset, frac_train=0.8)

    if model_type == "rf":
        # Random forests expect fingerprint features ("ECFP").
        from sklearn.ensemble import RandomForestRegressor
        sk_model = RandomForestRegressor(n_estimators=100)
        model = dc.models.SklearnModel(model=sk_model)
    else:
        # Graph models expect ConvMol features.
        model = dc.models.GraphConvModel(n_tasks=1, mode="regression")
    model.fit(train)

    metric = dc.metrics.Metric(dc.metrics.pearson_r2_score)
    return {"train_r2": model.evaluate(train, [metric]),
            "test_r2": model.evaluate(test, [metric]),
            "n_train": len(train),
            "n_test": len(test)}
```

Advanced Tips
Use scaffold splitting instead of random splitting to test model generalization to novel chemical series. Combine multiple featurization strategies as input channels for ensemble models. Apply transformers like NormalizationTransformer to labels before training regression models for improved convergence. When working with imbalanced classification tasks such as toxicity prediction, apply BalancingTransformer to reweight minority class samples during training.
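Both transformer tips above can be illustrated in a few lines of standard-library Python. This is a conceptual sketch of what NormalizationTransformer and BalancingTransformer do to labels and sample weights, not their actual DeepChem implementations; the helper names are made up for illustration.

```python
import statistics
from collections import Counter


def normalize_labels(y):
    """Z-score regression labels (zero mean, unit variance), as a
    NormalizationTransformer would, and return an inverse map for
    un-transforming model predictions."""
    mu = statistics.fmean(y)
    sigma = statistics.pstdev(y) or 1.0
    z = [(v - mu) / sigma for v in y]
    return z, lambda v: v * sigma + mu


def balancing_weights(labels):
    """Per-sample weights inversely proportional to class frequency,
    so minority-class examples contribute more to the loss."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return [n / (k * counts[lbl]) for lbl in labels]
```

With one active out of four samples, for instance, the active receives weight 2.0 and each inactive 2/3, so both classes contribute equally in aggregate.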
When to Use It?
Use Cases
Build a solubility predictor that estimates aqueous solubility from molecular structure for drug candidates. Create a toxicity screening model that flags potentially toxic compounds early in discovery. Implement a virtual screening pipeline that ranks compounds by predicted binding affinity to a target protein.
Related Topics
Scientific machine learning, molecular property prediction, graph neural networks, drug discovery, and computational chemistry.
Important Notes
Requirements
Python with the deepchem package installed. TensorFlow or PyTorch backend depending on the chosen model architecture. RDKit for molecular featurization operations. GPU access is recommended for training graph neural network architectures on large datasets.
Usage Recommendations
Do: use scaffold splitting for chemical datasets to realistically evaluate model generalization; start with MoleculeNet benchmarks to compare model architectures before training on custom data; and report metrics on held-out scaffold splits for fair comparison.
Don't: use random splitting on molecular datasets, which inflates test scores because similar molecules land in both splits; train graph models on very small datasets, where simpler fingerprint methods outperform them; or ignore class imbalance in activity prediction tasks.
Limitations
Model performance depends heavily on training data quality and chemical diversity. Graph neural networks require more data and compute than fingerprint models to achieve comparable accuracy, and their training times are substantially longer for comparable dataset sizes. Predictions for molecules outside the training distribution are unreliable, so assess the applicability domain before deploying any model to production.