TorchDrug

Automate and integrate TorchDrug for scalable drug discovery and molecular learning pipelines

TorchDrug is a community skill for machine learning on drug discovery tasks with PyTorch, covering molecular property prediction, retrosynthesis planning, protein representation learning, and generative chemistry for computational drug design.

What Is This?

Overview

TorchDrug provides guidance on applying deep learning to drug discovery and molecular science using the TorchDrug library. It covers five core areas: molecular property prediction, which trains graph neural networks on molecular structures to estimate chemical properties such as solubility and toxicity; retrosynthesis planning, which decomposes target molecules into purchasable building blocks through learned reaction rules; protein representation learning, which encodes amino acid sequences and 3D structures into embeddings for function prediction; generative molecule design, which creates novel compounds optimized for desired properties through reinforcement learning; and knowledge graph reasoning, which predicts drug-target interactions from biomedical relationship data. The skill helps researchers and pharmaceutical teams accelerate computational drug discovery workflows from hit identification through lead optimization.

Who Should Use This

This skill serves computational chemists building molecular property models, drug discovery researchers automating synthesis planning, and bioinformatics engineers predicting protein-ligand interactions. It is also well suited for machine learning engineers integrating graph neural network pipelines into existing cheminformatics workflows.

Why Use It?

Problems It Solves

Molecular property prediction requires specialized graph representations that encode atom types and bond connectivity. Retrosynthesis planning involves searching large reaction spaces that are intractable without learned heuristics. Protein function prediction from sequence alone misses structural information that determines binding behavior. Generating chemically valid molecules with desired target properties requires constrained optimization over discrete chemical graph structures. Without a unified framework, assembling these components from separate libraries introduces significant integration overhead and inconsistent data handling.

Core Highlights

Molecular encoder builds graph-based representations from SMILES strings. Property predictor trains on extracted molecular graph features. Retrosynthesis planner decomposes target molecules into reaction steps. Generator designs novel candidate molecules with target properties.
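The encoder step described above can be illustrated, outside TorchDrug, with a toy featurizer that turns an atom list and bond list into one-hot node features and a bidirectional edge list. This is a minimal sketch of the idea only; `featurize` and the four-element atom vocabulary are hypothetical, and TorchDrug itself derives these features from SMILES strings via RDKit.

```python
# Toy molecular-graph featurizer: one-hot atom features plus a bond edge list.
# Hypothetical sketch only -- not TorchDrug's API.
ATOM_TYPES = ["C", "N", "O", "S"]  # assumed vocabulary for this sketch


def featurize(atoms, bonds):
    """atoms: list of element symbols; bonds: list of (i, j) index pairs."""
    features = []
    for symbol in atoms:
        # one-hot encode the atom type against the fixed vocabulary
        features.append([1 if symbol == t else 0 for t in ATOM_TYPES])
    # store each bond in both directions, as undirected molecular graphs require
    edges = [(i, j) for i, j in bonds] + [(j, i) for i, j in bonds]
    return features, edges


# Ethanol heavy atoms: C-C-O
feats, edges = featurize(["C", "C", "O"], [(0, 1), (1, 2)])
```

A graph neural network then passes messages along `edges` to update the per-atom `feats`, which a readout (such as the mean pooling used in the Basic Usage example) aggregates into a molecule-level embedding.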

How to Use It?

Basic Usage

import torch
from torchdrug import core, datasets, models, tasks

dataset = datasets.ClinTox('data/',
                           atom_feature='default',
                           bond_feature='default')

# ClinTox ships without a predefined split; carve out 80/10/10 subsets
lengths = [int(0.8 * len(dataset)), int(0.1 * len(dataset))]
lengths += [len(dataset) - sum(lengths)]
train, valid, test = torch.utils.data.random_split(dataset, lengths)

model = models.GIN(input_dim=dataset.node_feature_dim,
                   hidden_dims=[256, 256, 256, 256],
                   batch_norm=True,
                   readout='mean')

task = tasks.PropertyPrediction(model,
                                task=dataset.tasks,
                                criterion='bce',
                                metric=('auprc', 'auroc'))

optimizer = torch.optim.Adam(task.parameters(), lr=1e-3)

solver = core.Engine(task, train, valid, test, optimizer,
                     batch_size=256, gpus=[0])
solver.train(num_epoch=100)
solver.evaluate('test')

Real-World Examples

import torch
from torchdrug import core, datasets, models, tasks

# USPTO-50k prepared for reaction-center identification
dataset = datasets.USPTO50k('data/',
                            atom_feature='center_identification',
                            kekulize=True)
train, valid, test = dataset.split()

model = models.RGCN(input_dim=dataset.node_feature_dim,
                    num_relation=dataset.num_relation,
                    hidden_dims=[256, 256, 256])

task = tasks.CenterIdentification(model,
                                  feature=('graph', 'atom', 'bond'))

optimizer = torch.optim.Adam(task.parameters(), lr=1e-3)

solver = core.Engine(task, train, valid, test, optimizer,
                     batch_size=128)
solver.train(num_epoch=50)

Advanced Tips

Pretrain molecular encoders on large unlabeled datasets using self-supervised contrastive learning before fine-tuning on small labeled property datasets. Use multi-task learning to predict related properties jointly, such as solubility alongside toxicity, for improved generalization across chemical space. Combine 2D graph and 3D coordinate features for conformational property prediction.
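The multi-task idea above amounts to sharing one encoder across several property heads and combining their losses while skipping molecules that lack a label for a given task. A minimal sketch of that masked joint loss, in plain Python with `None` marking a missing label (hypothetical helper, not TorchDrug's internal implementation):

```python
import math


# Toy multi-task loss: average binary cross-entropy over tasks, skipping
# missing labels. Sketch only -- TorchDrug's PropertyPrediction task
# handles this masking internally when given multiple targets.
def multitask_bce(predictions, labels):
    """predictions/labels: per-task lists; labels may contain None."""
    total, count = 0.0, 0
    for p, y in zip(predictions, labels):
        if y is None:                      # unlabeled task for this molecule
            continue
        p = min(max(p, 1e-7), 1 - 1e-7)    # clamp for numerical stability
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
        count += 1
    return total / count


# e.g. solubility labeled 1, toxicity labeled 0, third property unlabeled
loss = multitask_bce([0.9, 0.2, 0.5], [1, 0, None])
```

Averaging only over labeled tasks keeps sparsely annotated properties from dragging the shared encoder toward arbitrary targets.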

When to Use It?

Use Cases

Predict toxicity and solubility of drug candidate molecules from SMILES representations. Plan synthesis routes for target molecules using learned retrosynthesis models. Generate novel molecular structures optimized for binding affinity.

Related Topics

PyTorch, drug discovery, molecular graphs, GNN, retrosynthesis, cheminformatics, and protein structure prediction.

Important Notes

Requirements

PyTorch with TorchDrug and RDKit installed for molecular graph construction and chemical data processing. SMILES strings or molecular structure files such as SDF and MOL2 as input for building graph-based molecular representations with atom and bond features. GPU resources for training deep graph networks on large molecular datasets with many atom features and bond descriptors that require substantial memory during batch processing.

Usage Recommendations

Do: use scaffold splitting instead of random splitting to evaluate generalization to structurally novel molecules. Normalize molecular features and property targets before training. Validate predictions against experimental data before making synthesis decisions.
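The scaffold-splitting recommendation above can be sketched as a group-by split: molecules sharing a Bemis-Murcko scaffold stay in the same fold, so the held-out set contains structurally novel chemotypes. This sketch assumes scaffolds are already computed as strings (RDKit would normally derive them); `scaffold_split` and its greedy fill order are illustrative, not TorchDrug's implementation.

```python
# Group-by-scaffold split: no scaffold appears in both train and test.
# Hypothetical sketch assuming precomputed scaffold strings.
def scaffold_split(scaffolds, train_frac=0.8):
    groups = {}
    for idx, scaffold in enumerate(scaffolds):
        groups.setdefault(scaffold, []).append(idx)
    # fill train with the largest scaffold groups first, as is conventional
    ordered = sorted(groups.values(), key=len, reverse=True)
    cutoff = int(train_frac * len(scaffolds))
    train, test = [], []
    for group in ordered:
        (train if len(train) + len(group) <= cutoff else test).extend(group)
    return train, test


# two benzene-scaffold molecules, one cyclohexane, two acyclic
train_idx, test_idx = scaffold_split(
    ["c1ccccc1", "c1ccccc1", "C1CCCCC1", "CCO", "CCO"])
```

A random split would scatter each scaffold across folds, letting the model score well by memorizing chemotypes rather than generalizing to new ones.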

Don't: train on small datasets without pretraining or transfer learning since molecular property data is expensive to generate. Ignore stereochemistry in molecular representations when it affects the target property. Trust single-model predictions for critical drug design decisions without ensemble validation.
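The ensemble-validation point above can be made concrete with a deep-ensemble check: average the predictions of several independently trained models and treat their spread as a rough disagreement signal before acting on a prediction. `ensemble_predict` is a hypothetical helper for illustration, not part of TorchDrug.

```python
import statistics


# Toy deep-ensemble check: mean prediction plus spread across models.
# Hypothetical sketch -- a real pipeline would load trained TorchDrug models.
def ensemble_predict(model_outputs):
    """model_outputs: per-model predicted probabilities for one molecule."""
    mean = statistics.mean(model_outputs)
    spread = statistics.pstdev(model_outputs)  # disagreement signal
    return mean, spread


mean, spread = ensemble_predict([0.91, 0.88, 0.35])
# a large spread flags model disagreement worth investigating before synthesis
```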

Limitations

Model accuracy depends heavily on training data quality and chemical diversity of the dataset. Generative models may produce chemically valid but synthetically inaccessible molecules, making synthetic accessibility scoring an important post-generation filter. 3D conformer-dependent properties require explicit geometry handling and conformer ensemble generation that adds significant computational cost and complexity to the entire training pipeline.