PyTDC
Comprehensive PyTDC automation and integration for Therapeutics Data Commons research
PyTDC is a community skill for accessing the Therapeutics Data Commons (TDC) through the PyTDC Python library, covering drug discovery datasets, molecular property prediction, drug-target interaction data, ADMET benchmarks, and dataset splitting for machine learning in drug development.
What Is This?
Overview
PyTDC provides tools for accessing curated datasets and benchmarks for therapeutic science through a unified Python interface. It covers drug discovery datasets (molecular structures with activity labels for target-specific screening), molecular property prediction data (training sets for solubility, toxicity, and permeability models), drug-target interaction data (compounds paired with protein targets and binding affinities), ADMET benchmarks (standardized evaluation of absorption, distribution, metabolism, excretion, and toxicity models), and dataset splitting utilities (scaffold-aware train/test partitions for realistic evaluation). The skill enables researchers to benchmark ML models on therapeutic tasks and compare results against published leaderboard entries using consistent protocols.
Who Should Use This
This skill serves computational chemists building drug discovery ML models, researchers benchmarking molecular property prediction methods, and data scientists working on therapeutic applications with standardized datasets.
Why Use It?
Problems It Solves
Drug discovery datasets are scattered across publications and databases, requiring manual collection and curation. Comparing ML models across papers is unreliable when different data splits and preprocessing are used. ADMET property datasets lack standardized benchmarks for fair method comparison. Random data splitting in molecular ML ignores scaffold bias and produces overly optimistic evaluations, sometimes inflating reported performance by ten percentage points or more relative to scaffold-based evaluation.
Core Highlights
Dataset loader provides curated therapeutic data through a unified API. Benchmark suite standardizes evaluation across drug discovery tasks. Splitter creates scaffold-aware partitions for realistic model assessment. Task organizer groups datasets by therapeutic prediction category.
How to Use It?
Basic Usage
from tdc.single_pred import ADME, Tox
from tdc.utils import retrieve_label_name_list

# Load the Caco-2 cell permeability dataset
data = ADME(name='Caco2_Wang')
df = data.get_data()
print(f'Samples: {len(df)}')
print(f'Columns: {list(df.columns)}')

# Scaffold split for realistic train/valid/test evaluation
split = data.get_split(method='scaffold')
train, valid, test = split['train'], split['valid'], split['test']
print(f'Train: {len(train)} Val: {len(valid)} Test: {len(test)}')

# retrieve_label_name_list applies to multi-label datasets such as Tox21
names = retrieve_label_name_list('Tox21')
print(f'Labels: {names}')

# Load the hERG cardiotoxicity dataset
tox = Tox(name='hERG')
tox_df = tox.get_data()
print(f'Tox samples: {len(tox_df)}')
Real-World Examples
from tdc.benchmark_group import admet_group

class ADMETBenchmark:
    def __init__(self):
        self.group = admet_group(path='data/')
        self.results = {}

    def evaluate(self, model_fn, model_name: str) -> dict:
        predictions = {}
        for task in self.group.dataset_names:
            # Each benchmark is a dict with 'name', 'train_val', and 'test'
            benchmark = self.group.get(task)
            name = benchmark['name']
            train, test = benchmark['train_val'], benchmark['test']
            predictions[name] = model_fn(train, test)
        results = self.group.evaluate(predictions)
        self.results[model_name] = results
        return results
Advanced Tips
Use scaffold splitting consistently across all experiments since random splits inflate performance metrics by leaking structural information between train and test sets. Combine multiple TDC datasets to train multi-task models that share representations across related ADMET endpoints, which is particularly effective for endpoints with limited labeled data such as human bioavailability. Use the benchmark group API for standardized evaluation that matches published leaderboard protocols.
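The multi-task idea above can be sketched with a small helper that joins several single-task TDC dataframes into one table. This is a minimal sketch, not part of the PyTDC API: `build_multitask_table` and `task_frames` are hypothetical names, and it assumes each dataframe has the 'Drug' (SMILES) and 'Y' (label) columns that `get_data()` returns for single-prediction tasks.

```python
import pandas as pd

def build_multitask_table(task_frames):
    """Outer-join single-task dataframes into one multi-task table.

    task_frames: dict mapping task name -> dataframe with 'Drug'
    and 'Y' columns (as returned by e.g. ADME(name=...).get_data()).
    Compounds missing a label for a task get NaN, which a
    multi-task loss can mask out during training.
    """
    merged = None
    for task, df in task_frames.items():
        sub = df[['Drug', 'Y']].rename(columns={'Y': task})
        merged = sub if merged is None else merged.merge(sub, on='Drug', how='outer')
    return merged
```

With this layout, a shared encoder can be trained on all columns at once, masking NaN entries in the loss so sparse endpoints such as bioavailability still benefit from the denser tasks.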
When to Use It?
Use Cases
Benchmark a molecular property prediction model on standardized ADMET datasets with scaffold splits for fair comparison. Access drug-target interaction data to train binding affinity prediction models. Download curated toxicity datasets for building safety screening models in drug development. Retrieve permeability and solubility data to support early-stage candidate prioritization workflows.
Related Topics
PyTDC, drug discovery, molecular ML, ADMET, therapeutic data, benchmark datasets, and cheminformatics.
Important Notes
Requirements
PyTDC Python package with rdkit for molecular processing. Internet access for initial dataset downloads from the TDC servers. Sufficient disk space for caching downloaded datasets locally.
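Both dependencies are distributed on PyPI; assuming a recent pip, a typical setup looks like:

```shell
# PyTDC is the package name on PyPI
pip install PyTDC
# rdkit ships prebuilt wheels for common platforms
pip install rdkit
```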
Usage Recommendations
Do: use scaffold splits rather than random splits for molecular datasets to get realistic performance estimates. Report results using the standard benchmark group metrics for comparability with published methods. Cache downloaded datasets locally to avoid repeated downloads.
Don't: train and evaluate on random splits then compare with leaderboard results that use scaffold splits. Combine datasets from different TDC tasks without checking for overlapping molecules. Use raw SMILES strings without canonicalization since different representations of the same molecule cause data leakage.
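The canonicalization and overlap checks above can be sketched with RDKit (which this skill already requires). `canonical_overlap` is a hypothetical helper name, not a TDC function; it compares two SMILES lists on RDKit canonical form rather than raw strings.

```python
from rdkit import Chem

def canonical_overlap(smiles_a, smiles_b):
    """Return molecules present in both lists, compared on
    RDKit canonical SMILES so that equivalent representations
    (e.g. 'C(C)O' and 'CCO') are recognized as the same compound."""
    def canon(smiles_list):
        out = set()
        for s in smiles_list:
            mol = Chem.MolFromSmiles(s)
            if mol is not None:  # skip unparseable SMILES
                out.add(Chem.MolToSmiles(mol))
        return out
    return canon(smiles_a) & canon(smiles_b)
```

Running this across the 'Drug' columns of two TDC datasets before combining them flags shared compounds that would otherwise leak between train and test sets.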
Limitations
Dataset sizes for some therapeutic tasks are small, which limits deep learning model training. Benchmark datasets may not reflect the chemical diversity of real drug discovery screening libraries. Some experimental labels carry measurement noise that sets an upper bound on achievable prediction accuracy.