PyTDC
Comprehensive PyTDC automation and integration for Therapeutics Data Commons research
PyTDC is a community skill for accessing the Therapeutics Data Commons (TDC) through the PyTDC Python library, covering drug discovery datasets, molecular property prediction, drug-target interaction data, ADMET benchmarks, and dataset splitting for machine learning in drug development.
What Is This?
Overview
PyTDC provides tools for accessing curated datasets and benchmarks for therapeutic science through a unified Python interface. It covers drug discovery datasets (molecular structures with activity labels for target-specific screening), molecular property prediction data (training sets for solubility, toxicity, and permeability models), drug-target interaction data (compounds paired with protein targets and binding affinities), ADMET benchmarks (standardized evaluation of absorption, distribution, metabolism, excretion, and toxicity models), and dataset splitting utilities (scaffold-aware train/test partitions for realistic evaluation). The skill enables researchers to benchmark ML models on therapeutic tasks and compare results against published leaderboard entries using consistent protocols.
Who Should Use This
This skill serves computational chemists building drug discovery ML models, researchers benchmarking molecular property prediction methods, and data scientists working on therapeutic applications with standardized datasets.
Why Use It?
Problems It Solves
Drug discovery datasets are scattered across publications and databases, requiring manual collection and curation. Comparing ML models across papers is unreliable when different data splits and preprocessing are used. ADMET property datasets lack standardized benchmarks for fair method comparison. Random data splitting in molecular ML ignores scaffold bias and produces overly optimistic evaluations, sometimes inflating reported performance by ten percentage points or more relative to scaffold-based evaluation.
Core Highlights
Dataset loader provides curated therapeutic data through a unified API. Benchmark suite standardizes evaluation across drug discovery tasks. Splitter creates scaffold-aware partitions for realistic model assessment. Task organizer groups datasets by therapeutic prediction category.
How to Use It?
Basic Usage
from tdc.single_pred import ADME, Tox
from tdc.utils import retrieve_label_name_list

# Load the Caco-2 cell permeability dataset
data = ADME(name='Caco2_Wang')
df = data.get_data()
print(f'Samples: {len(df)}')
print(f'Columns: {list(df.columns)}')

# Scaffold split for realistic train/valid/test evaluation
split = data.get_split(method='scaffold')
train, valid, test = split['train'], split['valid'], split['test']
print(f'Train: {len(train)} Val: {len(valid)} Test: {len(test)}')

# retrieve_label_name_list applies to multi-label datasets such as Tox21
names = retrieve_label_name_list('Tox21')
print(f'Labels: {names}')

# Load the hERG cardiotoxicity dataset
tox = Tox(name='hERG')
tox_df = tox.get_data()
print(f'Tox samples: {len(tox_df)}')
Real-World Examples
from tdc.benchmark_group import admet_group

class ADMETBenchmark:
    def __init__(self):
        self.group = admet_group(path='data/')
        self.results = {}

    def evaluate(self, model_fn, model_name: str) -> dict:
        predictions = {}
        for task in self.group.dataset_names:
            # Each benchmark is a dict with 'name', 'train_val', and 'test'
            benchmark = self.group.get(task)
            name = benchmark['name']
            train, test = benchmark['train_val'], benchmark['test']
            predictions[name] = model_fn(train, test)
        results = self.group.evaluate(predictions)
        self.results[model_name] = results
        return results
Advanced Tips
Use scaffold splitting consistently across all experiments since random splits inflate performance metrics by leaking structural information between train and test sets. Combine multiple TDC datasets to train multi-task models that share representations across related ADMET endpoints, which is particularly effective for endpoints with limited labeled data such as human bioavailability. Use the benchmark group API for standardized evaluation that matches published leaderboard protocols.
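The multi-task idea above can be sketched with a small helper that joins several single-task TDC dataframes into one table. This is a minimal sketch, not part of the PyTDC API: `build_multitask_table` and `task_frames` are hypothetical names, and it assumes each dataframe has the 'Drug' (SMILES) and 'Y' (label) columns that `get_data()` returns for single-prediction tasks.

```python
import pandas as pd

def build_multitask_table(task_frames):
    """Outer-join single-task dataframes into one multi-task table.

    task_frames: dict mapping task name -> dataframe with 'Drug'
    and 'Y' columns (as returned by e.g. ADME(name=...).get_data()).
    Compounds missing a label for a task get NaN, which a
    multi-task loss can mask out during training.
    """
    merged = None
    for task, df in task_frames.items():
        sub = df[['Drug', 'Y']].rename(columns={'Y': task})
        merged = sub if merged is None else merged.merge(sub, on='Drug', how='outer')
    return merged
```

With this layout, a shared encoder can be trained on all columns at once, masking NaN entries in the loss so sparse endpoints such as bioavailability still benefit from the denser tasks.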
When to Use It?
Use Cases
Benchmark a molecular property prediction model on standardized ADMET datasets with scaffold splits for fair comparison. Access drug-target interaction data to train binding affinity prediction models. Download curated toxicity datasets for building safety screening models in drug development. Retrieve permeability and solubility data to support early-stage candidate prioritization workflows.
Related Topics
PyTDC, drug discovery, molecular ML, ADMET, therapeutic data, benchmark datasets, and cheminformatics.
Important Notes
Requirements
PyTDC Python package with rdkit for molecular processing. Internet access for initial dataset downloads from the TDC servers. Sufficient disk space for caching downloaded datasets locally.
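Both dependencies are distributed on PyPI; assuming a recent pip, a typical setup looks like:

```shell
# PyTDC is the package name on PyPI
pip install PyTDC
# rdkit ships prebuilt wheels for common platforms
pip install rdkit
```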
Usage Recommendations
Do: use scaffold splits rather than random splits for molecular datasets to get realistic performance estimates. Report results using the standard benchmark group metrics for comparability with published methods. Cache downloaded datasets locally to avoid repeated downloads.
Don't: train and evaluate on random splits then compare with leaderboard results that use scaffold splits. Combine datasets from different TDC tasks without checking for overlapping molecules. Use raw SMILES strings without canonicalization since different representations of the same molecule cause data leakage.
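The canonicalization and overlap checks above can be sketched with RDKit (which this skill already requires). `canonical_overlap` is a hypothetical helper name, not a TDC function; it compares two SMILES lists on RDKit canonical form rather than raw strings.

```python
from rdkit import Chem

def canonical_overlap(smiles_a, smiles_b):
    """Return molecules present in both lists, compared on
    RDKit canonical SMILES so that equivalent representations
    (e.g. 'C(C)O' and 'CCO') are recognized as the same compound."""
    def canon(smiles_list):
        out = set()
        for s in smiles_list:
            mol = Chem.MolFromSmiles(s)
            if mol is not None:  # skip unparseable SMILES
                out.add(Chem.MolToSmiles(mol))
        return out
    return canon(smiles_a) & canon(smiles_b)
```

Running this across the 'Drug' columns of two TDC datasets before combining them flags shared compounds that would otherwise leak between train and test sets.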
Limitations
Dataset sizes for some therapeutic tasks are small, which limits deep learning model training. Benchmark datasets may not reflect the chemical diversity of real drug discovery screening libraries. Some experimental labels carry measurement noise that sets an upper bound on achievable prediction accuracy.