PathML

Streamline pathology machine learning workflows with PathML automation and integration

PathML is a community skill for analyzing whole slide pathology images using the PathML library, covering slide loading, tile extraction, preprocessing pipelines, feature computation, and machine learning integration for computational pathology workflows.

What Is This?

Overview

PathML provides tools for processing digital pathology images in computational research. The skill covers five areas: slide loading, which reads whole slide image formats including SVS, NDPI, and TIFF at multiple magnification levels; tile extraction, which divides large slides into manageable patches; preprocessing pipelines, which apply stain normalization, tissue detection, and artifact removal; feature computation, which extracts morphological and texture descriptors from tissue regions; and machine learning integration, which connects processed tiles with classification and segmentation models. Together these let pathology researchers build reproducible, scalable analysis pipelines for both exploratory research and production diagnostic workflows.

Who Should Use This

This skill serves computational pathology researchers analyzing tissue slides, biomedical engineers building diagnostic AI models from histology images, and pathology labs automating slide analysis workflows.

Why Use It?

Problems It Solves

Whole slide images are gigapixel-scale files that cannot be loaded into memory at once for processing. Different slide scanners produce files in incompatible formats requiring format-specific readers. Staining variability across slides and labs introduces batch effects that degrade model performance. Manual tissue region selection for analysis is time-consuming and subjective, making large cohort studies impractical without automated tooling.
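
To make the memory constraint concrete, the arithmetic can be sketched as follows. The slide dimensions are an illustrative assumption, not a measurement from any particular scanner, and `uncompressed_bytes` is a hypothetical helper rather than part of PathML.

```python
def uncompressed_bytes(width: int, height: int, channels: int = 3,
                       bytes_per_channel: int = 1) -> int:
    """Size of a fully decoded image held in memory as a dense array."""
    return width * height * channels * bytes_per_channel


# Assumed example: a 100,000 x 80,000 pixel RGB slide at full resolution.
size = uncompressed_bytes(100_000, 80_000)
print(f"{size / 1e9:.0f} GB uncompressed")
```

At that size a single decoded slide would exceed the RAM of most workstations, which is why tile-wise reading is the usual approach.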

Core Highlights

The slide reader loads whole slide images from multiple scanner formats. The tile extractor generates image patches at specified magnification levels. The preprocessing engine normalizes stains and detects tissue regions. The feature extractor computes morphological descriptors from processed tiles.
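
The idea behind tile extraction can be sketched independently of any library. The helper below is hypothetical (`tile_origins` is not a PathML function) and assumes a regular grid in which partial tiles at the right and bottom edges are dropped.

```python
from typing import Iterator, Optional, Tuple


def tile_origins(width: int, height: int, tile_size: int,
                 stride: Optional[int] = None) -> Iterator[Tuple[int, int]]:
    """Yield the (x, y) top-left corner of every full tile inside the image."""
    stride = stride or tile_size  # non-overlapping tiles unless a stride is given
    for y in range(0, height - tile_size + 1, stride):
        for x in range(0, width - tile_size + 1, stride):
            yield x, y


# Assumed toy dimensions; a real slide would be far larger.
origins = list(tile_origins(width=1024, height=512, tile_size=256))
print(len(origins), origins[0], origins[-1])
```

A stride smaller than the tile size produces overlapping tiles, which some segmentation workflows use to reduce edge effects.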

How to Use It?

Basic Usage

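from pathml.core import SlideData
from pathml.preprocessing import (
    Pipeline,
    StainNormalizationHE,
    TissueDetectionHE,
)

# Open the slide; the backend is inferred from the file extension.
slide = SlideData('slide.svs', name='sample_001')

print(f'Dimensions: {slide.shape}')
print(f'Levels: {slide.slide.level_count}')

# Transforms are chained in a Pipeline and applied tile by tile.
# threshold is left unset so the cut-off is chosen automatically.
pipeline = Pipeline([
    TissueDetectionHE(mask_name='tissue', use_saturation=True),
    StainNormalizationHE(
        target='normalize',
        stain_estimation_method='macenko'),
])

# distributed=False runs in-process instead of on a dask cluster.
slide.run(pipeline, distributed=False, tile_size=256)

# Inspect the first few processed tiles (keys is a property, not a method).
for tile_key in slide.tiles.keys[:5]:
    tile = slide.tiles[tile_key]
    print(f'Tile: {tile_key} shape: {tile.image.shape}')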

Real-World Examples

import numpy as np
from pathml.core import SlideData
from pathml.preprocessing import (
    Pipeline,
    StainNormalizationHE,
    TissueDetectionHE,
)


class PathologyPipeline:
    """Detect tissue, normalize stains, then summarize each tile."""

    def __init__(self, tile_size: int = 256, level: int = 0):
        self.tile_size = tile_size
        self.level = level

    def process_slide(self, slide_path: str) -> dict:
        slide = SlideData(slide_path)
        pipeline = Pipeline([
            TissueDetectionHE(mask_name='tissue'),
            StainNormalizationHE(),
        ])
        # Tile size and pyramid level come from the constructor.
        slide.run(
            pipeline,
            distributed=False,
            tile_size=self.tile_size,
            level=self.level)

        features = [
            self.extract_features(slide.tiles[key].image)
            for key in slide.tiles.keys
        ]

        return {
            'slide': slide_path,
            'tiles': len(features),
            'features': np.array(features),
        }

    def extract_features(self, image: np.ndarray) -> list:
        # Simple intensity statistics used as placeholder features.
        return [
            float(image.mean()),
            float(image.std()),
            float(np.median(image)),
        ]

Advanced Tips

Use lower magnification levels for initial tissue detection to speed up the region selection step before extracting high-resolution tiles. Apply stain normalization consistently across all slides in a study to reduce batch effects in downstream analysis. Save processed tile datasets in HDF5 format for efficient loading during model training iterations. When processing large cohorts, consider parallelizing slide-level operations across multiple workers to reduce total pipeline runtime significantly.
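
The parallelization tip can be sketched with the standard library alone. Everything here is an assumption for illustration: `process_cohort` and `fake_process` are hypothetical names, and `process_fn` stands in for whatever per-slide routine a project actually uses.

```python
from concurrent.futures import (Executor, ProcessPoolExecutor,
                                ThreadPoolExecutor, as_completed)
from typing import Callable, Dict, Iterable, Type


def process_cohort(slide_paths: Iterable[str],
                   process_fn: Callable[[str], dict],
                   max_workers: int = 4,
                   executor_cls: Type[Executor] = ProcessPoolExecutor
                   ) -> Dict[str, dict]:
    """Run process_fn on each slide in a worker pool; collect results by path."""
    results: Dict[str, dict] = {}
    with executor_cls(max_workers=max_workers) as pool:
        futures = {pool.submit(process_fn, path): path for path in slide_paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # one bad slide should not stop the cohort
                results[path] = {"error": str(exc)}
    return results


def fake_process(path: str) -> dict:
    # Stand-in for real slide processing (hypothetical).
    return {"slide": path, "tiles": len(path)}


# Threads keep this demo self-contained; a process pool suits CPU-bound work.
demo = process_cohort(["a.svs", "b.svs"], fake_process,
                      max_workers=2, executor_cls=ThreadPoolExecutor)
```

With the default process pool, `process_fn` must be a picklable module-level function, and each worker opens its own slide handle rather than sharing one.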

When to Use It?

Use Cases

Process whole slide images into normalized tile datasets for training a tissue classification model. Build an automated pipeline that detects tissue regions and extracts features from pathology slides. Prepare a multi-slide dataset with consistent preprocessing for a computational pathology study.

Related Topics

Computational pathology, whole slide images, digital pathology, stain normalization, tile extraction, histology, and biomedical imaging.

Important Notes

Requirements

PathML Python package with OpenSlide dependency for slide format support. Whole slide image files in supported scanner formats. Sufficient disk space for extracted tile datasets, particularly when working with high-resolution tiles across large slide cohorts.

Usage Recommendations

Do: inspect sample tiles visually after preprocessing to verify stain normalization and tissue detection quality; record the preprocessing parameters for every slide in a study so that results are reproducible; and use tissue detection to skip background tiles and reduce processing time.
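
One way to record preprocessing parameters is a small JSON manifest written alongside the tiles. This is a minimal sketch under assumed field names; `PreprocessingRecord` and `to_manifest` are hypothetical and not part of PathML.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable


@dataclass(frozen=True)
class PreprocessingRecord:
    slide: str
    tile_size: int
    level: int
    stain_method: str
    tissue_mask: str


def to_manifest(records: Iterable[PreprocessingRecord]) -> str:
    """Serialize records to JSON with sorted keys so diffs stay stable."""
    return json.dumps([asdict(r) for r in records], indent=2, sort_keys=True)


record = PreprocessingRecord(
    slide="slide.svs", tile_size=256, level=0,
    stain_method="macenko", tissue_mask="tissue")
manifest = to_manifest([record])
print(manifest)
```

Committing the manifest with the analysis code makes it possible to check later that every slide in a study went through the same settings.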

Don't: process slides at maximum resolution when a lower magnification suffices for the task, since this wastes compute and storage; apply stain normalization targets from one tissue type to a different tissue type without validation; or ignore slide quality issues such as tissue folds and pen marks, which affect analysis results.
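
Choosing a lower magnification usually means picking a coarser pyramid level. The sketch below assumes the per-level downsample factors are already known and sorted in ascending order (OpenSlide reports such factors for each level); `coarsest_level_within` is a hypothetical helper.

```python
from typing import Sequence


def coarsest_level_within(level_downsamples: Sequence[float],
                          max_downsample: float) -> int:
    """Index of the coarsest level whose downsample factor stays within the limit.

    Assumes level_downsamples is sorted ascending, with level 0 at full resolution.
    """
    best = 0
    for level, factor in enumerate(level_downsamples):
        if factor <= max_downsample:
            best = level
    return best


# Assumed pyramid: full resolution plus 4x, 16x and 64x downsampled levels.
level = coarsest_level_within([1.0, 4.0, 16.0, 64.0], max_downsample=8.0)
print(level)
```

Reading tiles from the chosen level instead of level 0 cuts the pixel count by roughly the square of the downsample factor.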

Limitations

Whole slide image processing requires significant compute time and disk space for large cohort studies. Stain normalization quality varies across tissue types and staining protocols. Tile-based analysis loses spatial context between neighboring tissue regions that may be diagnostically relevant.
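
Some spatial context can be recovered after tiling if each tile keeps its grid coordinate. The following is a minimal sketch, assuming tiles sit on a regular grid and features are stored in a dictionary keyed by (row, col) pixel offsets; `neighbor_features` is a hypothetical helper.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int]


def neighbor_features(features: Dict[Coord, List[float]], coord: Coord,
                      tile_size: int) -> List[List[float]]:
    """Collect features of the up-to-8 tiles adjacent to coord on a regular grid."""
    row, col = coord
    found = []
    for d_row in (-tile_size, 0, tile_size):
        for d_col in (-tile_size, 0, tile_size):
            if d_row == 0 and d_col == 0:
                continue  # skip the tile itself
            neighbor = (row + d_row, col + d_col)
            if neighbor in features:  # background tiles may be missing
                found.append(features[neighbor])
    return found


# Assumed toy grid of four 256-pixel tiles with one feature each.
grid = {(0, 0): [1.0], (0, 256): [2.0], (256, 0): [3.0], (256, 256): [4.0]}
around_origin = neighbor_features(grid, (0, 0), tile_size=256)
print(around_origin)
```

Neighbor features gathered this way can be pooled or concatenated with a tile's own features, which is one simple way to give a tile-level model a wider field of view.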