Scikit Survival

Automate and integrate Scikit Survival for advanced survival analysis and modeling

Source: K-Dense-AI/claude-scientific-skills

Scikit-survival is a community skill for survival analysis using the scikit-survival Python library, covering time-to-event modeling, Kaplan-Meier estimation, Cox regression, random survival forests, and concordance evaluation for censored data analysis.

What Is This?

Overview

Scikit-survival provides tools for analyzing censored time-to-event data through a scikit-learn compatible interface. It covers time-to-event modeling, which predicts from feature data when events such as death, failure, or churn will occur; Kaplan-Meier estimation, which computes non-parametric survival curves from observed event times; Cox regression, which fits proportional hazards models relating features to event risk; random survival forests, which train ensemble models for survival prediction without assuming proportional hazards; and concordance evaluation, which measures model discrimination with the concordance index. The skill helps analysts model censored data, where observation of some subjects ends before the event of interest occurs.

Who Should Use This

This skill serves biostatisticians analyzing clinical trial survival data, data scientists modeling customer churn and retention, and reliability engineers predicting equipment failure times. It is also useful for actuaries estimating policyholder risk and researchers studying time-to-event outcomes in social science studies.

Why Use It?

Problems It Solves

Standard classification and regression methods cannot handle censored observations, where the event has not yet occurred at the time of data collection. Estimating survival probabilities requires specialized methods that account for varying observation periods. Comparing treatment effects on survival requires statistical tests designed for censored time-to-event data, such as the log-rank test. Evaluating survival models requires concordance metrics, because standard accuracy measures ignore censoring and give misleading performance estimates.

Core Highlights

The survival estimator fits Kaplan-Meier curves from censored event data. The Cox modeler relates features to hazard rates under the proportional hazards assumption. The forest predictor builds ensemble survival models without parametric assumptions. The concordance scorer evaluates discrimination on censored outcomes.

How to Use It?

Basic Usage

import numpy as np
from sksurv.linear_model import CoxPHSurvivalAnalysis
from sksurv.metrics import concordance_index_censored

# Simulated data: features, event times, and event indicators are drawn
# independently, so the in-sample C-index should land only slightly above 0.5.
n = 200
rng = np.random.default_rng(42)
X = rng.standard_normal((n, 5))
time = rng.exponential(10, n)
event = rng.choice([True, False], n, p=[0.7, 0.3])

# scikit-survival expects the outcome as a structured array
# with a boolean event field and a float time field.
y = np.array(
    [(e, t) for e, t in zip(event, time)],
    dtype=[('event', bool), ('time', float)])

cox = CoxPHSurvivalAnalysis()
cox.fit(X, y)

# predict() returns risk scores; higher values mean higher predicted risk.
pred = cox.predict(X)
ci = concordance_index_censored(y['event'], y['time'], pred)
print(f'C-index: {ci[0]:.3f}')

Real-World Examples

import numpy as np
from sklearn.model_selection import train_test_split
from sksurv.ensemble import RandomSurvivalForest
from sksurv.metrics import concordance_index_censored


class SurvivalPipeline:
    """Fit a random survival forest and report train/test concordance."""

    def __init__(self, n_estimators: int = 100):
        self.model = RandomSurvivalForest(
            n_estimators=n_estimators, random_state=42)

    def fit_eval(self, X: np.ndarray, y: np.ndarray) -> dict:
        # Hold out 20% of the samples for evaluation.
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=42)
        self.model.fit(X_tr, y_tr)

        # Risk scores: higher values mean higher predicted risk.
        pred_tr = self.model.predict(X_tr)
        pred_te = self.model.predict(X_te)

        ci_tr = concordance_index_censored(
            y_tr['event'], y_tr['time'], pred_tr)[0]
        ci_te = concordance_index_censored(
            y_te['event'], y_te['time'], pred_te)[0]
        return {'train_ci': ci_tr, 'test_ci': ci_te}


# X and y are the arrays built in the Basic Usage example.
pipe = SurvivalPipeline()
results = pipe.fit_eval(X, y)
print(f'Train: {results["train_ci"]:.3f}')
print(f'Test: {results["test_ci"]:.3f}')

Advanced Tips

Use RandomSurvivalForest when the proportional hazards assumption of Cox regression may not hold, for example when hazard ratios change over time. When predictive performance is expected to differ across time horizons, complement the concordance index with time-dependent metrics such as the cumulative/dynamic AUC (cumulative_dynamic_auc in sksurv.metrics). Combine scikit-survival estimators with scikit-learn pipelines so that preprocessing and survival modeling are fitted together. With high-dimensional data, apply variance thresholding or regularized Cox regression (for example CoxnetSurvivalAnalysis) to reduce noise before fitting ensemble models.

When to Use It?

Use Cases

Fit a Cox proportional hazards model to clinical trial data with censored outcomes. Train a random survival forest for customer churn prediction with time-to-event targets. Compare survival models using concordance index on held-out test data.

Important Notes

Requirements

The scikit-survival Python package, which depends on scikit-learn and numpy. Survival outcomes as structured arrays with an event indicator field and a time field. Feature data without missing values, or with imputation applied beforehand.

Usage Recommendations

Do: use structured numpy arrays with named event and time fields as required by the scikit-survival API. Check the proportional hazards assumption before using Cox regression. Report the concordance index with confidence intervals for model evaluation.

Don't: treat censored observations as non-events, since this biases survival estimates. Don't use standard classification metrics like accuracy to evaluate survival models. Don't ignore the proportion of censored observations, since heavy censoring reduces the reliability of model evaluation.

Limitations

Scikit-survival requires survival outcomes as a NumPy structured array rather than a plain array or pandas DataFrame columns, which can require additional data preparation steps. Support for some advanced survival methods, such as competing risks, is limited and depends on the installed version. Large datasets with many features may require feature selection before fitting survival models.
