SkyPilot

Automate and integrate SkyPilot for running AI and cloud workloads across providers

SkyPilot is a community skill for running workloads across cloud providers using the SkyPilot framework, covering multi-cloud job submission, spot instance management, cost optimization, cluster autoscaling, and reproducible environment setup for ML training and batch processing.

What Is This?

Overview

SkyPilot provides tools for launching and managing compute workloads across multiple cloud providers. It covers multi-cloud job submission (running tasks on AWS, GCP, Azure, or Lambda Cloud through a unified interface), spot instance management (using preemptible instances with automatic recovery from interruptions), cost optimization (selecting the cheapest available cloud and instance type for each job), cluster autoscaling (adjusting compute resources to workload demand), and reproducible environments (declaring dependencies and setup in YAML configs). The skill helps teams run cloud workloads efficiently without maintaining provider-specific tooling for each platform.

Who Should Use This

This skill serves ML engineers running training jobs across cloud providers, research teams needing flexible GPU access, and platform engineers building cost-efficient compute infrastructure. It is particularly valuable for teams that frequently switch between providers based on availability or pricing.

Why Use It?

Problems It Solves

Cloud provider lock-in prevents teams from using the cheapest available resources. Managing spot instance interruptions and recovery manually is complex and error-prone. Comparing costs across cloud providers and instance types requires constant monitoring. Reproducing compute environments across different clouds demands provider-specific configuration, increasing maintenance overhead significantly.

Core Highlights

Multi-cloud launcher runs jobs on any supported provider. Spot manager handles preemptible instances with automatic recovery. Cost optimizer selects the cheapest cloud and instance type. Environment builder creates reproducible setups from YAML configs.
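The cost-optimizer idea can be illustrated with a toy selection over a static price table. All prices and instance names below are hypothetical; SkyPilot performs this selection internally against live provider catalogs:

```python
# Hypothetical hourly prices for a single-A100 instance per provider.
PRICES = {
    ('aws', 'p4-style-instance'): 5.12,
    ('gcp', 'a2-style-instance'): 3.67,
    ('azure', 'nc-a100-style-instance'): 4.10,
}

def cheapest(prices: dict) -> tuple:
    # Choose the (cloud, instance) pair with the lowest hourly price.
    return min(prices, key=prices.get)

cloud, instance = cheapest(PRICES)
print(f'{cloud}: {instance}')
```

In practice the same decision also weighs spot discounts, regional availability, and quota, but the core selection step is this simple minimization.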

How to Use It?

Basic Usage

name: train-model

resources:
  accelerators: A100:1
  use_spot: true
  cloud: aws

setup: |
  pip install torch
  pip install transformers
  pip install datasets

run: |
  python train.py \
    --model bert-base \
    --epochs 10 \
    --output /output

file_mounts:
  /data: ./data
  /output:
    name: my-bucket
    store: s3
    mode: MOUNT

Real-World Examples

import json
import subprocess
from typing import Optional


class SkyManager:
    """Thin wrapper around the SkyPilot CLI."""

    def launch(self, config: str, cluster: Optional[str] = None) -> dict:
        # Launch a task YAML; -y skips the confirmation prompt.
        cmd = ['sky', 'launch', config, '-y']
        if cluster:
            cmd.extend(['-c', cluster])
        result = subprocess.run(cmd, capture_output=True, text=True)
        return {'success': result.returncode == 0, 'output': result.stdout}

    def status(self) -> list:
        # List clusters as JSON (assumes a CLI version that
        # supports the --format json flag).
        result = subprocess.run(
            ['sky', 'status', '--format', 'json'],
            capture_output=True, text=True)
        return json.loads(result.stdout)

    def cost_report(self) -> dict:
        # Aggregate per-cluster costs into a single report.
        clusters = self.status()
        total = sum(c.get('cost', 0) for c in clusters)
        return {
            'clusters': len(clusters),
            'total_cost': total,
            'details': [
                {'name': c['name'], 'cost': c.get('cost', 0)}
                for c in clusters],
        }

    def down(self, cluster: str) -> bool:
        # Tear down a cluster; returns True on success.
        result = subprocess.run(
            ['sky', 'down', cluster, '-y'], capture_output=True)
        return result.returncode == 0


mgr = SkyManager()
mgr.launch('train.yaml', 'my-cluster')
report = mgr.cost_report()
print(f'Total: ${report["total_cost"]:.2f}')

Advanced Tips

Use spot instances with SkyPilot's automatic recovery to reduce training costs by up to 70 percent. Set cloud-agnostic resource requirements and let SkyPilot select the cheapest provider automatically. Mount cloud storage buckets as local paths to simplify data access across providers. Enable checkpointing in your training scripts so that spot interruptions resume from the last saved state rather than restarting from scratch.
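The checkpointing tip can be sketched in a provider-agnostic way. This is a minimal illustration: the checkpoint path and epoch count are hypothetical, and a real training script would also save model weights alongside the epoch counter:

```python
import json
import os

# Hypothetical checkpoint location; /output is the mounted bucket from the
# task YAML, with a local fallback so this sketch runs anywhere.
CKPT = os.path.join('/output' if os.path.isdir('/output') else '.', 'ckpt.json')

def load_start_epoch(path: str = CKPT) -> int:
    # After a spot preemption, resume from the epoch after the last saved one.
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)['epoch'] + 1
    return 0

def save_checkpoint(epoch: int, path: str = CKPT) -> None:
    with open(path, 'w') as f:
        json.dump({'epoch': epoch}, f)

for epoch in range(load_start_epoch(), 10):
    # ... one epoch of training goes here ...
    save_checkpoint(epoch)
```

Because the checkpoint lives in the mounted bucket rather than on the instance's local disk, it survives the instance being reclaimed, and the recovered job picks up where the preempted one stopped.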

When to Use It?

Use Cases

Launch a GPU training job on the cheapest available cloud provider with automatic spot recovery. Run a batch processing pipeline across multiple clouds to avoid capacity constraints. Set up reproducible training environments that work identically on AWS, GCP, or Azure.

Related Topics

Cloud computing, multi-cloud, spot instances, ML training, cost optimization, GPU computing, and infrastructure automation.

Important Notes

Requirements

SkyPilot CLI installed with cloud provider credentials configured for each target platform. Active accounts on target cloud providers with sufficient compute and GPU quotas approved. YAML task configuration files defining resource requirements, setup commands, and run instructions.

Usage Recommendations

Do: use spot instances for fault-tolerant workloads to minimize costs. Define resource requirements generically to enable cross-cloud optimization. Tear down clusters after job completion to avoid idle compute charges.

Don't: run long-running services on spot instances without checkpointing since interruptions cause progress loss. Ignore cloud provider quotas since jobs will fail if instance limits are exceeded. Leave clusters running after experiments since idle GPUs incur significant hourly costs.
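The teardown recommendation can be enforced with a try/finally pattern so clusters come down even when a launch fails. This is a sketch; the config and cluster names are placeholders:

```python
import subprocess

def run_and_teardown(config: str, cluster: str) -> None:
    # Tear the cluster down even if the launch fails, so no idle
    # compute is left billing after the job.
    try:
        subprocess.run(['sky', 'launch', config, '-c', cluster, '-y'],
                       check=True)
    finally:
        subprocess.run(['sky', 'down', cluster, '-y'], check=True)

# run_and_teardown('train.yaml', 'train-cluster')  # uncomment to execute
```

Wrapping the launch this way turns the "tear down after completion" habit into a guarantee, which matters most for scheduled or unattended jobs where nobody is watching the console.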

Limitations

Cross-cloud portability is limited to features supported by all target providers. Spot instance availability varies by region and instance type with no availability guarantees. SkyPilot adds an abstraction layer that may not expose all provider-specific configuration options, which can be a constraint for workloads requiring fine-grained networking or storage settings.