Speculative Decoding
Draft-then-verify inference acceleration workflows for large language models
Speculative Decoding is a community skill for accelerating large language model inference using draft-then-verify techniques, covering draft model selection, token speculation, verification strategies, acceptance sampling, and latency optimization for faster text generation.
What Is This?
Overview
Speculative Decoding provides guidance on implementing inference acceleration techniques that use a smaller draft model to propose candidate tokens, which the target model then verifies in parallel. It covers draft model selection (choosing efficient models that match the target's vocabulary and distribution), token speculation (generating multiple candidate continuations cheaply with the draft model), verification (batch-validating all speculated tokens against the target model's output distribution in a single forward pass), acceptance sampling (deciding which draft tokens to keep based on probability ratios), and latency optimization (tuning speculation length and batch sizes for throughput). The skill helps engineers reduce inference latency without sacrificing output quality.
Who Should Use This
This skill serves ML engineers optimizing LLM serving latency, inference platform teams building low-latency APIs, and researchers exploring efficient decoding methods for large language models.
Why Use It?
Problems It Solves
Autoregressive decoding generates one token at a time, creating high latency for long sequences. Large models have slow per-token generation due to memory bandwidth bottlenecks. Naive batching increases throughput but does not reduce individual request latency. Users experience noticeable delays when interacting with large models in real-time applications.
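The memory-bandwidth bottleneck mentioned above can be made concrete with back-of-the-envelope arithmetic: in single-stream decoding, every generated token must stream all model weights through the GPU once, so bandwidth divided by weight size gives a rough decode ceiling. The parameter count, precision, and bandwidth figures below are illustrative assumptions, not measurements.

```python
# Rough upper bound on single-stream decode speed for a
# memory-bandwidth-bound model (illustrative numbers).
params = 7e9          # assumed 7B-parameter target model
bytes_per_param = 2   # fp16 weights
bandwidth = 1.0e12    # assumed ~1 TB/s HBM bandwidth

weight_bytes = params * bytes_per_param        # bytes read per token
max_tokens_per_s = bandwidth / weight_bytes    # decode ceiling

print(f'weights: {weight_bytes / 1e9:.0f} GB')
print(f'decode ceiling: ~{max_tokens_per_s:.0f} tokens/s')
```

Because the ceiling is set by weight traffic rather than arithmetic, verifying several speculated tokens in one target forward pass amortizes that traffic across multiple output tokens, which is the source of the speedup.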
Core Highlights
Draft selector matches small models to target model distributions. Token speculator generates candidate continuations efficiently. Verification engine validates speculated tokens in parallel. Acceptance sampler preserves output quality while accepting valid draft tokens.
How to Use It?
Basic Usage
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

class SpecDecoder:
    def __init__(self, target_name: str, draft_name: str, k: int = 5):
        self.target = AutoModelForCausalLM.from_pretrained(target_name)
        self.draft = AutoModelForCausalLM.from_pretrained(draft_name)
        self.tokenizer = AutoTokenizer.from_pretrained(target_name)
        self.k = k  # speculation length: draft tokens proposed per cycle

    def speculate(self, input_ids):
        """Greedily draft k candidate tokens with the small model."""
        draft_ids = input_ids
        tokens = []
        for _ in range(self.k):
            with torch.no_grad():
                out = self.draft(draft_ids)
            next_id = out.logits[:, -1].argmax(-1, keepdim=True)
            tokens.append(next_id)
            draft_ids = torch.cat([draft_ids, next_id], dim=-1)
        return torch.cat(tokens, dim=-1)

    def verify(self, input_ids, draft_tokens):
        """Score prompt plus drafts with the target model in one pass."""
        combined = torch.cat([input_ids, draft_tokens], dim=-1)
        with torch.no_grad():
            out = self.target(combined)
        return out.logits

# Example model IDs; substitute any pair sharing a tokenizer vocabulary.
dec = SpecDecoder('meta-llama/Llama-7b', 'TinyLlama/1.1B', k=5)
ids = dec.tokenizer('Explain AI', return_tensors='pt')['input_ids']
draft = dec.speculate(ids)
logits = dec.verify(ids, draft)
print(f'Speculated: {draft.shape[-1]} tokens')

Real-World Examples
import torch
import torch.nn.functional as F

class AcceptanceSampler:
    def __init__(self, temperature=1.0):
        self.temp = temperature

    def compute_probs(self, logits):
        return F.softmax(logits / self.temp, dim=-1)

    def accept_tokens(self, draft_logits, target_logits, draft_tokens):
        """Accept draft tokens left to right; stop at the first rejection.

        Assumes draft_logits and target_logits are aligned so that
        position i scores draft token i.
        """
        n = draft_tokens.size(-1)
        accepted = []
        for i in range(n):
            p_draft = self.compute_probs(draft_logits[:, i])
            p_target = self.compute_probs(target_logits[:, i])
            token = draft_tokens[:, i]
            # Accept with probability min(1, p_target / p_draft).
            ratio = p_target[0, token] / p_draft[0, token].clamp(min=1e-8)
            if torch.rand(1) < ratio.clamp(max=1.0):
                accepted.append(token)
            else:
                break
        return accepted

    def stats(self, proposed: int, accepted: int) -> dict:
        return {
            'proposed': proposed,
            'accepted': accepted,
            'rate': accepted / max(proposed, 1),
        }

sampler = AcceptanceSampler(temperature=0.8)
result = sampler.stats(proposed=5, accepted=3)
print(f'Acceptance rate: {result["rate"]:.0%}')

Advanced Tips
Choose draft models that closely match the target distribution to maximize acceptance rates. Tune speculation length based on task complexity since creative tasks accept fewer tokens than factual completions. Use tree-based speculation to explore multiple candidate paths simultaneously for higher acceptance.
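Tuning speculation length to the task can also be done online. The sketch below is a hypothetical heuristic, not part of the skill's API: it grows k when the target model accepts most drafts and shrinks it when rejections dominate, using the same acceptance-rate statistic computed by the sampler above.

```python
class SpecLengthTuner:
    """Adapt speculation length k from a running acceptance rate.

    Hypothetical heuristic: lengthen speculation when drafts are
    mostly accepted, shorten it when they are mostly rejected.
    Thresholds are illustrative assumptions.
    """

    def __init__(self, k: int = 5, k_min: int = 1, k_max: int = 10):
        self.k, self.k_min, self.k_max = k, k_min, k_max

    def update(self, proposed: int, accepted: int) -> int:
        rate = accepted / max(proposed, 1)
        if rate > 0.8 and self.k < self.k_max:
            self.k += 1      # drafts are cheap and reliable: draft more
        elif rate < 0.4 and self.k > self.k_min:
            self.k -= 1      # wasted draft work: draft less
        return self.k

tuner = SpecLengthTuner(k=5)
k = tuner.update(proposed=5, accepted=5)   # high acceptance: k grows
print(f'next speculation length: {k}')
```

Calling `tuner.update()` once per draft-verify cycle lets factual completions drift toward longer speculation while creative prompts settle at shorter lengths.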
When to Use It?
Use Cases
Reduce latency for a chatbot serving a large language model by speculating with a smaller draft model. Accelerate code completion where high acceptance rates make speculation effective. Speed up batch translation by verifying multiple speculated tokens per target model forward pass.
Related Topics
Language model inference, token generation, draft models, acceptance sampling, latency optimization, and model serving.
Important Notes
Requirements
A target language model and a compatible smaller draft model sharing the same tokenizer vocabulary. GPU with sufficient VRAM to load both models simultaneously for inference. PyTorch or equivalent framework supporting efficient batched forward passes.
Usage Recommendations
Do: measure acceptance rates across different speculation lengths to find the optimal value for each use case. Benchmark end-to-end latency including both draft and verification passes. Use KV cache sharing between draft and target models when architectures allow.
Don't: assume fixed speculation lengths work across all prompts, since the optimal length varies by task. Don't pair draft and target models with different tokenizers, since vocabulary mismatch prevents valid speculation. Don't ignore the memory overhead of running two models on the same device.
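The benchmarking recommendation above can be sketched as a small wall-clock harness. `generate_fn` here stands in for any end-to-end generation callable (draft plus verification passes included); the name and interface are illustrative, not part of the skill.

```python
import time

def bench(generate_fn, prompt: str, runs: int = 5) -> float:
    """Time a generation callable end to end and return the best run.

    Taking the minimum over several runs reduces noise from caches
    warming up and background load.
    """
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        generate_fn(prompt)
        times.append(time.perf_counter() - start)
    return min(times)

# Usage sketch with a trivial stand-in for a real generate function:
latency = bench(lambda p: p.upper(), 'Explain AI', runs=3)
print(f'best-of-3 latency: {latency * 1e6:.1f} us')
```

Running the same harness over a grid of speculation lengths gives the acceptance-rate-versus-latency data the Do list calls for.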
Limitations
Acceptance rates drop for creative text where draft and target distributions diverge. Running two models requires more GPU memory than single-model inference. Speedup depends on draft model quality and may not improve latency if acceptance rates are consistently low.
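The last limitation can be quantified with the standard expected-progress formula from the speculative sampling literature: under the simplifying assumption of a constant per-token acceptance rate α and speculation length k, each draft-verify cycle yields on average (1 − α^(k+1)) / (1 − α) tokens from a single target forward pass.

```python
def expected_tokens_per_cycle(alpha: float, k: int) -> float:
    """Expected tokens produced per target forward pass.

    Assumes each of the k draft tokens is accepted independently
    with probability alpha; the geometric series
    1 + alpha + ... + alpha**k counts accepted drafts plus the
    one token the target pass always yields.
    """
    if alpha >= 1.0:
        return k + 1.0
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# High acceptance makes speculation pay off; low acceptance
# leaves progress barely above one token per pass.
print(expected_tokens_per_cycle(0.9, 5))
print(expected_tokens_per_cycle(0.3, 5))
```

At α = 0.9 a cycle advances almost five tokens, while at α = 0.3 it advances under one and a half, so the extra draft-model compute and memory may not be worth it for low-acceptance workloads.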