Speculative Decoding
Draft-then-verify inference acceleration workflows for large language models
Speculative Decoding is a community skill for accelerating large language model inference using draft-then-verify techniques, covering draft model selection, token speculation, verification strategies, acceptance sampling, and latency optimization for faster text generation.
What Is This?
Overview
Speculative Decoding provides guidance on implementing inference acceleration techniques that use a smaller draft model to propose candidate tokens, which the target model then verifies in parallel. It covers draft model selection (choosing efficient models that match the target's vocabulary and distribution), token speculation (generating multiple candidate continuations cheaply with the draft model), verification (batch-validating all speculated tokens against the target model's output distribution in a single forward pass), acceptance sampling (deciding which draft tokens to keep based on probability ratios), and latency optimization (tuning speculation length and batch sizes for throughput). The skill helps engineers reduce inference latency without sacrificing output quality.
Who Should Use This
This skill serves ML engineers optimizing LLM serving latency, inference platform teams building low-latency APIs, and researchers exploring efficient decoding methods for large language models.
Why Use It?
Problems It Solves
Autoregressive decoding generates one token at a time, creating high latency for long sequences. Large models have slow per-token generation due to memory bandwidth bottlenecks. Naive batching increases throughput but does not reduce individual request latency. Users experience noticeable delays when interacting with large models in real-time applications.
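The memory-bandwidth bottleneck mentioned above can be made concrete with back-of-the-envelope arithmetic: in single-stream decoding, every generated token must stream all model weights through the GPU once, so bandwidth divided by weight size gives a rough decode ceiling. The parameter count, precision, and bandwidth figures below are illustrative assumptions, not measurements.

```python
# Rough upper bound on single-stream decode speed for a
# memory-bandwidth-bound model (illustrative numbers).
params = 7e9          # assumed 7B-parameter target model
bytes_per_param = 2   # fp16 weights
bandwidth = 1.0e12    # assumed ~1 TB/s HBM bandwidth

weight_bytes = params * bytes_per_param        # bytes read per token
max_tokens_per_s = bandwidth / weight_bytes    # decode ceiling

print(f'weights: {weight_bytes / 1e9:.0f} GB')
print(f'decode ceiling: ~{max_tokens_per_s:.0f} tokens/s')
```

Because the ceiling is set by weight traffic rather than arithmetic, verifying several speculated tokens in one target forward pass amortizes that traffic across multiple output tokens, which is the source of the speedup.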
Core Highlights
Draft selector matches small models to target model distributions. Token speculator generates candidate continuations efficiently. Verification engine validates speculated tokens in parallel. Acceptance sampler preserves output quality while accepting valid draft tokens.
How to Use It?
Basic Usage
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

class SpecDecoder:
    def __init__(self, target_name: str, draft_name: str, k: int = 5):
        self.target = AutoModelForCausalLM.from_pretrained(target_name)
        self.draft = AutoModelForCausalLM.from_pretrained(draft_name)
        self.tokenizer = AutoTokenizer.from_pretrained(target_name)
        self.k = k  # speculation length: draft tokens proposed per cycle

    def speculate(self, input_ids):
        """Greedily draft k candidate tokens with the small model."""
        draft_ids = input_ids
        tokens = []
        for _ in range(self.k):
            with torch.no_grad():
                out = self.draft(draft_ids)
            next_id = out.logits[:, -1].argmax(-1, keepdim=True)
            tokens.append(next_id)
            draft_ids = torch.cat([draft_ids, next_id], dim=-1)
        return torch.cat(tokens, dim=-1)

    def verify(self, input_ids, draft_tokens):
        """Score prompt plus drafts with the target model in one pass."""
        combined = torch.cat([input_ids, draft_tokens], dim=-1)
        with torch.no_grad():
            out = self.target(combined)
        return out.logits

# Example model IDs; substitute any pair sharing a tokenizer vocabulary.
dec = SpecDecoder('meta-llama/Llama-7b', 'TinyLlama/1.1B', k=5)
ids = dec.tokenizer('Explain AI', return_tensors='pt')['input_ids']
draft = dec.speculate(ids)
logits = dec.verify(ids, draft)
print(f'Speculated: {draft.shape[-1]} tokens')

Real-World Examples
import torch
import torch.nn.functional as F

class AcceptanceSampler:
    def __init__(self, temperature=1.0):
        self.temp = temperature

    def compute_probs(self, logits):
        return F.softmax(logits / self.temp, dim=-1)

    def accept_tokens(self, draft_logits, target_logits, draft_tokens):
        """Accept draft tokens left to right; stop at the first rejection.

        Assumes draft_logits and target_logits are aligned so that
        position i scores draft token i.
        """
        n = draft_tokens.size(-1)
        accepted = []
        for i in range(n):
            p_draft = self.compute_probs(draft_logits[:, i])
            p_target = self.compute_probs(target_logits[:, i])
            token = draft_tokens[:, i]
            # Accept with probability min(1, p_target / p_draft).
            ratio = p_target[0, token] / p_draft[0, token].clamp(min=1e-8)
            if torch.rand(1) < ratio.clamp(max=1.0):
                accepted.append(token)
            else:
                break
        return accepted

    def stats(self, proposed: int, accepted: int) -> dict:
        return {
            'proposed': proposed,
            'accepted': accepted,
            'rate': accepted / max(proposed, 1),
        }

sampler = AcceptanceSampler(temperature=0.8)
result = sampler.stats(proposed=5, accepted=3)
print(f'Acceptance rate: {result["rate"]:.0%}')

Advanced Tips
Choose draft models that closely match the target distribution to maximize acceptance rates. Tune speculation length based on task complexity since creative tasks accept fewer tokens than factual completions. Use tree-based speculation to explore multiple candidate paths simultaneously for higher acceptance.
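Tuning speculation length to the task can also be done online. The sketch below is a hypothetical heuristic, not part of the skill's API: it grows k when the target model accepts most drafts and shrinks it when rejections dominate, using the same acceptance-rate statistic computed by the sampler above.

```python
class SpecLengthTuner:
    """Adapt speculation length k from a running acceptance rate.

    Hypothetical heuristic: lengthen speculation when drafts are
    mostly accepted, shorten it when they are mostly rejected.
    Thresholds are illustrative assumptions.
    """

    def __init__(self, k: int = 5, k_min: int = 1, k_max: int = 10):
        self.k, self.k_min, self.k_max = k, k_min, k_max

    def update(self, proposed: int, accepted: int) -> int:
        rate = accepted / max(proposed, 1)
        if rate > 0.8 and self.k < self.k_max:
            self.k += 1      # drafts are cheap and reliable: draft more
        elif rate < 0.4 and self.k > self.k_min:
            self.k -= 1      # wasted draft work: draft less
        return self.k

tuner = SpecLengthTuner(k=5)
k = tuner.update(proposed=5, accepted=5)   # high acceptance: k grows
print(f'next speculation length: {k}')
```

Calling `tuner.update()` once per draft-verify cycle lets factual completions drift toward longer speculation while creative prompts settle at shorter lengths.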
When to Use It?
Use Cases
Reduce latency for a chatbot serving a large language model by speculating with a smaller draft model. Accelerate code completion where high acceptance rates make speculation effective. Speed up batch translation by verifying multiple speculated tokens per target model forward pass.
Related Topics
Language model inference, token generation, draft models, acceptance sampling, latency optimization, and model serving.
Important Notes
Requirements
A target language model and a compatible smaller draft model sharing the same tokenizer vocabulary. GPU with sufficient VRAM to load both models simultaneously for inference. PyTorch or equivalent framework supporting efficient batched forward passes.
Usage Recommendations
Do: measure acceptance rates across different speculation lengths to find the optimal value for each use case. Benchmark end-to-end latency including both draft and verification passes. Use KV cache sharing between draft and target models when architectures allow.
Don't: assume fixed speculation lengths work across all prompts, since the optimal length varies by task. Don't pair draft and target models with different tokenizers, since vocabulary mismatch prevents valid speculation. Don't ignore the memory overhead of running two models on the same device.
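The benchmarking recommendation above can be sketched as a small wall-clock harness. `generate_fn` here stands in for any end-to-end generation callable (draft plus verification passes included); the name and interface are illustrative, not part of the skill.

```python
import time

def bench(generate_fn, prompt: str, runs: int = 5) -> float:
    """Time a generation callable end to end and return the best run.

    Taking the minimum over several runs reduces noise from caches
    warming up and background load.
    """
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        generate_fn(prompt)
        times.append(time.perf_counter() - start)
    return min(times)

# Usage sketch with a trivial stand-in for a real generate function:
latency = bench(lambda p: p.upper(), 'Explain AI', runs=3)
print(f'best-of-3 latency: {latency * 1e6:.1f} us')
```

Running the same harness over a grid of speculation lengths gives the acceptance-rate-versus-latency data the Do list calls for.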
Limitations
Acceptance rates drop for creative text where draft and target distributions diverge. Running two models requires more GPU memory than single-model inference. Speedup depends on draft model quality and may not improve latency if acceptance rates are consistently low.
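The last limitation can be quantified with the standard expected-progress formula from the speculative sampling literature: under the simplifying assumption of a constant per-token acceptance rate α and speculation length k, each draft-verify cycle yields on average (1 − α^(k+1)) / (1 − α) tokens from a single target forward pass.

```python
def expected_tokens_per_cycle(alpha: float, k: int) -> float:
    """Expected tokens produced per target forward pass.

    Assumes each of the k draft tokens is accepted independently
    with probability alpha; the geometric series
    1 + alpha + ... + alpha**k counts accepted drafts plus the
    one token the target pass always yields.
    """
    if alpha >= 1.0:
        return k + 1.0
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# High acceptance makes speculation pay off; low acceptance
# leaves progress barely above one token per pass.
print(expected_tokens_per_cycle(0.9, 5))
print(expected_tokens_per_cycle(0.3, 5))
```

At α = 0.9 a cycle advances almost five tokens, while at α = 0.3 it advances under one and a half, so the extra draft-model compute and memory may not be worth it for low-acceptance workloads.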