DSPy

Automate and integrate DSPy for building and optimizing language model pipelines

DSPy is a community skill for building optimized language model programs, covering declarative signature definition, module composition, automatic prompt optimization, evaluation metric design, and compiled pipeline deployment for structured LLM applications.

What Is This?

Overview

DSPy provides patterns for programming language models as composable modules rather than writing manual prompts. It covers: signature definition, which declares typed input and output fields for each LLM call; module composition, which chains multiple LM calls with data flow between them; automatic prompt optimization, which tunes instructions and few-shot examples using training data; evaluation metric design, which measures pipeline quality with task-specific scoring functions; and compiled pipeline deployment, which freezes optimized prompts for production use. The skill enables developers to build LLM applications that improve systematically through optimization rather than manual prompt engineering.
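As a minimal sketch of this declarative style (assuming a recent DSPy release; the model identifier is illustrative, not prescribed by this skill):

import dspy

# Point DSPy at a language model; the model string here is an assumption.
dspy.configure(lm=dspy.LM('openai/gpt-4o-mini'))

# An inline string signature: one input field -> one output field.
qa = dspy.Predict('question -> answer')
print(qa(question='What does DSPy optimize?').answer)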

Who Should Use This

This skill serves AI engineers building multi-step LLM pipelines, researchers experimenting with language model program optimization, and product developers creating LLM features that need consistent quality. It is particularly valuable for teams maintaining pipelines across multiple model versions or deployment environments.

Why Use It?

Problems It Solves

Manual prompt engineering is fragile and breaks when models are updated or inputs change. Multi-step LLM pipelines are difficult to optimize because each step affects downstream quality. Evaluating LLM output quality is ad hoc without structured metrics and test datasets. Prompt changes that improve one case often degrade performance on others without systematic optimization.

Core Highlights

Signature system declares typed inputs and outputs for each language model call. Module library provides composable building blocks like ChainOfThought and ReAct for common patterns. Optimizer engine tunes prompts and examples using labeled training data, reducing the need for manual iteration. Evaluator framework measures pipeline quality with custom metrics.
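For instance, the module library's building blocks can be instantiated directly from inline signatures. This is a sketch; the search function is a hypothetical user-supplied tool, and the ReAct tools parameter assumes a recent DSPy release:

import dspy

# ChainOfThought injects an intermediate reasoning step before the answer.
cot = dspy.ChainOfThought('question -> answer')

def search(query: str) -> str:
    """Hypothetical retrieval tool; replace with a real backend."""
    return 'stub result for ' + query

# ReAct interleaves reasoning with calls to the supplied tools.
agent = dspy.ReAct('question -> answer', tools=[search])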

How to Use It?

Basic Usage

import dspy

class QA(dspy.Signature):
    """Answer questions with citations."""
    context = dspy.InputField(desc='relevant passages')
    question = dspy.InputField()
    answer = dspy.OutputField(desc='cited answer')

class RAGModule(dspy.Module):
    def __init__(self):
        super().__init__()  # dspy.Module subclasses should call the base initializer
        self.retrieve = dspy.Retrieve(k=3)
        self.answer = dspy.ChainOfThought(QA)

    def forward(self, question: str):
        # Fetch passages, then answer with chain-of-thought over them.
        context = self.retrieve(question).passages
        return self.answer(context=context, question=question)
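Running the module requires configuring a language model and a retriever first. A hypothetical usage sketch; the ColBERTv2 endpoint URL is a placeholder and the model choice is an assumption:

import dspy

dspy.settings.configure(
    lm=dspy.LM('openai/gpt-4o-mini'),                            # assumed model
    rm=dspy.ColBERTv2(url='http://localhost:8893/api/search'),   # placeholder endpoint
)

rag = RAGModule()
print(rag(question='What is DSPy?').answer)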

Real-World Examples

from dspy.evaluate import Evaluate
from dspy.teleprompt import BootstrapFewShot

# Metric follows DSPy's (example, prediction, trace) convention.
def answer_metric(example, prediction, trace=None) -> float:
    gold = example.answer.lower()
    pred = prediction.answer.lower()
    return float(gold in pred)

trainset = [
    dspy.Example(
        question='What is DSPy?',
        answer='A framework for programming LMs',
    ).with_inputs('question'),
]

# max_bootstrapped_demos caps how many demonstrations are bootstrapped.
optimizer = BootstrapFewShot(
    metric=answer_metric,
    max_bootstrapped_demos=4,
)
compiled_rag = optimizer.compile(RAGModule(), trainset=trainset)

testset = trainset  # placeholder only; use held-out examples in practice
evaluator = Evaluate(devset=testset, metric=answer_metric)
score = evaluator(compiled_rag)
print(f'Score: {score}')
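In practice, build testset from held-out labeled examples rather than reusing training data, so the reported score reflects generalization. Evaluate also accepts a num_threads argument to parallelize scoring across the dev set.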

Advanced Tips

Start with a small labeled dataset of ten to twenty examples for initial optimization and expand as you identify failure modes. Use the assertion mechanism (dspy.Assert and dspy.Suggest) to enforce output constraints like format or length requirements within DSPy modules. Save compiled programs with the module's save method to freeze optimized prompts for reproducible production deployment. When debugging unexpected outputs, inspect intermediate module predictions directly to isolate which step in the pipeline is underperforming; the last two tips are sketched below.
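A sketch of the saving and debugging tips, with illustrative file paths:

# Freeze the optimized prompts and demos to disk.
compiled_rag.save('rag_compiled.json')

# Restore them into a fresh module instance for production.
loaded_rag = RAGModule()
loaded_rag.load('rag_compiled.json')

# Inspect the most recent LM call to isolate a misbehaving step.
dspy.inspect_history(n=1)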

When to Use It?

Use Cases

Build a retrieval-augmented generation pipeline with automatically optimized prompts and few-shot examples. Create a multi-step reasoning chain that improves quality through systematic optimization against labeled data. Evaluate and compare different LLM pipeline architectures using standardized metrics.

Related Topics

Language model programming, prompt optimization, DSPy, LLM pipelines, few-shot learning, and retrieval-augmented generation.

Important Notes

Requirements

DSPy library installed via pip. Language model API access through OpenAI, Anthropic, or local models. Labeled training examples for the target task to drive prompt optimization.
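Installation and backend configuration take a few lines. The model identifiers below are illustrative examples of LiteLLM-style strings, and older releases publish the package as dspy-ai:

# Shell: pip install dspy    (or: pip install dspy-ai on older releases)
import dspy

dspy.configure(lm=dspy.LM('openai/gpt-4o-mini'))
# dspy.configure(lm=dspy.LM('anthropic/claude-3-5-sonnet-20240620'))
# dspy.configure(lm=dspy.LM('ollama_chat/llama3'))  # local model via Ollama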

Usage Recommendations

Do: define clear evaluation metrics before starting optimization to measure improvement objectively. Use typed signatures with descriptive field annotations to give the optimizer clear constraints. Version compiled programs alongside application code for reproducibility.

Don't: skip the evaluation step and deploy optimized programs without measuring quality against a held-out test set. Over-optimize on a small training set, which can cause overfitting to specific examples. Mix manual prompt edits with DSPy optimization, which can create conflicts between the two approaches.

Limitations

Optimization quality depends on the size and representativeness of the training dataset. Compiled programs are tied to specific model versions and may need reoptimization when the underlying LLM is updated. Complex multi-step pipelines have large optimization search spaces that may require significant compute resources and API calls to explore effectively. The framework abstracts prompt details which can make debugging unexpected outputs more difficult.