Instructor

Extract structured data from LLMs using automated Instructor workflows and Pydantic integrations

Instructor is a community skill for extracting structured data from language model outputs using the Instructor library, covering Pydantic model definition, response validation, retry logic, streaming extraction, and multi-field parsing for reliable LLM data pipelines.

What Is This?

Overview

Instructor provides tools for getting structured, validated outputs from language model API calls, using Pydantic models as response schemas. It covers Pydantic model definition, which specifies the expected output structure with typed fields, validators, and descriptions; response validation, which checks LLM outputs against the schema and returns typed Python objects; retry logic, which re-prompts the model with the validation error context when an output fails; streaming extraction, which processes partial structured outputs as they generate for responsive applications; and multi-field parsing, which extracts multiple related data points from a single model call. The skill enables developers to build reliable data extraction pipelines with type-safe LLM outputs.

Who Should Use This

This skill serves AI application developers building data extraction pipelines, backend engineers integrating LLM outputs into typed systems, and data teams automating information extraction from unstructured text.

Why Use It?

Problems It Solves

Raw LLM outputs require fragile string parsing to extract structured data, and that parsing breaks when the model varies its response format. JSON mode in provider APIs does not guarantee that the output matches a specific schema with required fields and correct types. Failed extractions without retry logic cause pipeline failures that require manual intervention. Type information is lost when LLM outputs are handled as plain strings rather than as validated objects.

Core Highlights

The schema definer uses Pydantic models to specify the expected output structure with field types and validators. The response parser validates LLM output against the schema, returning typed objects. The retry handler re-prompts with validation errors when output fails schema checks. The stream parser yields partially validated objects during generation, as in the sketch below.
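
As a minimal sketch of the streaming case, recent Instructor versions export a Partial wrapper that relaxes the schema so incomplete objects validate while tokens are still arriving; the model name and prompt here are illustrative:

import instructor
from openai import OpenAI
from pydantic import BaseModel

class Meeting(BaseModel):
    title: str
    attendees: list[str]

client = instructor.patch(OpenAI())

# Partial[...] lets incomplete objects pass validation mid-stream.
stream = client.chat.completions.create(
    model='gpt-4o',
    response_model=instructor.Partial[Meeting],
    stream=True,
    messages=[{'role': 'user', 'content': 'Extract the meeting details: ...'}],
)

for partial in stream:
    print(partial)  # fields fill in progressively as tokens arrive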

How to Use It?

Basic Usage

import instructor
from openai import OpenAI
from pydantic import BaseModel, Field

class ContactInfo(BaseModel):
    name: str = Field(description='Full name')
    email: str = Field(description='Email address')
    phone: str | None = Field(default=None, description='Phone number')
    company: str | None = Field(default=None, description='Company name')

client = instructor.patch(OpenAI())

def extract_contact(text: str) -> ContactInfo:
    return client.chat.completions.create(
        model='gpt-4o',
        response_model=ContactInfo,
        messages=[{'role': 'user', 'content': f'Extract contact info: {text}'}],
    )
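
Calling the helper returns a validated ContactInfo instance rather than a raw string; the sample text below is illustrative:

contact = extract_contact(
    'Reach Jane Doe at jane.doe@example.com; she works at Acme Corp.')
print(contact.name, contact.email, contact.company)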

Real-World Examples

from pydantic import BaseModel, Field, field_validator

class Product(BaseModel):
    name: str
    price: float = Field(gt=0)
    currency: str = Field(pattern=r'^[A-Z]{3}$')
    category: str

    @field_validator('currency')
    @classmethod
    def valid_currency(cls, v: str) -> str:
        allowed = {'USD', 'EUR', 'GBP', 'JPY'}
        if v not in allowed:
            raise ValueError(f'{v} not in {allowed}')
        return v

class Extraction:
    def __init__(self, client, max_retries: int = 3):
        self.client = client
        self.retries = max_retries

    def extract_products(self, texts: list[str]) -> list[Product]:
        results = []
        for text in texts:
            product = self.client.chat.completions.create(
                model='gpt-4o',
                response_model=Product,
                max_retries=self.retries,
                messages=[{'role': 'user', 'content': f'Extract: {text}'}],
            )
            results.append(product)
        return results
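
A quick usage sketch, continuing from the definitions above (client setup matches the basic example; the sample listing text is illustrative):

import instructor
from openai import OpenAI

client = instructor.patch(OpenAI())
pipeline = Extraction(client, max_retries=3)
products = pipeline.extract_products(
    ['Widget Pro smart sensor, $19.99 USD, home automation'])
print(products[0].price, products[0].currency)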

Advanced Tips

Use Pydantic field descriptions as implicit prompt engineering, since the model receives field names and descriptions as schema context. Add custom validators that encode business rules directly in the response model for domain-specific extraction accuracy. Use Instructor with Anthropic or other providers by patching the appropriate client class, as in the sketch below.
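
For example, recent Instructor versions expose from_anthropic for the Anthropic SDK; a minimal sketch, with an illustrative model name and prompt:

import instructor
from anthropic import Anthropic
from pydantic import BaseModel

class Sentiment(BaseModel):
    label: str
    confidence: float

client = instructor.from_anthropic(Anthropic())

result = client.messages.create(
    model='claude-3-5-sonnet-latest',  # illustrative model name
    max_tokens=1024,
    response_model=Sentiment,
    messages=[{'role': 'user', 'content': 'Classify: great product!'}],
)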

When to Use It?

Use Cases

Extract structured contact information from unstructured email text with validated fields. Build a product data extraction pipeline that outputs Pydantic objects with price and currency validation. Parse meeting notes into structured action items with assignees and deadlines.
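
For the meeting-notes case, the response model might look like this sketch (the field names and optional deadline handling are assumptions, not a fixed schema):

from datetime import date
from pydantic import BaseModel, Field

class ActionItem(BaseModel):
    task: str = Field(description='What needs to be done')
    assignee: str = Field(description='Person responsible')
    deadline: date | None = Field(
        default=None, description='Due date, if one was stated')

class MeetingNotes(BaseModel):
    action_items: list[ActionItem]

Passing MeetingNotes as the response_model extracts every action item in one call, which is the multi-field parsing case from the overview.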

Related Topics

Structured output, Pydantic, data extraction, LLM pipelines, response validation, type safety, and prompt engineering.

Important Notes

Requirements

The Instructor library installed alongside the target LLM provider's SDK. Pydantic v2 for model definition and validation. API access to a language model service.
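
For example, with the OpenAI provider (Pydantic v2 is typically pulled in as an Instructor dependency):

pip install instructor openai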

Usage Recommendations

Do: write descriptive field names and Pydantic descriptions that guide the model toward correct extraction. Set appropriate max_retries for production use to handle occasional validation failures. Use optional fields with defaults for information that may not be present in all inputs.

Don't: create deeply nested Pydantic models that exceed the model's ability to produce valid complex JSON in one pass. Rely on retries alone without improving the prompt when extraction consistently fails. Skip field validators that catch domain-specific errors the model may produce.

Limitations

Extraction accuracy depends on the underlying language model's capability and may vary across providers. Retry logic increases latency and cost when validation failures are frequent. Complex nested schemas with many required fields have higher failure rates that compound across extraction pipeline steps.