Instructor
Extract structured data from LLMs using automated Instructor workflows and Pydantic integrations
Instructor is a community skill for extracting structured data from language model outputs using the Instructor library. It covers Pydantic model definition, response validation, retry logic, streaming extraction, and multi-field parsing for reliable LLM data pipelines.
What Is This?
Overview
Instructor provides tools for getting structured, validated outputs from language model API calls, using Pydantic models as response schemas. It covers: Pydantic model definition, which specifies the expected output structure with typed fields, validators, and descriptions; response validation, which checks LLM outputs against the schema and returns typed Python objects; retry logic, which re-prompts the model with the validation error context when an output fails schema checks; streaming extraction, which processes partial structured outputs as they generate, for responsive applications; and multi-field parsing, which extracts multiple related data points from a single model call. The skill lets developers build reliable data extraction pipelines with type-safe LLM outputs.
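The core validation step can be illustrated with plain Pydantic, with no API call involved. The sketch below uses a hypothetical Invoice model standing in for any response schema; Instructor performs this kind of validation on the raw model output and hands back a typed object instead of a string:

```python
from pydantic import BaseModel, Field

# Hypothetical schema standing in for an LLM response model.
class Invoice(BaseModel):
    invoice_id: str = Field(description='Vendor invoice number')
    total: float = Field(gt=0, description='Invoice total in dollars')

# A raw JSON string such as a language model might return.
raw = '{"invoice_id": "INV-42", "total": 199.5}'

# Validation turns the string into a typed Python object;
# a schema violation would raise a ValidationError instead.
invoice = Invoice.model_validate_json(raw)
print(invoice.total)
```

Once validated, `invoice.total` is a real float with the `gt=0` constraint already enforced, so downstream code never touches unparsed text.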
Who Should Use This
This skill serves AI application developers building data extraction pipelines, backend engineers integrating LLM outputs into typed systems, and data teams automating information extraction from unstructured text.
Why Use It?
Problems It Solves
Raw LLM outputs require fragile string parsing to extract structured data, and that parsing breaks when the model varies its response format. JSON mode in APIs does not guarantee the output matches a specific schema with required fields and correct types. Failed extractions without retry logic cause pipeline failures that require manual intervention. Type information is lost when LLM outputs are handled as plain strings rather than validated objects.
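The JSON-mode gap is easy to demonstrate: a payload can be syntactically valid JSON yet still violate the schema. The hypothetical Order model below shows Pydantic catching both a wrong type and a missing required field that JSON mode alone would let through:

```python
from pydantic import BaseModel, ValidationError

# Illustrative schema, not part of the Instructor library.
class Order(BaseModel):
    order_id: int
    amount: float

# Valid JSON that an API's JSON mode would happily return,
# but with a wrong type and a missing required field.
bad_payload = '{"order_id": "not-a-number"}'

errors = []
try:
    Order.model_validate_json(bad_payload)
except ValidationError as exc:
    # Both problems are reported: the unparsable order_id
    # and the absent amount field.
    errors = exc.errors()
print(len(errors))
```

It is exactly this structured error report that Instructor can feed back to the model on retry, rather than silently passing a malformed payload downstream.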
Core Highlights
Schema definer uses Pydantic models to specify expected output structure with field types and validators. Response parser validates LLM output against the schema returning typed objects. Retry handler re-prompts with validation errors when output fails schema checks. Stream parser yields partial validated objects during generation.
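The retry handler's feedback loop can be sketched with a stubbed model function. Every name here is illustrative (this is not the library's internal implementation): the stub returns an invalid payload until the prompt contains error feedback, mirroring how Instructor re-prompts with validation errors:

```python
from pydantic import BaseModel, ValidationError

class Sentiment(BaseModel):
    label: str
    score: float

def fake_llm(prompt: str) -> str:
    # Stub: returns a bad payload first, and a valid one
    # once the prompt includes error feedback.
    if 'Fix these errors' in prompt:
        return '{"label": "positive", "score": 0.9}'
    return '{"label": "positive"}'  # missing required field

def extract_with_retries(prompt: str, max_retries: int = 3) -> Sentiment:
    for _ in range(max_retries):
        raw = fake_llm(prompt)
        try:
            return Sentiment.model_validate_json(raw)
        except ValidationError as exc:
            # Re-prompt with the validation errors as context,
            # the same idea as Instructor's retry mechanism.
            prompt = f'{prompt}\nFix these errors: {exc.errors()}'
    raise RuntimeError('extraction failed after retries')

result = extract_with_retries('Classify: great product!')
```

With the real library this loop is handled for you by passing `max_retries` to the patched client's `create` call.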
How to Use It?
Basic Usage
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field


class ContactInfo(BaseModel):
    name: str = Field(description='Full name')
    email: str = Field(description='Email address')
    phone: str | None = Field(default=None, description='Phone number')
    company: str | None = Field(default=None, description='Company name')


# Patch the OpenAI client so create() accepts response_model
# and returns validated Pydantic objects.
client = instructor.patch(OpenAI())


def extract_contact(text: str) -> ContactInfo:
    return client.chat.completions.create(
        model='gpt-4o',
        response_model=ContactInfo,
        messages=[{'role': 'user', 'content': f'Extract contact info: {text}'}],
    )

Real-World Examples
from pydantic import BaseModel, Field, field_validator


class Product(BaseModel):
    name: str
    price: float = Field(gt=0)
    currency: str = Field(pattern=r'^[A-Z]{3}$')
    category: str

    @field_validator('currency')
    @classmethod
    def valid_currency(cls, v: str) -> str:
        allowed = {'USD', 'EUR', 'GBP', 'JPY'}
        if v not in allowed:
            raise ValueError(f'{v} not in {allowed}')
        return v


class Extraction:
    def __init__(self, client, max_retries: int = 3):
        self.client = client
        self.retries = max_retries

    def extract_products(self, texts: list[str]) -> list[Product]:
        results = []
        for text in texts:
            product = self.client.chat.completions.create(
                model='gpt-4o',
                response_model=Product,
                max_retries=self.retries,
                messages=[{'role': 'user', 'content': f'Extract: {text}'}],
            )
            results.append(product)
        return results

Advanced Tips
Use Pydantic field descriptions as implicit prompt engineering since the model receives field names and descriptions as schema context. Add custom validators that encode business rules directly in the response model for domain-specific extraction accuracy. Use Instructor with Anthropic or other providers by patching the appropriate client class.
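The "descriptions as implicit prompt engineering" point can be verified directly: Pydantic emits field descriptions into the JSON schema, which is the schema context the model receives. A minimal check with a hypothetical Ticket model:

```python
from pydantic import BaseModel, Field

# Illustrative model; the description doubles as guidance to the LLM.
class Ticket(BaseModel):
    severity: str = Field(description='One of: low, medium, high')

schema = Ticket.model_json_schema()
# The description travels with the schema the model sees.
print(schema['properties']['severity']['description'])
```

Because the description ends up verbatim in the schema, constraining phrases like "One of: low, medium, high" steer generation without any change to the prompt itself.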
When to Use It?
Use Cases
Extract structured contact information from unstructured email text with validated fields. Build a product data extraction pipeline that outputs Pydantic objects with price and currency validation. Parse meeting notes into structured action items with assignees and deadlines.
Related Topics
Structured output, Pydantic, data extraction, LLM pipelines, response validation, type safety, and prompt engineering.
Important Notes
Requirements
Instructor library installed with the target LLM provider SDK. Pydantic v2 for model definition and validation. API access to a language model service.
Usage Recommendations
Do: write descriptive field names and Pydantic descriptions that guide the model toward correct extraction. Set appropriate max_retries for production use to handle occasional validation failures. Use optional fields with defaults for information that may not be present in all inputs.
Don't: create deeply nested Pydantic models that exceed the model's ability to produce valid complex JSON in one pass. Rely on retries alone without improving the prompt when extraction consistently fails. Skip field validators that catch domain-specific errors the model may produce.
Limitations
Extraction accuracy depends on the underlying language model's capability and may vary across providers. Retry logic increases latency and cost when validation failures are frequent. Complex nested schemas with many required fields have higher failure rates that compound across extraction pipeline steps.