Langsmith
Automate and integrate LangSmith observability and testing into your LLM pipelines
Langsmith is a community skill for monitoring and debugging LLM application pipelines using LangSmith, covering trace logging, prompt versioning, evaluation runs, dataset management, and performance monitoring for production LLM systems.
What Is This?
Overview
Langsmith provides tools for observability and evaluation of language model applications through the LangSmith platform. It covers trace logging, which captures complete execution traces of LLM chains including prompts, completions, latency, and token usage; prompt versioning, which manages prompt template iterations with comparison across versions; evaluation runs, which execute test suites against LLM outputs with configurable scoring criteria; dataset management, which creates and maintains evaluation datasets with input-output pairs and ground-truth labels; and performance monitoring, which tracks latency, cost, and quality metrics across production traffic. The skill enables teams to build reliable LLM applications with systematic observability.
Who Should Use This
This skill serves AI engineers debugging LLM pipeline behavior, MLOps teams monitoring LLM application performance, and prompt engineers iterating on prompt designs with evaluation data.
Why Use It?
Problems It Solves
LLM application debugging is difficult without trace visibility into each individual step of a multi-component chain or agent workflow. Prompt changes lack systematic evaluation against representative test cases, making it unclear whether modifications improve or degrade output quality. Production LLM quality degrades silently without monitoring that detects drift in output characteristics over time. Evaluation datasets are managed informally rather than through versioned and curated test suites.
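To illustrate per-step trace visibility, here is a minimal sketch of a two-step pipeline in which each stage is wrapped with LangSmith's @traceable decorator so it appears as its own child run inside a single trace. The pipeline functions (retrieve_context, answer_question) and the hard-coded context are hypothetical; only @traceable and wrap_openai come from the SDK.

from langsmith import traceable
from langsmith.wrappers import wrap_openai
from openai import OpenAI

openai_client = wrap_openai(OpenAI())  # completion calls are traced automatically

@traceable(name='retrieve_context')
def retrieve_context(question: str) -> str:
    # Hypothetical retrieval step; it shows up as a separate span in the trace.
    return 'LangSmith records traces of LLM application runs.'

@traceable(name='answer_question')
def answer_question(question: str) -> str:
    context = retrieve_context(question)
    response = openai_client.chat.completions.create(
        model='gpt-4o',
        messages=[{'role': 'user',
                   'content': f'Context: {context}\n\nQuestion: {question}'}],
    )
    return response.choices[0].message.content

Inspecting the resulting trace in LangSmith shows the retrieval span nested under the answer span, with the wrapped OpenAI call and its token usage beneath it.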
Core Highlights
Trace logger captures full execution traces with prompt, completion, and metadata at each chain step. Prompt manager versions templates with side-by-side comparison tooling across iterations. Evaluator runs test datasets through pipelines with automated scoring against configurable criteria. Monitor tracks latency, token usage, error rates, and quality metrics in production.
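The trace logger and evaluator are demonstrated in the usage sections below. For the prompt manager, a minimal sketch follows, assuming the push_prompt/pull_prompt helpers available in recent versions of the LangSmith Python SDK (they require langchain-core for the prompt object); the prompt name 'support-reply' is hypothetical.

from langsmith import Client
from langchain_core.prompts import ChatPromptTemplate

client = Client()

# Each push creates a new committed version of the template.
prompt = ChatPromptTemplate.from_messages([
    ('system', 'You are a concise support assistant.'),
    ('user', '{question}'),
])
client.push_prompt('support-reply', object=prompt)

# Pull the latest committed version back for reuse or side-by-side comparison.
latest = client.pull_prompt('support-reply')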
How to Use It?
Basic Usage
import os

from langsmith import Client
from langsmith.wrappers import wrap_openai
from openai import OpenAI

# Enable tracing and authenticate against LangSmith.
os.environ['LANGCHAIN_TRACING_V2'] = 'true'
os.environ['LANGCHAIN_API_KEY'] = 'ls-...'

client = Client()

# Wrap the OpenAI client so every completion call is logged as a trace.
openai_client = wrap_openai(OpenAI())

def traced_call(prompt: str, model: str = 'gpt-4o') -> str:
    response = openai_client.chat.completions.create(
        model=model,
        messages=[{'role': 'user', 'content': prompt}],
    )
    return response.choices[0].message.content

Real-World Examples
from langsmith import Client
from langsmith.evaluation import evaluate

client = Client()

def create_dataset(name: str, examples: list[dict]):
    # Create a named dataset and add one example per input/output pair.
    dataset = client.create_dataset(dataset_name=name)
    for ex in examples:
        client.create_example(
            inputs=ex['input'],
            outputs=ex['output'],
            dataset_id=dataset.id,
        )
    return dataset

def accuracy_scorer(run, example) -> dict:
    # Exact-match scoring against the ground-truth 'result' field.
    predicted = run.outputs.get('result', '')
    expected = example.outputs.get('result', '')
    score = 1.0 if predicted.strip() == expected.strip() else 0.0
    return {'key': 'accuracy', 'score': score}

def target(inputs: dict) -> dict:
    # Adapter so evaluate() can call traced_call with the dataset's assumed
    # 'question' input key and return the 'result' key the scorer expects.
    return {'result': traced_call(inputs['question'])}

results = evaluate(
    target,
    data='my-dataset',
    evaluators=[accuracy_scorer],
)

Advanced Tips
Use LangSmith annotations to label production traces with quality ratings, building an ongoing evaluation dataset from real traffic. Set up monitoring rules that alert when output latency or quality scores deviate from baselines. Compare evaluation results across prompt versions to quantify the impact of each change before promoting it to production.
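One hedged way to run that prompt-version comparison is to evaluate each variant as its own experiment over the same dataset and compare the aggregate scores afterwards. The sketch below reuses accuracy_scorer and the 'my-dataset' name from the example above, assumes the dataset's input key matches the {question} placeholder, and uses the experiment_prefix argument of evaluate to label each run.

from langsmith.evaluation import evaluate

def make_target(prompt_template: str):
    # Wrap one prompt variant as an evaluation target.
    def target(inputs: dict) -> dict:
        return {'result': traced_call(prompt_template.format(**inputs))}
    return target

for version, template in [
    ('v1', 'Answer briefly: {question}'),
    ('v2', 'Think step by step, then give a final answer: {question}'),
]:
    evaluate(
        make_target(template),
        data='my-dataset',
        evaluators=[accuracy_scorer],
        experiment_prefix=f'prompt-{version}',
    )

The two experiments then appear side by side in LangSmith, where per-version accuracy can be compared before either variant is promoted.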
When to Use It?
Use Cases
Debug a multi-step LLM chain by inspecting traces at each component step. Run prompt regression tests against a curated evaluation dataset before deploying changes. Monitor production LLM application quality with latency and cost tracking.
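For the monitoring use case, a rough sketch of pulling recent production runs and summarising latency and error rate is shown below; it assumes a project named 'my-llm-app' and the list_runs client method with project_name and is_root filters, and run field names may differ slightly across SDK versions.

from datetime import datetime, timedelta, timezone
from langsmith import Client

client = Client()

# Fetch root runs from the last hour for a hypothetical project.
runs = list(client.list_runs(
    project_name='my-llm-app',
    start_time=datetime.now(timezone.utc) - timedelta(hours=1),
    is_root=True,
))

latencies = [
    (run.end_time - run.start_time).total_seconds()
    for run in runs
    if run.end_time is not None
]
error_count = sum(1 for run in runs if run.error)

if runs:
    avg_latency = sum(latencies) / len(latencies) if latencies else 0.0
    print(f'runs={len(runs)} errors={error_count} avg_latency={avg_latency:.2f}s')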
Related Topics
LLM observability, LangSmith, prompt evaluation, trace logging, LLM monitoring, dataset management, and AI application debugging.
Important Notes
Requirements
LangSmith account with API key and project workspace configured. LangSmith Python SDK installed. Environment variables set to enable tracing in the target runtime.
Usage Recommendations
Do: enable tracing in development and production environments for complete visibility. Build evaluation datasets from real production examples to test against realistic inputs. Version prompts alongside evaluation results for traceability.
Don't: log sensitive user data in traces without appropriate data-handling policies, deploy prompt changes without running evaluation suites to verify quality, or rely solely on automated metrics without periodic human review of trace quality.
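For the sensitive-data point, a hedged sketch follows, using the hide_inputs/hide_outputs options that the LangSmith Client accepts in recent SDK versions to redact payloads before they are uploaded; the field names being masked are hypothetical.

from langsmith import Client

def redact(payload: dict) -> dict:
    # Mask hypothetical sensitive fields before traces leave the process.
    return {k: ('[REDACTED]' if k in {'email', 'ssn'} else v)
            for k, v in payload.items()}

client = Client(hide_inputs=redact, hide_outputs=redact)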
Limitations
Trace logging adds latency overhead from network calls to the LangSmith service. Evaluation scoring requires defining quality criteria, which can be subjective for open-ended generation tasks. Free-tier usage limits may constrain high-volume production monitoring. Self-hosted deployment is limited to enterprise plans, so on other plans all trace data is stored on LangSmith-managed servers.