Langsmith
Automate and integrate LangSmith observability and testing into your LLM pipelines
Langsmith is a community skill for monitoring and debugging LLM application pipelines using LangSmith, covering trace logging, prompt versioning, evaluation runs, dataset management, and performance monitoring for production LLM systems.
What Is This?
Overview
Langsmith provides tools for observability and evaluation of language model applications through the LangSmith platform. It covers trace logging, which captures complete execution traces of LLM chains including prompts, completions, latency, and token usage; prompt versioning, which manages prompt template iterations with comparison across versions; evaluation runs, which execute test suites against LLM outputs with configurable scoring criteria; dataset management, which creates and maintains evaluation datasets with input-output pairs and ground-truth labels; and performance monitoring, which tracks latency, cost, and quality metrics across production traffic. The skill enables teams to build reliable LLM applications with systematic observability.
Who Should Use This
This skill serves AI engineers debugging LLM pipeline behavior, MLOps teams monitoring LLM application performance, and prompt engineers iterating on prompt designs with evaluation data.
Why Use It?
Problems It Solves
LLM application debugging is difficult without trace visibility into each individual step of a multi-component chain or agent workflow. Prompt changes lack systematic evaluation against representative test cases, making it unclear whether modifications improve or degrade output quality. Production LLM quality degrades silently without monitoring that detects drift in output characteristics over time. Evaluation datasets are managed informally rather than through versioned and curated test suites.
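To illustrate per-step trace visibility, here is a minimal sketch of a two-step pipeline in which each stage is wrapped with LangSmith's @traceable decorator so it appears as its own child run inside a single trace. The pipeline functions (retrieve_context, answer_question) and the hard-coded context are hypothetical; only @traceable and wrap_openai come from the SDK.

from langsmith import traceable
from langsmith.wrappers import wrap_openai
from openai import OpenAI

openai_client = wrap_openai(OpenAI())  # completion calls are traced automatically

@traceable(name='retrieve_context')
def retrieve_context(question: str) -> str:
    # Hypothetical retrieval step; it shows up as a separate span in the trace.
    return 'LangSmith records traces of LLM application runs.'

@traceable(name='answer_question')
def answer_question(question: str) -> str:
    context = retrieve_context(question)
    response = openai_client.chat.completions.create(
        model='gpt-4o',
        messages=[{'role': 'user',
                   'content': f'Context: {context}\n\nQuestion: {question}'}],
    )
    return response.choices[0].message.content

Inspecting the resulting trace in LangSmith shows the retrieval span nested under the answer span, with the wrapped OpenAI call and its token usage beneath it.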
Core Highlights
Trace logger captures full execution traces with prompt, completion, and metadata at each chain step. Prompt manager versions templates with side-by-side comparison tooling across iterations. Evaluator runs test datasets through pipelines with automated scoring against configurable criteria. Monitor tracks latency, token usage, error rates, and quality metrics in production.
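The trace logger and evaluator are demonstrated in the usage sections below. For the prompt manager, a minimal sketch follows, assuming the push_prompt/pull_prompt helpers available in recent versions of the LangSmith Python SDK (they require langchain-core for the prompt object); the prompt name 'support-reply' is hypothetical.

from langsmith import Client
from langchain_core.prompts import ChatPromptTemplate

client = Client()

# Each push creates a new committed version of the template.
prompt = ChatPromptTemplate.from_messages([
    ('system', 'You are a concise support assistant.'),
    ('user', '{question}'),
])
client.push_prompt('support-reply', object=prompt)

# Pull the latest committed version back for reuse or side-by-side comparison.
latest = client.pull_prompt('support-reply')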
How to Use It?
Basic Usage
import os

from langsmith import Client
from langsmith.wrappers import wrap_openai
from openai import OpenAI

# Enable tracing and authenticate against LangSmith.
os.environ['LANGCHAIN_TRACING_V2'] = 'true'
os.environ['LANGCHAIN_API_KEY'] = 'ls-...'

client = Client()

# Wrap the OpenAI client so every completion call is logged as a trace.
openai_client = wrap_openai(OpenAI())

def traced_call(prompt: str, model: str = 'gpt-4o') -> str:
    response = openai_client.chat.completions.create(
        model=model,
        messages=[{'role': 'user', 'content': prompt}],
    )
    return response.choices[0].message.content

Real-World Examples
from langsmith import Client
from langsmith.evaluation import evaluate

client = Client()

def create_dataset(name: str, examples: list[dict]):
    # Create a named dataset and add one example per input/output pair.
    dataset = client.create_dataset(dataset_name=name)
    for ex in examples:
        client.create_example(
            inputs=ex['input'],
            outputs=ex['output'],
            dataset_id=dataset.id,
        )
    return dataset

def accuracy_scorer(run, example) -> dict:
    # Exact-match scoring against the ground-truth 'result' field.
    predicted = run.outputs.get('result', '')
    expected = example.outputs.get('result', '')
    score = 1.0 if predicted.strip() == expected.strip() else 0.0
    return {'key': 'accuracy', 'score': score}

def target(inputs: dict) -> dict:
    # Adapter so evaluate() can call traced_call with the dataset's assumed
    # 'question' input key and return the 'result' key the scorer expects.
    return {'result': traced_call(inputs['question'])}

results = evaluate(
    target,
    data='my-dataset',
    evaluators=[accuracy_scorer],
)

Advanced Tips
Use LangSmith annotations to label production traces with quality ratings, building an ongoing evaluation dataset from real traffic. Set up monitoring rules that alert when output latency or quality scores deviate from baselines. Compare evaluation results across prompt versions to quantify the impact of each change before promoting it to production.
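One hedged way to run that prompt-version comparison is to evaluate each variant as its own experiment over the same dataset and compare the aggregate scores afterwards. The sketch below reuses accuracy_scorer and the 'my-dataset' name from the example above, assumes the dataset's input key matches the {question} placeholder, and uses the experiment_prefix argument of evaluate to label each run.

from langsmith.evaluation import evaluate

def make_target(prompt_template: str):
    # Wrap one prompt variant as an evaluation target.
    def target(inputs: dict) -> dict:
        return {'result': traced_call(prompt_template.format(**inputs))}
    return target

for version, template in [
    ('v1', 'Answer briefly: {question}'),
    ('v2', 'Think step by step, then give a final answer: {question}'),
]:
    evaluate(
        make_target(template),
        data='my-dataset',
        evaluators=[accuracy_scorer],
        experiment_prefix=f'prompt-{version}',
    )

The two experiments then appear side by side in LangSmith, where per-version accuracy can be compared before either variant is promoted.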
When to Use It?
Use Cases
Debug a multi-step LLM chain by inspecting traces at each component step. Run prompt regression tests against a curated evaluation dataset before deploying changes. Monitor production LLM application quality with latency and cost tracking.
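For the monitoring use case, a rough sketch of pulling recent production runs and summarising latency and error rate is shown below; it assumes a project named 'my-llm-app' and the list_runs client method with project_name and is_root filters, and run field names may differ slightly across SDK versions.

from datetime import datetime, timedelta, timezone
from langsmith import Client

client = Client()

# Fetch root runs from the last hour for a hypothetical project.
runs = list(client.list_runs(
    project_name='my-llm-app',
    start_time=datetime.now(timezone.utc) - timedelta(hours=1),
    is_root=True,
))

latencies = [
    (run.end_time - run.start_time).total_seconds()
    for run in runs
    if run.end_time is not None
]
error_count = sum(1 for run in runs if run.error)

if runs:
    avg_latency = sum(latencies) / len(latencies) if latencies else 0.0
    print(f'runs={len(runs)} errors={error_count} avg_latency={avg_latency:.2f}s')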
Related Topics
LLM observability, LangSmith, prompt evaluation, trace logging, LLM monitoring, dataset management, and AI application debugging.
Important Notes
Requirements
LangSmith account with API key and project workspace configured. LangSmith Python SDK installed. Environment variables set to enable tracing in the target runtime.
Usage Recommendations
Do: enable tracing in development and production environments for complete visibility. Build evaluation datasets from real production examples to test against realistic inputs. Version prompts alongside evaluation results for traceability.
Don't: log sensitive user data in traces without appropriate data-handling policies, deploy prompt changes without running evaluation suites to verify quality, or rely solely on automated metrics without periodic human review of trace quality.
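For the sensitive-data point, a hedged sketch follows, using the hide_inputs/hide_outputs options that the LangSmith Client accepts in recent SDK versions to redact payloads before they are uploaded; the field names being masked are hypothetical.

from langsmith import Client

def redact(payload: dict) -> dict:
    # Mask hypothetical sensitive fields before traces leave the process.
    return {k: ('[REDACTED]' if k in {'email', 'ssn'} else v)
            for k, v in payload.items()}

client = Client(hide_inputs=redact, hide_outputs=redact)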
Limitations
Trace logging adds latency overhead from network calls to the LangSmith service. Evaluation scoring requires defining quality criteria, which can be subjective for open-ended generation tasks. Free-tier usage limits may constrain high-volume production monitoring. Self-hosted deployment is limited to enterprise plans, so on other plans all trace data is stored on LangSmith-managed servers.