Llamaguard
Llamaguard automation and integration for AI safety and content moderation
Category: productivity
Source: Orchestra-Research/AI-Research-SKILLs

Llamaguard is a community skill for implementing content safety classification with Llama Guard models. It covers safety policy configuration, input screening, output filtering, risk category taxonomy, and moderation pipeline integration for LLM application safety.
What Is This?
Overview
Llamaguard provides tools for adding content safety checks to language model applications using Meta's Llama Guard classifier models. It covers:

- Safety policy configuration that defines custom risk categories and severity thresholds for content classification decisions
- Input screening that checks user prompts against safety policies before they reach the generation model
- Output filtering that evaluates model responses for policy violations before delivery to users
- A risk category taxonomy that maps content to categories including violence, sexual content, criminal activity, and self-harm, with configurable granularity
- Moderation pipeline integration that adds safety classification as middleware in LLM request processing chains

The skill enables teams to deploy content moderation for LLM applications with customizable safety policies.
Who Should Use This
This skill serves AI safety engineers building content moderation systems, application developers adding safety layers to LLM products, and trust and safety teams configuring content policies.
Why Use It?
Problems It Solves
LLM applications without input screening will process adversarial prompts crafted to extract harmful content from the model. Keyword-based output filtering misses nuanced policy violations that require contextual understanding. Training a custom safety classifier demands substantial data and compute, whereas Llama Guard provides a pre-trained baseline. Moderation logic scattered across application code lacks consistent policy enforcement.
Core Highlights
- Policy definer: configures risk categories with custom labels and severity levels (see the sketch below)
- Input classifier: screens user prompts against defined safety policies
- Output evaluator: checks model responses for policy violations before delivery
- Pipeline connector: adds safety checks as middleware in request processing flows
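To make the policy definer concrete, here is a minimal sketch that models risk categories as plain data. The RiskCategory structure, category codes, and severity values are illustrative assumptions, not an API defined by the skill.

from dataclasses import dataclass

@dataclass
class RiskCategory:
    # Hypothetical policy record, for illustration only.
    code: str         # short identifier, e.g. 'S1'
    label: str        # human-readable category name
    description: str  # guidance the classifier should apply
    severity: str     # action on violation, e.g. 'block' or 'flag'

CUSTOM_POLICY = [
    RiskCategory('S1', 'Violent Crimes',
                 'Content that enables or encourages violent acts.', 'block'),
    RiskCategory('S2', 'Self-Harm',
                 'Content that encourages or enables self-harm.', 'block'),
    RiskCategory('S3', 'Unqualified Medical Advice',
                 'Domain-specific: medical guidance without qualification.',
                 'flag'),
]

The Advanced Tips section below shows how such a policy block can be prepended to the classifier prompt.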
How to Use It?
Basic Usage
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

class SafetyClassifier:
    def __init__(self, model_id: str = 'meta-llama/Llama-Guard-3-8B'):
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype=torch.bfloat16,
            device_map='auto',
        )

    def classify(self, conversation: list[dict]) -> dict:
        # Llama Guard's chat template wraps the conversation in its
        # safety-assessment prompt and returns token ids.
        inputs = self.tokenizer.apply_chat_template(
            conversation, return_tensors='pt')
        output = self.model.generate(
            inputs.to(self.model.device), max_new_tokens=20)
        # Decode only the newly generated tokens: the prompt itself
        # mentions 'safe', and 'safe' is a substring of 'unsafe', so
        # substring-matching the full decoded sequence would misclassify.
        # The verdict begins with 'safe' or 'unsafe'.
        verdict = self.tokenizer.decode(
            output[0][inputs.shape[-1]:], skip_special_tokens=True).strip()
        return {'safe': verdict.lower().startswith('safe'), 'raw': verdict}
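A quick usage sketch (assumes access to the gated meta-llama/Llama-Guard-3-8B repository and a GPU; the example prompt and printed output are illustrative):

classifier = SafetyClassifier()
result = classifier.classify(
    [{'role': 'user', 'content': 'How do I sharpen a kitchen knife?'}])
print(result)  # e.g. {'safe': True, 'raw': 'safe'}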
Real-World Examples
class ModerationLayer:
    """Wraps a SafetyClassifier to screen both sides of a conversation."""

    def __init__(self, classifier: SafetyClassifier):
        self.clf = classifier
        self.blocked = []  # record of blocked messages for auditing

    def check_input(self, user_msg: str) -> dict:
        # Screen the user prompt before it reaches the generation model.
        conv = [{'role': 'user', 'content': user_msg}]
        result = self.clf.classify(conv)
        if not result['safe']:
            self.blocked.append({'type': 'input', 'msg': user_msg})
        return result

    def check_output(self, user_msg: str, assistant_msg: str) -> dict:
        # Classify the response in the context of the prompt that
        # produced it, before delivery to the user.
        conv = [
            {'role': 'user', 'content': user_msg},
            {'role': 'assistant', 'content': assistant_msg},
        ]
        result = self.clf.classify(conv)
        if not result['safe']:
            self.blocked.append({'type': 'output', 'msg': assistant_msg})
        return result

    def stats(self) -> dict:
        # Aggregate counts, e.g. for a moderation dashboard.
        return {
            'total_blocked': len(self.blocked),
            'input_blocks': sum(
                1 for b in self.blocked if b['type'] == 'input'),
            'output_blocks': sum(
                1 for b in self.blocked if b['type'] == 'output'),
        }
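A sketch of wiring the layer into a request flow; generate_response stands in for the application's own generation call and is not defined here:

moderation = ModerationLayer(SafetyClassifier())

user_msg = 'Summarize this article for me.'
if not moderation.check_input(user_msg)['safe']:
    reply = 'Sorry, I cannot help with that request.'
else:
    reply = generate_response(user_msg)  # hypothetical LLM call
    if not moderation.check_output(user_msg, reply)['safe']:
        reply = 'Sorry, I cannot share that response.'

print(moderation.stats())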
Advanced Tips
Customize the safety taxonomy by prepending policy definitions to the input prompt so Llama Guard evaluates against domain-specific categories. Run input and output classification in parallel with the generation model to reduce latency in the moderation pipeline. Log all classification decisions for auditing and policy refinement over time.
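A minimal sketch of the first tip, reusing the RiskCategory records from the Core Highlights sketch. The prompt below follows the general shape of Llama Guard's published prompt format, but the exact template differs across model versions, so treat both the wording and the build_policy_prompt helper as assumptions to check against your model card.

def build_policy_prompt(categories: list,
                        conversation: list[dict]) -> str:
    # Hypothetical helper: substitutes custom categories for the
    # default taxonomy in a Llama-Guard-style prompt.
    policy = '\n'.join(
        f'{c.code}: {c.label}.\n{c.description}' for c in categories)
    turns = '\n\n'.join(
        f"{m['role'].capitalize()}: {m['content']}" for m in conversation)
    return (
        'Task: Check if there is unsafe content in the conversation '
        'according to our safety policy with the below categories.\n\n'
        '<BEGIN UNSAFE CONTENT CATEGORIES>\n'
        f'{policy}\n'
        '<END UNSAFE CONTENT CATEGORIES>\n\n'
        '<BEGIN CONVERSATION>\n\n'
        f'{turns}\n\n'
        '<END CONVERSATION>\n\n'
        'Provide your safety assessment for the above conversation:\n'
        "- First line must read 'safe' or 'unsafe'.\n"
        '- If unsafe, a second line must list the violated categories.'
    )

With a custom prompt, tokenize the string directly (self.tokenizer(prompt, return_tensors='pt')) instead of calling apply_chat_template, since the chat template typically inserts the model's default taxonomy.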
When to Use It?
Use Cases
Screen user inputs to a chatbot for adversarial prompts before passing them to the generation model. Filter model outputs that contain policy-violating content before they are displayed to users. Build a moderation dashboard that tracks safety classification statistics across production traffic.
Related Topics
Content moderation, Llama Guard, AI safety, content classification, LLM safety, prompt screening, and trust and safety.
Important Notes
Requirements
A GPU with enough memory for Llama Guard inference (the 8B model in bfloat16 needs roughly 16 GB for the weights alone). The Hugging Face Transformers library, plus access credentials for the gated meta-llama model repositories. A PyTorch installation with CUDA support.
Usage Recommendations
Do:
- Apply both input and output classification for comprehensive coverage, since unsafe inputs may produce safe outputs and vice versa.
- Customize the safety taxonomy to match your application domain and user context.
- Log blocked content for policy refinement and false positive analysis (see the sketch after these lists).
Don't:
- Rely solely on Llama Guard without human review processes for edge cases and appeals.
- Use safety classification as the only defense layer; complement it with measures such as rate limiting.
- Assume the default taxonomy covers all risk categories relevant to your application.
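As a sketch of the logging recommendation, classification decisions could be written as structured records; the logger name and field choices here are illustrative:

import json
import logging
import time

audit_log = logging.getLogger('moderation.audit')

def log_decision(kind: str, message: str, result: dict) -> None:
    # Structured record for later policy refinement and
    # false-positive analysis.
    audit_log.info(json.dumps({
        'ts': time.time(),
        'kind': kind,            # 'input' or 'output'
        'safe': result['safe'],
        'verdict': result['raw'],
        'message': message,
    }))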
Limitations
Classification accuracy depends on the Llama Guard model version and may not cover all content risk types. Inference latency adds processing time to each request in the moderation pipeline. Custom policy categories require prompt engineering that may need iteration to achieve desired classification behavior.