Llamaguard
Llamaguard automation and integration for AI safety and content moderation
Category: productivity
Source: Orchestra-Research/AI-Research-SKILLs

Llamaguard is a community skill for implementing content safety classification with Llama Guard models. It covers safety policy configuration, input screening, output filtering, risk category taxonomy, and moderation pipeline integration for LLM application safety.
What Is This?
Overview
Llamaguard provides tools for adding content safety checks to language model applications using Meta's Llama Guard classifier models. It covers:

- Safety policy configuration that defines custom risk categories and severity thresholds for content classification decisions
- Input screening that checks user prompts against safety policies before they reach the generation model
- Output filtering that evaluates model responses for policy violations before delivery to users
- A risk category taxonomy that maps content to categories including violence, sexual content, criminal activity, and self-harm, with configurable granularity
- Moderation pipeline integration that adds safety classification as middleware in LLM request processing chains

The skill enables teams to deploy content moderation for LLM applications with customizable safety policies.
Who Should Use This
This skill serves AI safety engineers building content moderation systems, application developers adding safety layers to LLM products, and trust and safety teams configuring content policies.
Why Use It?
Problems It Solves
LLM applications without input screening will process adversarial prompts crafted to extract harmful content from the model. Keyword-based output filtering misses nuanced policy violations that require contextual understanding. Training a custom safety classifier demands substantial data and compute, whereas Llama Guard provides a pre-trained baseline. Moderation logic scattered across application code lacks consistent policy enforcement.
Core Highlights
- Policy definer: configures risk categories with custom labels and severity levels (see the sketch below)
- Input classifier: screens user prompts against defined safety policies
- Output evaluator: checks model responses for policy violations before delivery
- Pipeline connector: adds safety checks as middleware in request processing flows
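To make the policy definer concrete, here is a minimal sketch that models risk categories as plain data. The RiskCategory structure, category codes, and severity values are illustrative assumptions, not an API defined by the skill.

from dataclasses import dataclass

@dataclass
class RiskCategory:
    # Hypothetical policy record, for illustration only.
    code: str         # short identifier, e.g. 'S1'
    label: str        # human-readable category name
    description: str  # guidance the classifier should apply
    severity: str     # action on violation, e.g. 'block' or 'flag'

CUSTOM_POLICY = [
    RiskCategory('S1', 'Violent Crimes',
                 'Content that enables or encourages violent acts.', 'block'),
    RiskCategory('S2', 'Self-Harm',
                 'Content that encourages or enables self-harm.', 'block'),
    RiskCategory('S3', 'Unqualified Medical Advice',
                 'Domain-specific: medical guidance without qualification.',
                 'flag'),
]

The Advanced Tips section below shows how such a policy block can be prepended to the classifier prompt.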
How to Use It?
Basic Usage
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

class SafetyClassifier:
    def __init__(self, model_id: str = 'meta-llama/Llama-Guard-3-8B'):
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype=torch.bfloat16,
            device_map='auto',
        )

    def classify(self, conversation: list[dict]) -> dict:
        # Llama Guard's chat template wraps the conversation in its
        # safety-assessment prompt and returns token ids.
        inputs = self.tokenizer.apply_chat_template(
            conversation, return_tensors='pt')
        output = self.model.generate(
            inputs.to(self.model.device), max_new_tokens=20)
        # Decode only the newly generated tokens: the prompt itself
        # mentions 'safe', and 'safe' is a substring of 'unsafe', so
        # substring-matching the full decoded sequence would misclassify.
        # The verdict begins with 'safe' or 'unsafe'.
        verdict = self.tokenizer.decode(
            output[0][inputs.shape[-1]:], skip_special_tokens=True).strip()
        return {'safe': verdict.lower().startswith('safe'), 'raw': verdict}
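A quick usage sketch (assumes access to the gated meta-llama/Llama-Guard-3-8B repository and a GPU; the example prompt and printed output are illustrative):

classifier = SafetyClassifier()
result = classifier.classify(
    [{'role': 'user', 'content': 'How do I sharpen a kitchen knife?'}])
print(result)  # e.g. {'safe': True, 'raw': 'safe'}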
Real-World Examples
class ModerationLayer:
    """Wraps a SafetyClassifier to screen both sides of a conversation."""

    def __init__(self, classifier: SafetyClassifier):
        self.clf = classifier
        self.blocked = []  # record of blocked messages for auditing

    def check_input(self, user_msg: str) -> dict:
        # Screen the user prompt before it reaches the generation model.
        conv = [{'role': 'user', 'content': user_msg}]
        result = self.clf.classify(conv)
        if not result['safe']:
            self.blocked.append({'type': 'input', 'msg': user_msg})
        return result

    def check_output(self, user_msg: str, assistant_msg: str) -> dict:
        # Classify the response in the context of the prompt that
        # produced it, before delivery to the user.
        conv = [
            {'role': 'user', 'content': user_msg},
            {'role': 'assistant', 'content': assistant_msg},
        ]
        result = self.clf.classify(conv)
        if not result['safe']:
            self.blocked.append({'type': 'output', 'msg': assistant_msg})
        return result

    def stats(self) -> dict:
        # Aggregate counts, e.g. for a moderation dashboard.
        return {
            'total_blocked': len(self.blocked),
            'input_blocks': sum(
                1 for b in self.blocked if b['type'] == 'input'),
            'output_blocks': sum(
                1 for b in self.blocked if b['type'] == 'output'),
        }
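A sketch of wiring the layer into a request flow; generate_response stands in for the application's own generation call and is not defined here:

moderation = ModerationLayer(SafetyClassifier())

user_msg = 'Summarize this article for me.'
if not moderation.check_input(user_msg)['safe']:
    reply = 'Sorry, I cannot help with that request.'
else:
    reply = generate_response(user_msg)  # hypothetical LLM call
    if not moderation.check_output(user_msg, reply)['safe']:
        reply = 'Sorry, I cannot share that response.'

print(moderation.stats())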
Advanced Tips
Customize the safety taxonomy by prepending policy definitions to the input prompt so Llama Guard evaluates against domain-specific categories. Run input and output classification in parallel with the generation model to reduce latency in the moderation pipeline. Log all classification decisions for auditing and policy refinement over time.
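A minimal sketch of the first tip, reusing the RiskCategory records from the Core Highlights sketch. The prompt below follows the general shape of Llama Guard's published prompt format, but the exact template differs across model versions, so treat both the wording and the build_policy_prompt helper as assumptions to check against your model card.

def build_policy_prompt(categories: list,
                        conversation: list[dict]) -> str:
    # Hypothetical helper: substitutes custom categories for the
    # default taxonomy in a Llama-Guard-style prompt.
    policy = '\n'.join(
        f'{c.code}: {c.label}.\n{c.description}' for c in categories)
    turns = '\n\n'.join(
        f"{m['role'].capitalize()}: {m['content']}" for m in conversation)
    return (
        'Task: Check if there is unsafe content in the conversation '
        'according to our safety policy with the below categories.\n\n'
        '<BEGIN UNSAFE CONTENT CATEGORIES>\n'
        f'{policy}\n'
        '<END UNSAFE CONTENT CATEGORIES>\n\n'
        '<BEGIN CONVERSATION>\n\n'
        f'{turns}\n\n'
        '<END CONVERSATION>\n\n'
        'Provide your safety assessment for the above conversation:\n'
        "- First line must read 'safe' or 'unsafe'.\n"
        '- If unsafe, a second line must list the violated categories.'
    )

With a custom prompt, tokenize the string directly (self.tokenizer(prompt, return_tensors='pt')) instead of calling apply_chat_template, since the chat template typically inserts the model's default taxonomy.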
When to Use It?
Use Cases
Screen user inputs to a chatbot for adversarial prompts before passing them to the generation model. Filter model outputs that contain policy-violating content before they are displayed to users. Build a moderation dashboard that tracks safety classification statistics across production traffic.
Related Topics
Content moderation, Llama Guard, AI safety, content classification, LLM safety, prompt screening, and trust and safety.
Important Notes
Requirements
A GPU with enough memory for Llama Guard inference (the 8B model in bfloat16 needs roughly 16 GB for the weights alone). The Hugging Face Transformers library, plus access credentials for the gated meta-llama model repositories. A PyTorch installation with CUDA support.
Usage Recommendations
Do:
- Apply both input and output classification for comprehensive coverage, since unsafe inputs may produce safe outputs and vice versa.
- Customize the safety taxonomy to match your application domain and user context.
- Log blocked content for policy refinement and false positive analysis (see the sketch after these lists).
Don't:
- Rely solely on Llama Guard without human review processes for edge cases and appeals.
- Use safety classification as the only defense layer; complement it with measures such as rate limiting.
- Assume the default taxonomy covers all risk categories relevant to your application.
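As a sketch of the logging recommendation, classification decisions could be written as structured records; the logger name and field choices here are illustrative:

import json
import logging
import time

audit_log = logging.getLogger('moderation.audit')

def log_decision(kind: str, message: str, result: dict) -> None:
    # Structured record for later policy refinement and
    # false-positive analysis.
    audit_log.info(json.dumps({
        'ts': time.time(),
        'kind': kind,            # 'input' or 'output'
        'safe': result['safe'],
        'verdict': result['raw'],
        'message': message,
    }))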
Limitations
Classification accuracy depends on the Llama Guard model version and may not cover all content risk types. Inference latency adds processing time to each request in the moderation pipeline. Custom policy categories require prompt engineering that may need iteration to achieve desired classification behavior.