Llama Factory
Llama Factory automation and integration for efficient LLM fine-tuning pipelines
Llama Factory is a community skill for fine-tuning large language models using the LLaMA-Factory framework, covering dataset preparation, training method selection, LoRA configuration, evaluation workflows, and model export.
What Is This?
Overview
Llama Factory provides patterns for fine-tuning language models through the LLaMA-Factory unified training interface. It covers supervised fine-tuning, RLHF training, DPO alignment, LoRA and QLoRA adapter configuration, dataset format requirements, web UI usage for no-code training, and model export to various deployment formats. The skill enables practitioners to fine-tune models using a streamlined workflow that supports over one hundred model architectures.
Who Should Use This
This skill serves ML engineers who need a unified interface for fine-tuning across different model families, researchers comparing training methods such as SFT, RLHF, and DPO on the same base model, and teams without deep infrastructure expertise who want to fine-tune models through a web interface.
Why Use It?
Problems It Solves
Each model family has different training script requirements, making it tedious to switch between architectures. Setting up fine-tuning pipelines from scratch requires writing boilerplate code for data loading, training loops, and checkpointing. Comparing training methods like SFT versus DPO on the same model requires rewriting significant portions of the training infrastructure. Non-technical team members cannot participate in model training without command-line expertise.
Core Highlights
Unified training interface supports supervised fine-tuning, reward modeling, PPO, DPO, and ORPO through configuration changes rather than code changes. Web UI provides a graphical interface for configuring and launching training runs without command-line interaction. Broad model support covers LLaMA, Mistral, Qwen, ChatGLM, and dozens of other architectures. Built-in evaluation runs benchmarks on fine-tuned models automatically after training.
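The claim that switching methods is a configuration change rather than a code change can be sketched with plain dictionaries; the key names below mirror common LLaMA-Factory argument names but the model and dataset values are illustrative placeholders:

```python
# Switching training methods is a configuration change, not a code change.
# The argument keys below mirror common LLaMA-Factory options; the model
# and dataset names are hypothetical examples.
base_args = {
    "model_name_or_path": "meta-llama/Llama-2-7b-hf",
    "dataset": "alpaca_en",
    "finetuning_type": "lora",
    "per_device_train_batch_size": 2,
}

sft_args = {**base_args, "stage": "sft", "output_dir": "saves/sft"}
dpo_args = {**base_args, "stage": "dpo", "output_dir": "saves/dpo"}

# Everything except the stage and output directory is shared between runs.
shared = {k: v for k, v in sft_args.items() if dpo_args.get(k) == v}
print(sorted(shared))
```

Comparing SFT and DPO on the same base model then amounts to launching two runs that differ only in those two fields.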
How to Use It?
Basic Usage
from dataclasses import dataclass
import json
from pathlib import Path

@dataclass
class LlamaFactoryConfig:
    model_name: str
    dataset: str
    output_dir: str
    stage: str = "sft"
    finetuning_type: str = "lora"
    lora_rank: int = 8
    lora_alpha: int = 16
    learning_rate: float = 5e-5
    num_train_epochs: int = 3
    per_device_batch_size: int = 2
    gradient_accumulation: int = 4
    quantization_bit: int = 0  # 0 disables quantization; 4 selects QLoRA-style training

    def to_args(self) -> dict:
        args = {
            "model_name_or_path": self.model_name,
            "dataset": self.dataset,
            "output_dir": self.output_dir,
            "stage": self.stage,
            "finetuning_type": self.finetuning_type,
            "lora_rank": self.lora_rank,
            "lora_alpha": self.lora_alpha,
            "learning_rate": self.learning_rate,
            "num_train_epochs": self.num_train_epochs,
            "per_device_train_batch_size": self.per_device_batch_size,
            "gradient_accumulation_steps": self.gradient_accumulation,
        }
        if self.quantization_bit:  # previously dropped from the emitted args
            args["quantization_bit"] = self.quantization_bit
        return args

    def save(self, path: str):
        Path(path).write_text(json.dumps(self.to_args(), indent=2))

Real-World Examples
from dataclasses import dataclass
import json
from pathlib import Path

@dataclass
class DatasetEntry:
    instruction: str
    input: str = ""
    output: str = ""

class DatasetPreparer:
    def __init__(self):
        self.entries: list[DatasetEntry] = []

    def add(self, instruction: str, output: str,
            input_text: str = ""):
        self.entries.append(DatasetEntry(
            instruction=instruction, input=input_text, output=output
        ))

    def save(self, path: str):
        data = [{"instruction": e.instruction, "input": e.input,
                 "output": e.output} for e in self.entries]
        Path(path).write_text(json.dumps(data, indent=2))

    def validate(self) -> dict:
        errors = []
        invalid_rows = set()
        for i, entry in enumerate(self.entries):
            if not entry.instruction.strip():
                errors.append(f"Row {i}: empty instruction")
                invalid_rows.add(i)
            if not entry.output.strip():
                errors.append(f"Row {i}: empty output")
                invalid_rows.add(i)
        # Count invalid rows, not error messages, so a row with two
        # problems is not subtracted twice from the valid count.
        return {"total": len(self.entries), "errors": errors,
                "valid": len(self.entries) - len(invalid_rows)}

preparer = DatasetPreparer()
preparer.add("Summarize the following article.",
             "The article discusses recent advances in NLP.")
preparer.add("Translate to French.",
             "Bonjour le monde.", "Hello world.")
print(preparer.validate())

Advanced Tips
Use QLoRA with 4-bit quantization to fine-tune large models on consumer GPUs with limited VRAM. Enable gradient checkpointing to trade compute time for memory savings on constrained hardware. Register custom datasets in the dataset_info.json configuration file to integrate proprietary training data into the framework.
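Registering a custom dataset can be sketched as below; the dataset_info.json entry shape (a file_name plus a columns mapping from the framework's roles to your field names) follows the alpaca-style convention as I understand it, so verify the exact schema against the framework documentation before relying on it:

```python
import json
from pathlib import Path

# Register a proprietary dataset in dataset_info.json (alpaca-style columns).
# The entry schema here is an assumption about the framework's format;
# confirm the exact keys against the LLaMA-Factory documentation.
def register_dataset(info_path: str, name: str, file_name: str) -> dict:
    path = Path(info_path)
    info = json.loads(path.read_text()) if path.exists() else {}
    info[name] = {
        "file_name": file_name,
        "columns": {
            "prompt": "instruction",
            "query": "input",
            "response": "output",
        },
    }
    path.write_text(json.dumps(info, indent=2))
    return info

entry = register_dataset("dataset_info.json", "my_domain_data",
                         "my_domain_data.json")
print(entry["my_domain_data"]["file_name"])
```

Once registered, the dataset name can be referenced from training configurations like any built-in dataset.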
When to Use It?
Use Cases
Fine-tune a base model on domain-specific instruction data for a specialized chatbot application. Compare SFT and DPO training methods on the same dataset to determine which produces better aligned outputs. Train a model using the web UI for a team member who needs results without writing training scripts.
Related Topics
LoRA fine-tuning methods, Hugging Face Trainer API, model quantization techniques, RLHF and DPO alignment, and training dataset formatting standards.
Important Notes
Requirements
Python with the LLaMA-Factory package and its dependencies installed. GPU access with sufficient VRAM for the chosen model and training configuration. Training dataset formatted according to the framework dataset specification.
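As a back-of-the-envelope check on the VRAM requirement, the memory needed just to hold the model weights can be estimated from parameter count and precision; this ignores activations, gradients, optimizer state, and the KV cache, so treat it as a floor rather than a budget:

```python
# Rough memory needed to hold model weights alone, in GiB.
# Activations, gradients, optimizer state, and KV cache are not included,
# so real training needs substantially more than this floor.
def weight_memory_gib(num_params_billions: float, bits_per_param: int) -> float:
    bytes_total = num_params_billions * 1e9 * bits_per_param / 8
    return bytes_total / (1024 ** 3)

# A 7B-parameter model in fp16 versus 4-bit quantization (illustrative):
print(round(weight_memory_gib(7, 16), 1))
print(round(weight_memory_gib(7, 4), 1))
```

The gap between the two figures is why 4-bit QLoRA makes large models reachable on consumer GPUs.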
Usage Recommendations
Do: start with small LoRA rank values and increase them based on evaluation results; validate the dataset format with the framework's validation tools before starting training; export trained adapters to a merged model format for simpler production deployment.
Don't: skip the dataset validation step that catches formatting issues before training begins; use full fine-tuning when LoRA achieves comparable results at a fraction of the compute cost; ignore evaluation metrics and rely solely on manual inspection of model outputs.
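The adapter-export recommendation can be sketched as a config builder; the key names (adapter_name_or_path, export_dir, and so on) are assumptions modeled on LLaMA-Factory's configuration style, so confirm them against the framework documentation before use:

```python
import json
from pathlib import Path

# Build a configuration for merging a LoRA adapter into the base model.
# Key names are assumptions based on LLaMA-Factory's config conventions;
# verify them against the framework docs. Paths are illustrative.
def make_export_config(base_model: str, adapter_dir: str,
                       export_dir: str) -> dict:
    return {
        "model_name_or_path": base_model,
        "adapter_name_or_path": adapter_dir,
        "finetuning_type": "lora",
        "export_dir": export_dir,
    }

cfg = make_export_config("meta-llama/Llama-2-7b-hf",
                         "saves/llama2-7b-lora",
                         "exported/llama2-7b-merged")
Path("export_config.json").write_text(json.dumps(cfg, indent=2))
```

Deploying the merged model instead of a base-model-plus-adapter pair removes a moving part from the serving stack.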
Limitations
Framework updates may lag behind newly released model architectures by days or weeks. Custom training loop modifications require modifying framework source code rather than configuration alone. The web UI provides convenience but offers fewer customization options than the command-line interface for advanced training scenarios.