PEFT

Automate and integrate parameter-efficient fine-tuning workflows using PEFT

PEFT is a community skill for parameter-efficient fine-tuning of large language models using the HuggingFace PEFT library. It covers LoRA adapters, prefix tuning, prompt tuning, adapter configuration, and model merging for efficient LLM customization.

What Is This?

Overview

PEFT provides tools for fine-tuning large models by training only a small number of additional parameters instead of updating all weights. It covers LoRA adapters that add low-rank decomposition matrices to attention layers for efficient weight adaptation, prefix tuning that prepends trainable vectors to transformer layer inputs, prompt tuning that learns soft prompt embeddings concatenated to input tokens, adapter configuration that sets rank, target modules, and scaling parameters for each method, and model merging that combines trained adapters with base models for deployment. The skill enables teams to customize large models with minimal compute.
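To see why training only the low-rank factors saves so much, here is a dependency-free sketch of the parameter arithmetic behind a rank-16 LoRA update. The 4096-by-4096 projection shape is illustrative, not taken from any particular model.

```python
# LoRA replaces a full d x k weight update with two low-rank factors:
# delta_W = (alpha / r) * B @ A, where B is d x r and A is r x k, r << min(d, k).
d, k, r = 4096, 4096, 16   # illustrative projection shape and adapter rank

full_update_params = d * k        # parameters a full update would train
lora_params = d * r + r * k       # parameters LoRA trains instead

print(f'full: {full_update_params:,}  lora: {lora_params:,}  '
      f'ratio: {full_update_params / lora_params:.0f}x')
# At these shapes, LoRA trains 128x fewer parameters for this matrix.
```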

Who Should Use This

This skill serves ML engineers fine-tuning language models for domain-specific tasks, research teams experimenting with efficient adaptation methods, and organizations customizing foundation models on limited GPU resources.

Why Use It?

Problems It Solves

Full fine-tuning of large models requires GPU memory proportional to the total parameter count, which exceeds the hardware available to most teams. Storing separate copies of fully fine-tuned models for each task consumes excessive disk space. Training all parameters risks catastrophic forgetting of pre-trained capabilities. Sharing fine-tuned models requires distributing multi-gigabyte weight files.

Core Highlights

LoRA builder configures low-rank adapters for targeted weight adaptation. Prefix tuner adds trainable prefix vectors to transformer layers. Prompt learner trains soft prompts for task-specific behavior. Adapter merger combines trained adapters with base model weights for inference.

How to Use It?

Basic Usage

from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

model_name = 'meta-llama/Llama-2-7b-hf'
model = AutoModelForCausalLM.from_pretrained(model_name)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                 # adapter rank
    lora_alpha=32,        # scaling factor (effective scale = alpha / r)
    lora_dropout=0.05,
    target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj'],
)

peft_model = get_peft_model(model, lora_config)

# Report the fraction of parameters that will actually be trained.
trainable = sum(p.numel() for p in peft_model.parameters() if p.requires_grad)
total = sum(p.numel() for p in peft_model.parameters())
print(f'Trainable: {trainable / total * 100:.2f}%')
# peft_model.print_trainable_parameters() prints a similar summary.
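LoRA is not the only method the same get_peft_model entry point accepts. A sketch of the prefix tuning and prompt tuning configs follows; the virtual-token counts are illustrative starting points, not tuned values.

```python
from peft import PrefixTuningConfig, PromptTuningConfig, TaskType

# Prefix tuning: trainable key/value vectors prepended at every layer.
prefix_config = PrefixTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,   # illustrative prefix length
)

# Prompt tuning: soft prompt embeddings concatenated to the input tokens.
prompt_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=8,    # illustrative soft-prompt length
)

# Either config can be passed to get_peft_model in place of LoraConfig.
```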

Real-World Examples

from peft import LoraConfig, PeftModel, get_peft_model


class AdapterManager:
    """Tracks multiple task-specific adapters that share one base model."""

    def __init__(self, base_model):
        self.base = base_model
        self.adapters = {}

    def add_adapter(self, name: str, config: LoraConfig):
        # Wrap the base model with a fresh adapter for this task.
        self.adapters[name] = get_peft_model(self.base, config)

    def save_adapter(self, name: str, path: str):
        # Writes only the adapter weights, not the base model.
        self.adapters[name].save_pretrained(path)

    def load_adapter(self, name: str, path: str):
        # Reattach previously saved adapter weights to the shared base.
        self.adapters[name] = PeftModel.from_pretrained(self.base, path)

    def merge_adapter(self, name: str):
        # Fold the adapter weights into the base weights for plain inference.
        return self.adapters[name].merge_and_unload()

    def list_adapters(self) -> list[str]:
        return list(self.adapters.keys())

Advanced Tips

Target both attention projection and MLP layers with LoRA for tasks that require broader model adaptation beyond attention patterns. Use higher LoRA rank for complex tasks that need more adapter capacity while balancing against increased memory. Merge adapters into base weights for production inference to eliminate the adapter overhead during serving.
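The merge step is just matrix addition, which is why a merged model has zero adapter overhead at serving time. A dependency-free toy sketch with made-up 2x2 numbers:

```python
# merge_and_unload conceptually computes W_merged = W + (alpha / r) * B @ A,
# after which the forward pass is a single matmul again. Toy rank-1 version:

def matmul(X, Y):
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def add(X, Y, scale=1.0):
    return [[x + scale * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

W = [[1.0, 2.0], [3.0, 4.0]]   # frozen base weight
B = [[1.0], [0.0]]             # d x r factor (r = 1)
A = [[0.5, -0.5]]              # r x k factor
scale = 2.0                    # alpha / r

x = [[1.0], [2.0]]             # a column input

# Adapter path: base output plus scaled low-rank correction (two matmuls).
adapter_out = add(matmul(W, x), matmul(B, matmul(A, x)), scale)

# Merged path: fold the correction into W once, then one plain matmul.
W_merged = add(W, matmul(B, A), scale)
merged_out = matmul(W_merged, x)

print(adapter_out == merged_out)   # the two paths agree exactly
```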

When to Use It?

Use Cases

Fine-tune a language model on domain text using LoRA adapters that train less than one percent of total parameters. Train multiple task-specific adapters that share a single base model to save storage. Adapt a large model for classification using prompt tuning on limited GPU hardware.
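The shared-base-model use case maps onto PEFT's multi-adapter API. A hedged sketch, assuming base_model is an already-loaded base model and the adapter names and paths are made up:

```python
from peft import PeftModel

# Attach the first task adapter to the base model (base_model is assumed
# to be loaded already; names and paths here are illustrative).
model = PeftModel.from_pretrained(
    base_model, 'adapters/summarize', adapter_name='summarize')

# Load a second adapter into the same wrapper; the base weights are shared.
model.load_adapter('adapters/classify', adapter_name='classify')

# Switch the active adapter per request without reloading the base model.
model.set_adapter('classify')
```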

Related Topics

LoRA, parameter-efficient fine-tuning, adapter methods, LLM customization, HuggingFace, prefix tuning, and model adaptation.

Important Notes

Requirements

PEFT Python package from HuggingFace with transformers library. GPU with sufficient memory for the base model plus adapter parameters. Pre-trained base model weights accessible locally or from HuggingFace.

Usage Recommendations

Do: start with default LoRA hyperparameters before tuning rank and alpha values for your specific task. Save only the adapter weights, since they are typically a few megabytes compared to multi-gigabyte base models. Evaluate adapter performance against full fine-tuning baselines to verify quality.
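The save-only-the-adapter recommendation looks like this in practice; a sketch assuming peft_model and base_model already exist, with a made-up output directory:

```python
# save_pretrained on a PEFT model writes only the adapter weights plus
# adapter_config.json -- megabytes, not the multi-gigabyte base model.
peft_model.save_pretrained('out/my-lora-adapter')

# To reuse it, load the base model as usual and attach the saved adapter.
from peft import PeftModel
restored = PeftModel.from_pretrained(base_model, 'out/my-lora-adapter')
```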

Don't: apply adapters to all model layers without profiling, since targeting specific modules is more efficient. Avoid LoRA ranks so high that the parameter count approaches full fine-tuning, which defeats the purpose of the method. Never mix adapter weights trained on different base model versions, since this produces invalid results.

Limitations

LoRA adapters may not match full fine-tuning quality on tasks requiring broad changes to model behavior. Adapter performance is sensitive to rank and target module selection, so some experimentation is required. Some model architectures have limited PEFT method support depending on the library version.