Axolotl

Automate and integrate Axolotl fine-tuning into your machine learning workflows

Axolotl is a community skill for fine-tuning large language models with the Axolotl training framework. It covers LoRA and QLoRA configuration, dataset preparation, training parameter optimization, multi-GPU training, and model merging for efficient LLM customization.

What Is This?

Overview

Axolotl provides patterns for fine-tuning LLMs with minimal code using YAML configuration files. LoRA and QLoRA configuration applies parameter-efficient adapters so models can be fine-tuned without modifying the full model weights. Dataset preparation formats training data from various sources into supported prompt templates. Training parameter optimization configures the learning rate, batch size, gradient accumulation, and scheduler for stable training. Multi-GPU training distributes workloads across GPUs using DeepSpeed or FSDP, and model merging combines LoRA adapters back into base models for deployment. The skill enables ML engineers to fine-tune models efficiently on hardware ranging from a single consumer GPU to multi-node clusters.

Who Should Use This

This skill serves ML engineers fine-tuning open source LLMs for domain-specific tasks, teams customizing language models with proprietary training data, and researchers running fine-tuning experiments with reproducible configurations. It is particularly valuable for teams that need to iterate quickly across multiple experiments without rewriting training infrastructure.

Why Use It?

Problems It Solves

Fine-tuning LLMs requires extensive boilerplate code for data loading, training loops, and optimization. Full fine-tuning demands GPU resources most teams cannot afford. Dataset format differences across providers need custom preprocessing. Reproducing training experiments requires tracking many configuration parameters, and managing these concerns manually across multiple runs introduces significant overhead and error risk.

Core Highlights

YAML configuration defines the entire training pipeline without custom code. QLoRA support enables fine-tuning large models on single consumer GPUs. Dataset adapters handle multiple formats including Alpaca, ShareGPT, and chat templates. DeepSpeed integration scales training across multiple GPUs.
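
As a concrete illustration of the dataset adapters mentioned above, the following sketch shows one record in the common Alpaca convention and one in the common ShareGPT convention, serialized as JSONL. The field values are made up for illustration; only the field names follow the two conventions.

```python
import json

# One record in the common Alpaca convention: instruction/input/output fields.
alpaca_record = {
    "instruction": "Summarize the following text.",
    "input": "Axolotl is a fine-tuning framework driven by YAML configs.",
    "output": "Axolotl fine-tunes LLMs using YAML-configured pipelines.",
}

# One record in the common ShareGPT convention: a list of role-tagged turns.
sharegpt_record = {
    "conversations": [
        {"from": "human", "value": "What does QLoRA do?"},
        {"from": "gpt", "value": "It fine-tunes a 4-bit quantized model via adapters."},
    ],
}

def to_jsonl_line(record: dict) -> str:
    """Serialize a record as one JSONL line, as expected in a *.jsonl dataset file."""
    return json.dumps(record, ensure_ascii=False)

print(to_jsonl_line(alpaca_record))
```

A dataset file is simply one such line per example; the `type` field in the YAML config tells Axolotl which convention to parse.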

How to Use It?

Basic Usage

base_model: meta-llama/Llama-3-8B
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer

load_in_4bit: true
adapter: qlora
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true

datasets:
  - path: data/train.jsonl
    type: alpaca

sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true

micro_batch_size: 2
gradient_accumulation_steps: 4
num_epochs: 3
learning_rate: 0.0002
lr_scheduler: cosine
warmup_steps: 100

output_dir: ./output
logging_steps: 10
save_steps: 500
eval_steps: 500
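
The batch settings above interact: the number of examples seen per optimizer step is micro_batch_size × gradient_accumulation_steps × GPU count. A small sketch of that arithmetic (the single-GPU assumption is illustrative, not part of the config):

```python
def effective_batch_size(micro_batch_size: int,
                         gradient_accumulation_steps: int,
                         num_gpus: int = 1) -> int:
    """Examples per optimizer step = per-device batch x accumulation steps x devices."""
    return micro_batch_size * gradient_accumulation_steps * num_gpus

# Values from the config above, assuming a single GPU.
print(effective_batch_size(2, 4, 1))  # -> 8
```

Raising gradient_accumulation_steps is the usual way to grow the effective batch when VRAM caps micro_batch_size.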

Real-World Examples

base_model: meta-llama/Llama-3-70B
model_type: LlamaForCausalLM

load_in_4bit: true
adapter: qlora
lora_r: 64
lora_alpha: 32
lora_target_modules:
  - q_proj
  - v_proj
  - k_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj

datasets:
  - path: data/chat.jsonl
    type: sharegpt
    conversation: chatml

deepspeed: deepspeed/zero3.json
micro_batch_size: 1
gradient_accumulation_steps: 8
num_epochs: 2
learning_rate: 0.00015
bf16: true
flash_attention: true
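
The `deepspeed: deepspeed/zero3.json` line points at a DeepSpeed JSON config. Axolotl ships reference ZeRO-3 files; a minimal sketch in that spirit might look like the following, where the keys are standard DeepSpeed options and `"auto"` defers sizing to the launcher (exact contents of the shipped file may differ):

```json
{
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "bf16": { "enabled": "auto" },
  "gradient_accumulation_steps": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
```

ZeRO stage 3 partitions parameters, gradients, and optimizer states across GPUs, which is what makes the 70B run above fit.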

Advanced Tips

Enable sample_packing to combine multiple short examples into single sequences for more efficient GPU utilization. Use flash_attention for significantly faster training on long sequences with compatible hardware. Start with a small lora_r value such as 16 or 32 and increase it only if training loss plateaus; this finds the minimum adapter size the task needs. Use gradient checkpointing for large models that do not fit in VRAM with standard training. Merge LoRA weights into the base model after training for simplified deployment without adapter overhead. Setting lora_alpha to twice the value of lora_r is a common starting heuristic that often produces stable results.
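
To see why a small lora_r is usually enough, compare adapter size to full-layer size: a rank-r adapter on a d_in × d_out linear layer adds only r·(d_in + d_out) weights. A quick sketch, with a 4096-wide projection chosen as an illustrative size (not taken from any specific model card):

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """A rank-r LoRA adapter adds two low-rank matrices: (d_in x r) and (r x d_out)."""
    return r * (d_in + d_out)

d_in = d_out = 4096          # illustrative hidden size for a q_proj-style layer
full = d_in * d_out          # parameters in the full linear weight
adapter = lora_params(d_in, d_out, r=32)

print(full, adapter, adapter / full)  # the adapter is under 2% of the full layer
```

Doubling lora_r doubles the adapter, so growing it only when loss plateaus keeps checkpoints small and training cheap.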

When to Use It?

Use Cases

Fine-tune an 8B model on domain-specific question-answer pairs using QLoRA on a single GPU. Train a 70B model across four GPUs with DeepSpeed ZeRO-3. Create a custom chat model using ShareGPT-format conversation data.

Related Topics

LLM fine-tuning, LoRA, QLoRA, DeepSpeed, and model training.

Important Notes

Requirements

CUDA-capable GPU with sufficient VRAM for the target model. Axolotl installed with compatible PyTorch and transformers versions. Training dataset in a supported format like Alpaca or ShareGPT. Sufficient disk space for model checkpoints and logs saved during training.
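
As a rough planning aid for the VRAM requirement, 4-bit quantized weights take about half a byte per parameter; this sketch estimates weight memory alone, and deliberately ignores activations, adapter optimizer state, and CUDA overhead, which add a meaningful margin on top:

```python
def approx_weight_vram_gb(num_params_billions: float, bits_per_param: float = 4) -> float:
    """Approximate VRAM for model weights alone, ignoring activations and overhead."""
    bytes_total = num_params_billions * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

print(round(approx_weight_vram_gb(8), 1))   # ~4.0 GB for an 8B model in 4-bit
print(round(approx_weight_vram_gb(70), 1))  # ~35.0 GB for a 70B model in 4-bit
```

The gap between this floor and your actual GPU memory is what sequence length, batch size, and gradient checkpointing have to fit into.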

Usage Recommendations

Do: start with QLoRA for cost-effective experiments before considering full fine-tuning. Validate training data format before launching long training runs. Monitor training loss and evaluation metrics to detect overfitting early.

Don't: set the learning rate too high, which causes training instability and divergence; skip evaluation steps to save time, since that prevents early detection of quality issues; mix incompatible dataset formats in a single training configuration; or train for too many epochs on small datasets, which causes severe overfitting.
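
The monitoring advice above can be mechanized: a common heuristic flags overfitting when evaluation loss has risen for several consecutive evals. The sketch below is a hypothetical helper, and the patience of three rising evals is an assumption, not an Axolotl setting:

```python
def eval_loss_rising(eval_losses: list[float], patience: int = 3) -> bool:
    """True if the last `patience` eval losses each increased over the previous one,
    a common signal to stop and fall back to an earlier checkpoint."""
    if len(eval_losses) < patience + 1:
        return False
    tail = eval_losses[-(patience + 1):]
    return all(later > earlier for earlier, later in zip(tail, tail[1:]))

print(eval_loss_rising([1.9, 1.5, 1.3, 1.35, 1.4, 1.5]))   # -> True
print(eval_loss_rising([1.9, 1.5, 1.3, 1.25, 1.2, 1.18]))  # -> False
```

Running a check like this against the eval_steps logs catches the small-dataset overfitting failure mode before all num_epochs complete.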

Limitations

QLoRA training quality may not match full fine-tuning for complex tasks. Sample packing can introduce subtle training artifacts with certain model architectures. DeepSpeed configuration requires careful tuning for optimal multi-GPU performance. Training on very long sequences may require gradient checkpointing which trades speed for memory. Merged LoRA adapters cannot be further fine-tuned without re-extracting the adapter weights.