Axolotl
Automate and integrate Axolotl fine-tuning into your machine learning workflows
Axolotl is a community skill for fine-tuning large language models using the Axolotl training framework, covering LoRA and QLoRA configuration, dataset preparation, training parameter optimization, multi-GPU training, and model merging for efficient LLM customization.
What Is This?
Overview
Axolotl provides patterns for fine-tuning LLMs with minimal code, driven by YAML configuration files. LoRA and QLoRA configuration applies parameter-efficient adapters so models can be fine-tuned without modifying the full model weights. Dataset preparation formats training data from various sources into supported prompt templates. Training parameter optimization configures learning rate, batch size, gradient accumulation, and scheduler for stable training. Multi-GPU training distributes workloads across GPUs using DeepSpeed or FSDP, and model merging combines LoRA adapters back into base models for deployment. The skill enables ML engineers to fine-tune models efficiently on hardware ranging from single consumer GPUs to multi-node clusters.
Who Should Use This
This skill serves ML engineers fine-tuning open source LLMs for domain-specific tasks, teams customizing language models with proprietary training data, and researchers running fine-tuning experiments with reproducible configurations. It is particularly valuable for teams that need to iterate quickly across multiple experiments without rewriting training infrastructure.
Why Use It?
Problems It Solves
Fine-tuning LLMs requires extensive boilerplate code for data loading, training loops, and optimization. Full fine-tuning demands GPU resources most teams cannot afford. Dataset format differences across providers need custom preprocessing. Reproducing training experiments requires tracking many configuration parameters, and managing these concerns manually across multiple runs introduces significant overhead and error risk.
Core Highlights
YAML configuration defines the entire training pipeline without custom code. QLoRA support enables fine-tuning large models on single consumer GPUs. Dataset adapters handle multiple formats including Alpaca, ShareGPT, and chat templates. DeepSpeed integration scales training across multiple GPUs.
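As a concrete reference for the Alpaca format named above, each training record is a JSON object with instruction, input, and output fields; a minimal sketch of a data/train.jsonl file (content invented for illustration) looks like:

{"instruction": "Classify the sentiment of this review.", "input": "The checkout flow kept timing out.", "output": "negative"}
{"instruction": "Explain what gradient accumulation does.", "input": "", "output": "It sums gradients over several micro-batches before each optimizer step, simulating a larger batch size."}

A ShareGPT-format sample appears later alongside the multi-GPU example.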
How to Use It?
Basic Usage
base_model: meta-llama/Llama-3-8B
model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer

load_in_4bit: true
adapter: qlora
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true

datasets:
  - path: data/train.jsonl
    type: alpaca

sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true

micro_batch_size: 2
gradient_accumulation_steps: 4
num_epochs: 3
learning_rate: 0.0002
lr_scheduler: cosine
warmup_steps: 100

output_dir: ./output
logging_steps: 10
save_steps: 500
eval_steps: 500
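With the configuration saved as config.yml (a filename chosen here for illustration), training is commonly launched through the Axolotl CLI; exact entry points vary by release, but a widely documented invocation is:

accelerate launch -m axolotl.cli.train config.yml

Recent releases also expose an axolotl train wrapper command that serves the same purpose.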
Real-World Examples

base_model: meta-llama/Llama-3-70B
model_type: LlamaForCausalLM
load_in_4bit: true
adapter: qlora
lora_r: 64
lora_alpha: 32
lora_target_modules:
  - q_proj
  - v_proj
  - k_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj

datasets:
  - path: data/chat.jsonl
    type: sharegpt
    conversation: chatml

deepspeed: deepspeed/zero3.json
micro_batch_size: 1
gradient_accumulation_steps: 8
num_epochs: 2
learning_rate: 0.00015
bf16: true
flash_attention: true
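The sharegpt dataset type above expects multi-turn conversation records; a sketch of what data/chat.jsonl might contain (dialogue invented for illustration):

{"conversations": [{"from": "human", "value": "How do I rotate an API key?"}, {"from": "gpt", "value": "Open Settings, choose API Keys, and click Regenerate; the old key stops working immediately."}]}

The conversation: chatml setting then renders each record with ChatML-style role markers at tokenization time.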
Advanced Tips
Enable sample_packing to combine multiple short examples into single sequences for more efficient GPU utilization. Use flash_attention for significantly faster training on long sequences with compatible hardware. Start with a small lora_r value such as 16 or 32, and increase only if training loss plateaus to find the minimum adapter size needed. Use gradient checkpointing for large models that do not fit in VRAM with standard training. Merge LoRA weights into the base model after training for simplified deployment without adapter overhead. Setting lora_alpha to twice the value of lora_r is a common starting heuristic that often produces stable results.
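For the merge step mentioned in the tips above, Axolotl includes a merge utility; a typical invocation (the --lora_model_dir value here is illustrative) is:

python3 -m axolotl.cli.merge_lora config.yml --lora_model_dir="./output"

The result is a full-weight model that can be loaded at inference time without the PEFT adapter machinery.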
When to Use It?
Use Cases
Fine-tune an 8B model on domain-specific question-answer pairs using QLoRA on a single GPU. Train a 70B model across four GPUs with DeepSpeed ZeRO-3. Create a custom chat model using ShareGPT-format conversation data.
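For the 70B multi-GPU scenario, the deepspeed path set in the YAML is picked up at launch; a sketch for four GPUs, assuming accelerate has already been configured on the machine and a hypothetical config-70b.yml, is:

accelerate launch --num_processes 4 -m axolotl.cli.train config-70b.yml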
Related Topics
LLM fine-tuning, LoRA, QLoRA, DeepSpeed, and model training.
Important Notes
Requirements
CUDA-capable GPU with sufficient VRAM for the target model. Axolotl installed with compatible PyTorch and transformers versions. Training dataset in a supported format like Alpaca or ShareGPT. Sufficient disk space for model checkpoints and logs saved during training.
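Installation steps shift between releases; one documented path at the time of writing (extras names may differ in your version) is:

pip3 install -U packaging setuptools wheel ninja
pip3 install --no-build-isolation axolotl[flash-attn,deepspeed]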
Usage Recommendations
Do: start with QLoRA for cost-effective experiments before considering full fine-tuning. Validate training data format before launching long training runs, as shown in the sketch after these lists. Monitor training loss and evaluation metrics to detect overfitting early.
Don't: set learning rate too high which causes training instability and divergence. Skip evaluation steps to save time as this prevents early detection of quality issues. Mix incompatible dataset formats in a single training configuration. Train for too many epochs on small datasets which causes severe overfitting.
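For the data-validation recommendation above, Axolotl can tokenize and preview a dataset before a long run; a hedged sketch, subject to version differences, is:

python -m axolotl.cli.preprocess config.yml --debug

The --debug flag prints sample rendered prompts so formatting mistakes surface before GPU hours are committed.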
Limitations
QLoRA training quality may not match full fine-tuning for complex tasks. Sample packing can introduce subtle training artifacts with certain model architectures. DeepSpeed configuration requires careful tuning for optimal multi-GPU performance. Training on very long sequences may require gradient checkpointing which trades speed for memory. Merged LoRA adapters cannot be further fine-tuned without re-extracting the adapter weights.