Bitsandbytes
Bitsandbytes integration for optimized low-bit deep learning model training and inference
Bitsandbytes is a community skill for quantizing large language models to lower precision using the bitsandbytes library, covering 8-bit and 4-bit quantization, mixed-precision training, memory optimization, inference acceleration, and integration with Hugging Face Transformers for GPU memory reduction.
What Is This?
Overview
Bitsandbytes provides patterns for reducing the GPU memory usage of large language models through quantization. It covers 8-bit quantization, which converts model weights from FP16 to INT8 with minimal accuracy loss; 4-bit quantization, which uses the NF4 and FP4 data types for more aggressive memory savings; mixed-precision training, which quantizes selected layers while keeping critical weights in higher precision; memory optimization that brings large models within reach of consumer GPUs; and Transformers integration that enables quantized loading through Hugging Face configuration classes. The skill makes it practical to run models on standard hardware that would exceed available GPU memory at full precision.
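As a concrete illustration of the two quantization modes, the sketch below shows the Hugging Face configuration objects for 8-bit and 4-bit loading; the savings noted in the comments are approximate and vary by model.

from transformers import BitsAndBytesConfig
import torch

# 8-bit: roughly 2x weight-memory reduction, minimal accuracy loss
config_8bit = BitsAndBytesConfig(load_in_8bit=True)

# 4-bit NF4: roughly 4x reduction; matmuls run in bfloat16
config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.bfloat16,
)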
Who Should Use This
This skill serves ML engineers deploying large models on limited GPU hardware, researchers fine-tuning foundation models with constrained VRAM, and teams running inference on consumer-grade GPUs where full-precision models do not fit. It is particularly relevant for practitioners working with models in the 7B to 70B parameter range.
Why Use It?
Problems It Solves
Large language models require more GPU memory than most hardware provides at full precision. Running a 7B parameter model in FP16 needs approximately 14 GB of VRAM, which excludes many consumer GPUs. Fine-tuning with full-precision gradients multiplies memory requirements further, often by a factor of three or more once optimizer states are accounted for. Deploying multiple models on a single GPU requires aggressive memory reduction, and gradient checkpointing alone does not reduce memory enough for the largest models.
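The arithmetic behind these figures fits in a few lines; the sketch below is illustrative and ignores activations and framework overhead.

# Weight memory for a 7B-parameter model at different precisions
params = 7e9
for name, bytes_per_param in [('FP16', 2), ('INT8', 1), ('NF4', 0.5)]:
    print(f'{name} weights: {params * bytes_per_param / 1e9:.1f} GB')

# Full FP16 fine-tuning with Adam: 2 (weights) + 2 (gradients)
# + 8 (optimizer states) bytes per parameter
print(f'FP16 fine-tuning: ~{params * 12 / 1e9:.0f} GB')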
Core Highlights
8-bit loading reduces model memory by approximately half with negligible accuracy loss. 4-bit NF4 quantization reduces memory by 75 percent, enabling 7B models to run on 6 GB GPUs. QLoRA integration combines 4-bit quantization with LoRA for memory-efficient fine-tuning. The Hugging Face BitsAndBytesConfig class provides one-line quantization setup.
How to Use It?
Basic Usage
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)
import torch

# 4-bit NF4 quantization with bfloat16 compute and double quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model_id = 'meta-llama/Llama-2-7b-hf'
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Quantization is applied while the weights load
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map='auto',
)

mem = torch.cuda.memory_allocated()
print(f'GPU memory: {mem / 1e9:.1f} GB')
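A quick generation call confirms the quantized model works end to end; this is a minimal sketch using the model and tokenizer from above, with an illustrative prompt.

# Sanity-check generation on the quantized model
prompt = 'Quantization reduces memory by'
inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(output[0], skip_special_tokens=True))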
Real-World Examples
from peft import (
    LoraConfig,
    get_peft_model,
    prepare_model_for_kbit_training,
)

# Prepare the quantized model for training
# (casts layer norms, enables input gradients)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=['q_proj', 'v_proj', 'k_proj', 'o_proj'],
    lora_dropout=0.05,
    bias='none',
    task_type='CAUSAL_LM',
)
model = get_peft_model(model, lora_config)

# Only the LoRA adapter weights are trainable
trainable = sum(
    p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f'Trainable: {trainable / total:.2%}')
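From here, training proceeds as usual. A minimal sketch with the Transformers Trainer, assuming a pre-tokenized dataset named dataset; the hyperparameters are illustrative, not recommendations.

from transformers import (
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir='qlora-out',
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        learning_rate=2e-4,
        max_steps=100,
    ),
    train_dataset=dataset,  # assumed: a pre-tokenized dataset
    data_collator=DataCollatorForLanguageModeling(
        tokenizer, mlm=False),
)
trainer.train()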
Advanced Tips
Enable double quantization with bnb_4bit_use_double_quant to compress the quantization constants themselves, saving roughly an additional 0.4 bits per parameter. Use bfloat16 as the compute dtype on Ampere or newer GPUs for better numerical stability during 4-bit inference. Set device_map to auto so the library distributes layers across available GPUs and CPU memory automatically. After quantizing, benchmark perplexity on a validation set to verify that accuracy remains acceptable.
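A minimal perplexity check, reusing the model and tokenizer from the earlier examples and assuming a list of held-out strings named eval_texts; a real evaluation would use a proper dataset and batching.

import math
import torch

model.eval()
losses = []
with torch.no_grad():
    for text in eval_texts:  # assumed: held-out validation strings
        enc = tokenizer(text, return_tensors='pt').to(model.device)
        out = model(**enc, labels=enc['input_ids'])
        losses.append(out.loss.item())
print(f'Perplexity: {math.exp(sum(losses) / len(losses)):.2f}')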
When to Use It?
Use Cases
Load a 7B parameter model on a consumer GPU with 8GB VRAM using 4-bit quantization. Fine-tune a large model with QLoRA to fit training within limited GPU memory. Deploy quantized models for low-latency inference on cost-effective GPU instances.
Related Topics
Model quantization, QLoRA, mixed-precision training, GPU memory optimization, and Hugging Face Transformers.
Important Notes
Requirements
NVIDIA GPU with CUDA support for quantized operations. bitsandbytes library version 0.39 or later. Compatible versions of PyTorch and Transformers. Sufficient disk space to download the full-precision model weights, since quantization is applied at load time.
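A quick environment sanity check before loading quantized models (assumes the packages are installed):

import bitsandbytes
import torch
import transformers

assert torch.cuda.is_available(), 'quantized ops require a CUDA GPU'
print('bitsandbytes:', bitsandbytes.__version__)
print('transformers:', transformers.__version__)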
Usage Recommendations
Do: start with 8-bit quantization and move to 4-bit only if memory is still insufficient. Evaluate model output quality after quantization before deploying to production. Use the NF4 quantization type for language models, as it is optimized for normally distributed weights.
Don't: apply 4-bit quantization to small models where full precision fits in available memory, since the accuracy loss is unnecessary. Skip validation after quantization on the assumption that output quality is preserved. Mix bitsandbytes versions across training and inference environments.
Limitations
Quantization introduces small accuracy degradation that varies by model and task. 4-bit inference can be slower than FP16 on some GPU architectures due to dequantization overhead. AMD GPUs have limited bitsandbytes support compared to NVIDIA CUDA devices. Quantized models cannot be further fine-tuned with standard full-parameter training and require parameter-efficient methods like LoRA.