GPTQ
Seamlessly automate and integrate GPTQ model quantization into your pipelines
GPTQ is a community skill for applying GPTQ post-training quantization to large language models, covering calibration data preparation, quantization configuration, accuracy evaluation, model export, and inference integration for efficient model deployment.
What Is This?
Overview
GPTQ provides tools for quantizing large language models with the GPTQ algorithm, which achieves high compression ratios with minimal accuracy loss. It covers calibration data preparation, which selects representative text samples to guide the quantization process; quantization configuration, which sets bit width, group size, and layer-wise quantization parameters; accuracy evaluation, which measures perplexity and task performance before and after quantization to verify quality; model export, which saves quantized weights in formats compatible with inference frameworks such as vLLM and text-generation-inference; and inference integration, which loads quantized models with the appropriate dequantization kernels for accelerated generation. The skill enables practitioners to reduce model memory footprint while preserving output quality.
Who Should Use This
This skill serves ML engineers deploying large models on memory-constrained GPUs, model optimization researchers comparing quantization techniques, and inference platform teams reducing serving costs through model compression.
Why Use It?
Problems It Solves
Large language models require multiple GPUs for inference at full precision, which increases serving costs substantially. Naive round-to-nearest quantization degrades model quality significantly at low bit widths. Different inference frameworks expect quantized models in specific formats, requiring careful export configuration. Selecting appropriate calibration data and quantization parameters requires systematic experimentation to find the best quality-size tradeoff.
Core Highlights
The Calibrator prepares representative text datasets to guide weight quantization decisions. The Quantizer applies the GPTQ algorithm with configurable group size and bit width per layer. The Evaluator measures perplexity on held-out data to verify quantization quality. The Exporter saves quantized models in formats compatible with popular inference engines.
How to Use It?
Basic Usage
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer


class GPTQQuantizer:
    def __init__(self, model_name: str, bits: int = 4, group_size: int = 128):
        # Load the tokenizer and the full-precision model with a GPTQ quantization config.
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.config = BaseQuantizeConfig(bits=bits, group_size=group_size, desc_act=False)
        self.model = AutoGPTQForCausalLM.from_pretrained(model_name, self.config)

    def quantize(self, calibration_data: list[str]):
        # Tokenize the calibration texts and run GPTQ layer-by-layer quantization.
        examples = [
            self.tokenizer(text, return_tensors='pt')
            for text in calibration_data
        ]
        self.model.quantize(examples)

    def save(self, output_dir: str):
        # Write the quantized weights and tokenizer files for later inference.
        self.model.save_quantized(output_dir)
        self.tokenizer.save_pretrained(output_dir)
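A minimal usage sketch of the class above; the model name, calibration texts, and output directory are placeholders, and the quantization step itself requires a CUDA GPU with enough memory for the full-precision model:

# Hypothetical run; swap in your own model name and calibration texts.
quantizer = GPTQQuantizer('facebook/opt-125m', bits=4, group_size=128)
quantizer.quantize([
    'GPTQ quantizes weights layer by layer using a small calibration set.',
    'Calibration text should resemble the prompts the model will see in production.',
])
quantizer.save('opt-125m-gptq-4bit')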
Real-World Examples
from datasets import load_dataset


class CalibrationBuilder:
    def __init__(self, tokenizer, max_length: int = 2048, n_samples: int = 128):
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.n_samples = n_samples

    def from_wikitext(self) -> list[str]:
        # Pull reasonably long passages from WikiText-2 as generic calibration text.
        dataset = load_dataset('wikitext', 'wikitext-2-raw-v1', split='train')
        texts = [row['text'] for row in dataset if len(row['text']) > 100]
        return texts[:self.n_samples]

    def from_custom(self, texts: list[str]) -> list[str]:
        # Keep only domain texts that fit within the configured context length.
        filtered = []
        for text in texts:
            tokens = self.tokenizer(text)
            if len(tokens['input_ids']) <= self.max_length:
                filtered.append(text)
        return filtered[:self.n_samples]
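The two classes compose into a simple end-to-end pipeline. The sketch below assumes the GPTQQuantizer and CalibrationBuilder definitions above; the model name and output directory are placeholders:

# End-to-end sketch: build a WikiText calibration set, quantize, and save.
quantizer = GPTQQuantizer('facebook/opt-1.3b', bits=4, group_size=128)
builder = CalibrationBuilder(quantizer.tokenizer, max_length=2048, n_samples=128)
calibration_texts = builder.from_wikitext()
quantizer.quantize(calibration_texts)
quantizer.save('opt-1.3b-gptq-4bit')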
Advanced Tips
Use calibration data that is representative of the target inference domain; calibrating on generic text may produce suboptimal results for specialized applications such as code or medical text. Enable activation-order quantization (desc_act=True in BaseQuantizeConfig) for improved quality at the cost of slower quantization. Compare perplexity across group sizes to find the setting that balances model size against accuracy retention.
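One way to compare group sizes is to measure perplexity on held-out text with each quantized variant. The helper below is a rough sketch using plain PyTorch and Hugging Face tokenization, not the skill's built-in evaluator; the held-out texts are assumed to come from your own evaluation set:

import torch

def perplexity(model, tokenizer, texts: list[str], max_length: int = 2048) -> float:
    # Token-weighted average negative log-likelihood over held-out texts, exponentiated.
    model.eval()
    nlls, n_tokens = [], 0
    for text in texts:
        enc = tokenizer(text, return_tensors='pt', truncation=True, max_length=max_length)
        input_ids = enc['input_ids'].to(model.device)
        with torch.no_grad():
            loss = model(input_ids, labels=input_ids).loss
        nlls.append(loss * input_ids.numel())
        n_tokens += input_ids.numel()
    return torch.exp(torch.stack(nlls).sum() / n_tokens).item()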
When to Use It?
Use Cases
Quantize a large language model to 4-bit weights for deployment on a single consumer GPU. Prepare calibration datasets from domain-specific text for targeted quantization quality. Export a GPTQ model in safetensors format for serving with vLLM or text-generation-inference.
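For deployment, a saved checkpoint can be loaded back with AutoGPTQ's from_quantized and used for generation. The path and prompt below are placeholders, and serving frameworks such as vLLM or text-generation-inference have their own flags for loading GPTQ checkpoints:

from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

# Load a previously exported 4-bit checkpoint onto a single GPU (directory is hypothetical).
model = AutoGPTQForCausalLM.from_quantized('opt-1.3b-gptq-4bit', device='cuda:0')
tokenizer = AutoTokenizer.from_pretrained('opt-1.3b-gptq-4bit')

inputs = tokenizer('Quantization reduces memory by', return_tensors='pt').to('cuda:0')
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))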
Related Topics
Model quantization, GPTQ algorithm, weight compression, inference optimization, model deployment, calibration data, and GPU memory optimization.
Important Notes
Requirements
The AutoGPTQ library with CUDA support for GPU-accelerated quantization. Sufficient GPU memory to load the full-precision model during quantization. A calibration dataset with representative text samples.
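As a quick preflight, a short check of CUDA availability and free GPU memory can catch under-provisioned machines before a long quantization run; this is a sketch, and the memory you actually need depends on the model size:

import torch

# Preflight check: quantization needs a CUDA device and enough free memory
# to hold the full-precision model plus calibration activations.
assert torch.cuda.is_available(), 'AutoGPTQ quantization expects a CUDA-capable GPU'
free_bytes, total_bytes = torch.cuda.mem_get_info()
print(f'Free GPU memory: {free_bytes / 1e9:.1f} GB of {total_bytes / 1e9:.1f} GB')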
Usage Recommendations
Do: evaluate quantized model perplexity on held-out data before deployment to verify acceptable quality. Use at least 128 calibration samples for stable quantization results. Test quantized model outputs on representative prompts from the target use case.
Don't: use random or empty text for calibration since the data directly affects quantization quality decisions. Quantize below 4 bits without carefully evaluating output degradation on downstream tasks. Assume quantization quality transfers across model families since different architectures respond differently to compression.
Limitations
GPTQ quantization requires a full forward pass through the model with calibration data, which needs substantial GPU memory. Quantized models require compatible inference kernels, and not all frameworks support all GPTQ configurations. Quality degradation at very low bit widths may be unacceptable for tasks requiring precise reasoning or factual accuracy.