HQQ

Implement Half-Quadratic Quantization workflows for efficient model compression and automated deployment

HQQ is a community skill for applying Half-Quadratic Quantization to language models, covering quantization configuration, calibration-free compression, mixed-precision settings, inference integration, and quality benchmarking for efficient model deployment.

What Is This?

Overview

HQQ provides tools for quantizing language models using the Half-Quadratic Quantization method, which achieves competitive quality without requiring calibration data. The skill covers quantization configuration, which sets bit-width and group-size parameters for weight compression; calibration-free compression, which quantizes models without representative input data and so reduces setup complexity; mixed-precision settings, which assign different quantization levels to layers based on sensitivity analysis; inference integration, which loads quantized models with custom CUDA kernels for accelerated generation; and quality benchmarking, which evaluates perplexity and task accuracy to verify quantization quality across configurations. Together these let practitioners compress models rapidly without preparing a calibration dataset.

Who Should Use This

This skill serves ML engineers seeking fast model quantization without calibration overhead, researchers comparing quantization methods across architectures, and deployment teams optimizing models for memory-constrained inference.

Why Use It?

Problems It Solves

Calibration-based quantization methods like GPTQ require preparing representative datasets, which adds setup time and introduces dataset bias into the quantization process. Uniform quantization across all layers wastes precision on less sensitive layers while under-allocating bits to critical layers. Integrating quantized models into inference pipelines requires compatible dequantization kernels, which may not be available for all configurations. Quality evaluation across different bit-width and group-size combinations is necessary but tedious to automate.

Core Highlights

The quantizer compresses models without calibration data using the half-quadratic optimization approach. The mixed-precision assigner allocates different bit widths per layer based on weight-distribution sensitivity. The kernel loader provides CUDA dequantization for efficient inference. The benchmark runner systematically evaluates perplexity across quantization configurations.
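
The kernel loader corresponds to backend selection in the hqq package. As a minimal sketch (assuming hqq's HQQLinear.set_backend API), switching the dequantization backend looks like this:

from hqq.core.quantize import HQQLinear, HQQBackend

# Choose the dequantization backend used by all HQQLinear layers.
# PYTORCH is the portable default; ATEN targets fused CUDA kernels
# when they are built for the chosen bit width and group size.
HQQLinear.set_backend(HQQBackend.PYTORCH)
# HQQLinear.set_backend(HQQBackend.ATEN)  # if the CUDA kernels are available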

How to Use It?

Basic Usage

from hqq.core.quantize import HQQLinear, HQQBackend  # backend/kernel selection
from hqq.models.hf.base import AutoHQQHFModel

class HQQQuantizer:
    def __init__(self, model_name: str, bits: int = 4, group_size: int = 64):
        self.model_name = model_name
        # Weight-only quantization settings: channel-wise scaling plus
        # half-quadratic optimization of the quantization parameters.
        self.quant_config = {
            'weight_quant_params': {
                'nbits': bits,
                'channel_wise': True,
                'group_size': group_size,
                'optimize': True,
            }
        }

    def quantize(self):
        # HQQ is calibration-free: no representative dataset is needed,
        # so quantization runs directly on the loaded weights.
        model = AutoHQQHFModel.from_pretrained(self.model_name)
        model.quantize_model(quant_config=self.quant_config)
        return model

    def save(self, model, output_dir: str):
        model.save_quantized(output_dir)
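
A usage sketch for the quantizer above (the model identifier and output path are illustrative):

quantizer = HQQQuantizer(
    'meta-llama/Llama-2-7b-hf', bits=4, group_size=64)
model = quantizer.quantize()
quantizer.save(model, './llama2-7b-hqq-4bit')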

Real-World Examples

class MixedPrecisionHQQ:
    # Layers kept at a higher bit width because they are most sensitive
    # to quantization error.
    SENSITIVE_LAYERS = [
        'lm_head',
        'model.layers.0',
        'model.layers.1',
    ]

    def build_config(self, n_layers: int, default_bits: int = 4,
                     sensitive_bits: int = 8) -> dict:
        config = {}
        # lm_head is listed as sensitive but is not one of the numbered
        # transformer layers, so include it explicitly.
        layer_names = ['lm_head'] + [
            f'model.layers.{i}' for i in range(n_layers)]
        for layer_name in layer_names:
            bits = (sensitive_bits
                    if layer_name in self.SENSITIVE_LAYERS
                    else default_bits)
            config[layer_name] = {
                'weight_quant_params': {
                    'nbits': bits,
                    'channel_wise': True,
                    'group_size': 64,
                    'optimize': True,
                }
            }
        return config

    def quantize_mixed(self, model_name: str, n_layers: int):
        config = self.build_config(n_layers)
        model = AutoHQQHFModel.from_pretrained(model_name)
        model.quantize_model(quant_config=config)
        return model
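
A usage sketch, assuming a 32-layer model; the reload step assumes hqq's AutoHQQHFModel.from_quantized loader (identifier and path are illustrative):

mixed = MixedPrecisionHQQ()
model = mixed.quantize_mixed('meta-llama/Llama-2-7b-hf', n_layers=32)
model.save_quantized('./llama2-7b-hqq-mixed')

# Later: reload the saved model for inference without re-quantizing.
model = AutoHQQHFModel.from_quantized('./llama2-7b-hqq-mixed')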

Advanced Tips

Profile layer sensitivity by quantizing individual layers at different bit widths and measuring the perplexity change to build an informed mixed-precision configuration. Use the optimize parameter to enable iterative weight refinement, which improves quality at the cost of longer quantization time. Combine HQQ with activation quantization for additional memory savings during inference.
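
A minimal profiling sketch for the first tip, reusing the classes above. It assumes a compute_perplexity helper that you supply (hypothetical, not part of hqq); note that each probe re-quantizes the full model, so restrict bit_options or n_layers for large models:

def profile_layer_sensitivity(model_name, n_layers,
                              bit_options=(2, 3, 4, 8)):
    from my_eval import compute_perplexity  # hypothetical evaluation helper

    results = {}
    builder = MixedPrecisionHQQ()
    for i in range(n_layers):
        probe = f'model.layers.{i}'
        for bits in bit_options:
            # Uniform 4-bit baseline with only the probed layer varied.
            config = builder.build_config(
                n_layers, default_bits=4, sensitive_bits=4)
            config[probe]['weight_quant_params']['nbits'] = bits
            model = AutoHQQHFModel.from_pretrained(model_name)
            model.quantize_model(quant_config=config)
            results[(probe, bits)] = compute_perplexity(model)
    return results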

When to Use It?

Use Cases

Quantize a language model quickly without preparing calibration data for rapid experimentation. Apply mixed precision quantization that preserves quality in sensitive layers while compressing others aggressively. Compare HQQ quality against GPTQ and AWQ on standardized benchmarks.

Related Topics

Model quantization, half-quadratic optimization, weight compression, mixed-precision inference, model deployment, and GPU memory optimization.

Important Notes

Requirements

The hqq Python library with CUDA support for GPU quantization. PyTorch with a compatible CUDA version. Sufficient GPU memory to load the unquantized model during quantization.

Usage Recommendations

Do: evaluate quantized models on downstream tasks rather than relying solely on perplexity as a quality metric. Use the optimize flag for production quantization where quality matters more than speed. Compare group sizes to find the configuration that suits your quality and size requirements (a sweep is sketched after these recommendations).

Don't: assume HQQ results match calibration-based methods on all architectures, since quality varies by model type. Don't skip quality evaluation when changing quantization parameters. Don't use very small group sizes, which increase the memory overhead from quantization metadata.
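
A sketch of the group-size sweep suggested above, again assuming a hypothetical compute_perplexity helper and an illustrative model identifier:

from my_eval import compute_perplexity  # hypothetical evaluation helper

for group_size in (32, 64, 128):
    quantizer = HQQQuantizer(
        'meta-llama/Llama-2-7b-hf', bits=4, group_size=group_size)
    model = quantizer.quantize()
    # Smaller groups track weight variation more closely but store more
    # scale/zero-point metadata; weigh quality against that overhead.
    print(f'group_size={group_size}: ppl={compute_perplexity(model):.2f}')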

Limitations

HQQ quality at very low bit widths may trail calibration-based methods on specific model architectures. Custom CUDA kernels are required for efficient inference and may not support all hardware configurations. Mixed-precision settings require per-model sensitivity analysis, which adds evaluation overhead.