GPTQ
Seamlessly automate and integrate GPTQ model quantization into your pipelines
GPTQ is a community skill for applying GPTQ post-training quantization to large language models, covering calibration data preparation, quantization configuration, accuracy evaluation, model export, and inference integration for efficient model deployment.
What Is This?
Overview
GPTQ provides tools for quantizing large language models with the GPTQ algorithm, which achieves high compression ratios with minimal accuracy loss. It covers calibration data preparation, which selects representative text samples to guide the quantization process; quantization configuration, which sets bit width, group size, and layer-wise quantization parameters; accuracy evaluation, which measures perplexity and task performance before and after quantization to verify quality; model export, which saves quantized weights in formats compatible with inference frameworks such as vLLM and text-generation-inference; and inference integration, which loads quantized models with the appropriate dequantization kernels for accelerated generation. The skill enables practitioners to reduce model memory footprint while preserving output quality.
Who Should Use This
This skill serves ML engineers deploying large models on memory-constrained GPUs, model optimization researchers comparing quantization techniques, and inference platform teams reducing serving costs through model compression.
Why Use It?
Problems It Solves
Large language models require multiple GPUs for inference at full precision, which increases serving costs substantially. Naive round-to-nearest quantization degrades model quality significantly at low bit widths. Different inference frameworks expect quantized models in specific formats, requiring careful export configuration. Selecting appropriate calibration data and quantization parameters requires systematic experimentation to find the best quality-size tradeoff.
Core Highlights
The Calibrator prepares representative text datasets to guide weight quantization decisions. The Quantizer applies the GPTQ algorithm with configurable group size and bit width per layer. The Evaluator measures perplexity on held-out data to verify quantization quality. The Exporter saves quantized models in formats compatible with popular inference engines.
How to Use It?
Basic Usage
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer


class GPTQQuantizer:
    def __init__(self, model_name: str, bits: int = 4, group_size: int = 128):
        # Load the tokenizer and the full-precision model with a GPTQ quantization config.
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.config = BaseQuantizeConfig(bits=bits, group_size=group_size, desc_act=False)
        self.model = AutoGPTQForCausalLM.from_pretrained(model_name, self.config)

    def quantize(self, calibration_data: list[str]):
        # Tokenize the calibration texts and run GPTQ layer-by-layer quantization.
        examples = [
            self.tokenizer(text, return_tensors='pt')
            for text in calibration_data
        ]
        self.model.quantize(examples)

    def save(self, output_dir: str):
        # Write the quantized weights and tokenizer files for later inference.
        self.model.save_quantized(output_dir)
        self.tokenizer.save_pretrained(output_dir)
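A minimal usage sketch of the class above; the model name, calibration texts, and output directory are placeholders, and the quantization step itself requires a CUDA GPU with enough memory for the full-precision model:

# Hypothetical run; swap in your own model name and calibration texts.
quantizer = GPTQQuantizer('facebook/opt-125m', bits=4, group_size=128)
quantizer.quantize([
    'GPTQ quantizes weights layer by layer using a small calibration set.',
    'Calibration text should resemble the prompts the model will see in production.',
])
quantizer.save('opt-125m-gptq-4bit')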
Real-World Examples
from datasets import load_dataset


class CalibrationBuilder:
    def __init__(self, tokenizer, max_length: int = 2048, n_samples: int = 128):
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.n_samples = n_samples

    def from_wikitext(self) -> list[str]:
        # Pull reasonably long passages from WikiText-2 as generic calibration text.
        dataset = load_dataset('wikitext', 'wikitext-2-raw-v1', split='train')
        texts = [row['text'] for row in dataset if len(row['text']) > 100]
        return texts[:self.n_samples]

    def from_custom(self, texts: list[str]) -> list[str]:
        # Keep only domain texts that fit within the configured context length.
        filtered = []
        for text in texts:
            tokens = self.tokenizer(text)
            if len(tokens['input_ids']) <= self.max_length:
                filtered.append(text)
        return filtered[:self.n_samples]
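The two classes compose into a simple end-to-end pipeline. The sketch below assumes the GPTQQuantizer and CalibrationBuilder definitions above; the model name and output directory are placeholders:

# End-to-end sketch: build a WikiText calibration set, quantize, and save.
quantizer = GPTQQuantizer('facebook/opt-1.3b', bits=4, group_size=128)
builder = CalibrationBuilder(quantizer.tokenizer, max_length=2048, n_samples=128)
calibration_texts = builder.from_wikitext()
quantizer.quantize(calibration_texts)
quantizer.save('opt-1.3b-gptq-4bit')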
Advanced Tips
Use calibration data that is representative of the target inference domain; calibrating on generic text may produce suboptimal results for specialized applications such as code or medical text. Enable activation-order quantization (desc_act=True in BaseQuantizeConfig) for improved quality at the cost of slower quantization. Compare perplexity across group sizes to find the setting that balances model size against accuracy retention.
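One way to compare group sizes is to measure perplexity on held-out text with each quantized variant. The helper below is a rough sketch using plain PyTorch and Hugging Face tokenization, not the skill's built-in evaluator; the held-out texts are assumed to come from your own evaluation set:

import torch

def perplexity(model, tokenizer, texts: list[str], max_length: int = 2048) -> float:
    # Token-weighted average negative log-likelihood over held-out texts, exponentiated.
    model.eval()
    nlls, n_tokens = [], 0
    for text in texts:
        enc = tokenizer(text, return_tensors='pt', truncation=True, max_length=max_length)
        input_ids = enc['input_ids'].to(model.device)
        with torch.no_grad():
            loss = model(input_ids, labels=input_ids).loss
        nlls.append(loss * input_ids.numel())
        n_tokens += input_ids.numel()
    return torch.exp(torch.stack(nlls).sum() / n_tokens).item()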
When to Use It?
Use Cases
Quantize a large language model to 4-bit weights for deployment on a single consumer GPU. Prepare calibration datasets from domain-specific text for targeted quantization quality. Export a GPTQ model in safetensors format for serving with vLLM or text-generation-inference.
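For deployment, a saved checkpoint can be loaded back with AutoGPTQ's from_quantized and used for generation. The path and prompt below are placeholders, and serving frameworks such as vLLM or text-generation-inference have their own flags for loading GPTQ checkpoints:

from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

# Load a previously exported 4-bit checkpoint onto a single GPU (directory is hypothetical).
model = AutoGPTQForCausalLM.from_quantized('opt-1.3b-gptq-4bit', device='cuda:0')
tokenizer = AutoTokenizer.from_pretrained('opt-1.3b-gptq-4bit')

inputs = tokenizer('Quantization reduces memory by', return_tensors='pt').to('cuda:0')
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))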
Related Topics
Model quantization, GPTQ algorithm, weight compression, inference optimization, model deployment, calibration data, and GPU memory optimization.
Important Notes
Requirements
The AutoGPTQ library with CUDA support for GPU-accelerated quantization. Sufficient GPU memory to load the full-precision model during quantization. A calibration dataset with representative text samples.
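As a quick preflight, a short check of CUDA availability and free GPU memory can catch under-provisioned machines before a long quantization run; this is a sketch, and the memory you actually need depends on the model size:

import torch

# Preflight check: quantization needs a CUDA device and enough free memory
# to hold the full-precision model plus calibration activations.
assert torch.cuda.is_available(), 'AutoGPTQ quantization expects a CUDA-capable GPU'
free_bytes, total_bytes = torch.cuda.mem_get_info()
print(f'Free GPU memory: {free_bytes / 1e9:.1f} GB of {total_bytes / 1e9:.1f} GB')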
Usage Recommendations
Do: evaluate quantized model perplexity on held-out data before deployment to verify acceptable quality. Use at least 128 calibration samples for stable quantization results. Test quantized model outputs on representative prompts from the target use case.
Don't: use random or empty text for calibration since the data directly affects quantization quality decisions. Quantize below 4 bits without carefully evaluating output degradation on downstream tasks. Assume quantization quality transfers across model families since different architectures respond differently to compression.
Limitations
GPTQ quantization requires a full forward pass through the model with calibration data, which needs substantial GPU memory. Quantized models require compatible inference kernels, and not all frameworks support all GPTQ configurations. Quality degradation at very low bit widths may be unacceptable for tasks requiring precise reasoning or factual accuracy.