TensorRT-LLM

TensorRT-LLM automation, integration, and high-performance inference optimization workflows

TensorRT-LLM is a community skill for optimizing and deploying large language models using NVIDIA TensorRT, covering model compilation, quantization, inference serving, and performance profiling for production deployments.

What Is This?

Overview

TensorRT-LLM provides patterns for accelerating language model inference using NVIDIA TensorRT optimization. It covers model conversion from Hugging Face formats to TensorRT engines, quantization with INT8 and FP8 precision, KV-cache optimization, continuous batching for serving, and performance benchmarking. The skill enables teams to deploy language models with significantly lower latency and higher throughput compared to unoptimized inference.

Who Should Use This

This skill serves infrastructure engineers deploying LLMs to production with strict latency requirements, teams optimizing inference costs by maximizing GPU utilization, and developers building high-throughput model serving systems on NVIDIA hardware.

Why Use It?

Problems It Solves

Out-of-the-box PyTorch inference does not exploit the GPU-specific optimizations TensorRT provides. High per-request latency makes interactive applications feel unresponsive. Low GPU utilization during inference wastes expensive hardware capacity. Serving many concurrent users requires batching strategies that naive implementations lack.

Core Highlights

Graph-level optimization fuses operations and selects optimal GPU kernels for each layer. Quantization reduces model precision to INT8 or FP8 for faster computation with minimal quality loss. KV-cache management handles paged attention for efficient memory allocation during generation. Continuous batching interleaves requests to maximize GPU utilization across concurrent users.
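
To make the KV-cache point concrete, the sketch below estimates how much memory the cache consumes per token and how many concurrent sequences fit in a given budget. The model dimensions and memory budget are illustrative assumptions, not values read from any real engine.

# Rough KV-cache sizing sketch (illustrative numbers, not from a real engine).
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_elem: int = 2) -> int:
    # Keys and values are both cached for every layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

def max_concurrent_sequences(gpu_budget_gb: float, seq_len: int,
                             per_token_bytes: int) -> int:
    budget = gpu_budget_gb * 1024 ** 3
    return int(budget // (seq_len * per_token_bytes))

# Example: a hypothetical 7B-class model with full attention in FP16.
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
print(max_concurrent_sequences(gpu_budget_gb=20, seq_len=2048,
                               per_token_bytes=per_token))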

How to Use It?

Basic Usage

from dataclasses import dataclass
from pathlib import Path

@dataclass
class TRTConfig:
    model_dir: str
    output_dir: str
    dtype: str = "float16"
    max_batch_size: int = 8
    max_input_len: int = 2048
    max_output_len: int = 512
    quantization: str = "none"
    tensor_parallelism: int = 1

class TRTLLMBuilder:
    """Assembles a trtllm-build command line from a TRTConfig."""

    def __init__(self, config: TRTConfig):
        self.config = config

    def build_command(self) -> list[str]:
        # Flag names mirror the trtllm-build CLI; confirm them against the
        # installed TensorRT-LLM release, since options change between versions.
        cmd = [
            "trtllm-build",
            "--checkpoint_dir", self.config.model_dir,
            "--output_dir", self.config.output_dir,
            "--max_batch_size", str(self.config.max_batch_size),
            "--max_input_len", str(self.config.max_input_len),
            "--max_output_len", str(self.config.max_output_len),
        ]
        if self.config.quantization != "none":
            cmd.extend(["--quantization", self.config.quantization])
        if self.config.tensor_parallelism > 1:
            cmd.extend(["--tp_size",
                        str(self.config.tensor_parallelism)])
        return cmd

    def validate_output(self) -> dict:
        output_path = Path(self.config.output_dir)
        engine_files = list(output_path.glob("*.engine"))
        return {"valid": len(engine_files) > 0,
                "engines": [f.name for f in engine_files]}

Real-World Examples

from dataclasses import dataclass

@dataclass
class BenchmarkResult:
    tokens_per_second: float
    time_to_first_token: float
    latency_p50: float
    latency_p99: float
    gpu_utilization: float

class InferenceProfiler:
    def __init__(self, engine_dir: str):
        self.engine_dir = engine_dir
        self.results: list[BenchmarkResult] = []

    def run_benchmark(self, prompts: list[str],
                      batch_size: int) -> BenchmarkResult:
        # Placeholder figures that stand in for a real measurement pass;
        # swap in actual timings collected against the built engine.
        simulated_tps = 150.0 * batch_size / 8
        result = BenchmarkResult(
            tokens_per_second=round(simulated_tps, 1),
            time_to_first_token=0.05,
            latency_p50=round(100 / simulated_tps * 1000, 1),
            latency_p99=round(100 / simulated_tps * 1500, 1),
            gpu_utilization=min(0.95, batch_size * 0.12)
        )
        self.results.append(result)
        return result

    def compare_configs(self) -> list[dict]:
        return [{"tps": r.tokens_per_second,
                 "ttft": r.time_to_first_token,
                 "p99": r.latency_p99}
                for r in self.results]
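
A sketch of how the profiler above might be driven across batch sizes; the engine directory and prompt set are stand-ins for a real workload.

profiler = InferenceProfiler(engine_dir="./engines/llama-7b-fp16")  # hypothetical path
prompts = ["Summarize this support ticket."] * 64                   # stand-in workload
for batch_size in (1, 4, 8, 16):
    profiler.run_benchmark(prompts, batch_size=batch_size)
for row in profiler.compare_configs():
    print(row)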

Advanced Tips

Use tensor parallelism across multiple GPUs for models that do not fit in a single GPU's memory. Profile time-to-first-token separately from steady-state generation throughput, since the two respond to different optimizations. Test quantized engines against the original model on representative prompts to verify quality retention.
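
One way to approach the quality check in the last tip is a simple prompt-level comparison between the original and quantized engines. The sketch below assumes you supply your own generation callables for each engine; exact-match is a deliberately crude proxy, and task-specific metrics are better when available.

from typing import Callable

def quality_retention(prompts: list[str],
                      generate_baseline: Callable[[str], str],
                      generate_quantized: Callable[[str], str]) -> float:
    # Fraction of prompts where both engines produce identical output.
    matches = sum(
        generate_baseline(p).strip() == generate_quantized(p).strip()
        for p in prompts
    )
    return matches / max(len(prompts), 1)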

When to Use It?

Use Cases

Deploy a production chatbot backend that handles hundreds of concurrent users with low latency. Optimize an LLM serving endpoint to reduce GPU costs by increasing throughput per device. Build a real-time code completion service that requires sub-second response times.

Related Topics

NVIDIA TensorRT optimization, model quantization, inference serving with Triton, GPU kernel optimization, and LLM deployment architectures.

Important Notes

Requirements

NVIDIA GPUs with TensorRT-LLM compatible drivers and CUDA toolkit installed. Model checkpoints in a supported format for conversion. Sufficient GPU memory for the target model size and batch configuration.
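
A minimal preflight sketch that checks for a visible NVIDIA GPU before attempting an engine build; it only shells out to nvidia-smi and reports what it finds.

import shutil
import subprocess

def gpu_preflight() -> list[str]:
    # Returns one "name, memory" line per visible GPU, or an empty list.
    if shutil.which("nvidia-smi") is None:
        return []
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    return [line.strip() for line in out.stdout.splitlines() if line.strip()]

print(gpu_preflight())  # e.g. ['NVIDIA A100-SXM4-80GB, 81920 MiB']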

Usage Recommendations

Do: benchmark latency and throughput at the expected production batch sizes before deployment. Use FP8 quantization on supported hardware for the best performance-quality tradeoff. Configure the maximum input and output lengths to match actual usage patterns rather than theoretical maximums.
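
To act on the last recommendation, engine limits can be derived from observed request lengths rather than guesses; a sketch under the assumption that you have token counts from request logs.

import math

def suggest_max_len(observed_token_counts: list[int],
                    percentile: float = 0.99,
                    headroom: float = 1.1) -> int:
    # Take a high percentile of real traffic plus a small safety margin.
    ordered = sorted(observed_token_counts)
    idx = min(len(ordered) - 1, math.ceil(percentile * len(ordered)) - 1)
    return int(ordered[idx] * headroom)

# Example with made-up input lengths pulled from request logs.
print(suggest_max_len([120, 340, 900, 1500, 410, 220, 1800, 760]))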

Don't: build engines with maximum batch sizes far larger than production needs, as the unused capacity wastes GPU memory on buffer allocation; skip quality validation after quantization on the assumption that precision reduction is always acceptable; or deploy without profiling memory usage under peak concurrent load.

Limitations

TensorRT-LLM requires NVIDIA GPUs and does not support AMD or other GPU vendors. Engine compilation is specific to the GPU architecture, requiring rebuilds when hardware changes. New model architectures may not be supported immediately after release.