vLLM

vLLM automation, integration, and high-throughput language model serving workflows

vLLM is a community skill for high-throughput LLM inference serving, covering PagedAttention optimization, continuous batching, model serving configuration, quantized inference, and API server deployment for efficient large language model production hosting.

What Is This?

Overview

vLLM provides guidance on deploying large language models for production inference using the vLLM serving engine. It covers PagedAttention optimization, which manages GPU memory for key-value caches with paged allocation to maximize the number of concurrent requests; continuous batching, which dynamically groups incoming requests to keep GPU utilization high without waiting for a batch to complete; model serving configuration, which tunes tensor parallel degree, max model length, and memory utilization for optimal throughput; quantized inference, which loads models in reduced-precision formats such as AWQ and GPTQ so that larger models can be served on limited GPU memory; and OpenAI-compatible API server deployment, which exposes models through standard chat and completion endpoints as a drop-in replacement for existing API integrations. The skill helps engineering teams serve LLMs in production with maximum throughput, minimal latency, and straightforward deployment configuration.
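
As a quick illustration of the engine itself, the sketch below uses vLLM's offline Python API; the model name and sampling values are illustrative, and the serving workflow in the sections that follow runs the same engine behind an OpenAI-compatible HTTP server.

from vllm import LLM, SamplingParams

# Minimal offline-inference sketch; PagedAttention and continuous batching
# are handled internally by the engine.
llm = LLM(
    model='meta-llama/Meta-Llama-3-8B-Instruct',  # illustrative model choice
    max_model_len=4096,
    gpu_memory_utilization=0.9,
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(['Explain continuous batching in one paragraph.'], params)
print(outputs[0].outputs[0].text)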

Who Should Use This

This skill serves ML engineers deploying language models for production inference, platform teams building LLM serving infrastructure, and developers integrating self-hosted models into existing applications that currently rely on external API providers.

Why Use It?

Problems It Solves

Naive LLM inference wastes GPU memory: static KV cache allocation reduces concurrent request capacity. Sequential request processing leaves the GPU idle between completions, wasting expensive compute. Large models exceed the memory of a single GPU, requiring multi-GPU tensor parallelism configuration. Standard inference frameworks do not provide OpenAI-compatible API endpoints for drop-in integration with existing application code and client libraries, forcing teams to write custom adapter layers.

Core Highlights

PagedAttention manages KV cache memory efficiently with paged allocation. Continuous batching groups incoming requests dynamically for high GPU utilization. Tensor parallelism splits large models across multiple GPUs. The API server exposes standard OpenAI-compatible endpoints.

How to Use It?

Basic Usage

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --tensor-parallel-size 1 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.9 \
  --port 8000

curl http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello world"}],
    "max_tokens": 256
  }'

Real-World Examples

from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server
client = OpenAI(
    base_url='http://localhost:8000/v1',
    api_key='not-needed',  # vLLM does not check the key by default
)

response = client.chat.completions.create(
    model='meta-llama/Meta-Llama-3-8B-Instruct',
    messages=[{'role': 'user', 'content': 'Explain RLHF'}],
    max_tokens=512,
    temperature=0.7,
)

print(response.choices[0].message.content)
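
For chat interfaces that need incremental output, the same endpoint also supports streaming. The sketch below reuses the client from the example above; the model and parameters are illustrative.

stream = client.chat.completions.create(
    model='meta-llama/Meta-Llama-3-8B-Instruct',
    messages=[{'role': 'user', 'content': 'Explain RLHF'}],
    max_tokens=512,
    stream=True,  # the server sends tokens as they are generated
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end='', flush=True)
print()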

Advanced Tips

Set gpu-memory-utilization to 0.9 to maximize KV cache capacity while leaving headroom for model weights. Set tensor-parallel-size to match the number of available GPUs for large models that exceed single-GPU memory. Enable quantized model loading with --quantization awq to serve larger models on fewer GPUs. When serving many concurrent users, increase max-num-seqs, which controls the maximum number of sequences processed in a single iteration and directly affects throughput under heavy load. The sketch below shows the same settings as engine arguments.
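
The CLI flags above correspond to engine arguments (dashes become underscores) when constructing the engine from Python. A rough sketch follows; the checkpoint name and values are illustrative, not recommendations.

from vllm import LLM

# Hypothetical tuning configuration; values should come from benchmarking
# on representative workloads.
llm = LLM(
    model='TheBloke/Llama-2-70B-AWQ',  # illustrative AWQ-quantized checkpoint
    quantization='awq',                # load reduced-precision weights to fit on fewer GPUs
    tensor_parallel_size=4,            # match the number of available GPUs
    gpu_memory_utilization=0.9,        # leave headroom beyond model weights
    max_model_len=4096,
    max_num_seqs=256,                  # cap on sequences batched per engine iteration
)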

When to Use It?

Use Cases

Deploy a self-hosted Llama model as an OpenAI-compatible API for internal applications. Serve a quantized 70B model across multiple GPUs with tensor parallelism. Build a high-throughput inference backend for a chatbot handling concurrent users.

Related Topics

LLM inference, model serving, PagedAttention, tensor parallelism, quantization, OpenAI API, and GPU optimization.

Important Notes

Requirements

NVIDIA GPU with sufficient VRAM for the target model weights and KV cache allocation during inference. Python with the vllm package installed and a compatible CUDA toolkit version for GPU kernel execution. Model weights downloaded from the Hugging Face Hub or stored locally in a supported format. For multi-GPU deployments, all GPUs should be connected via NVLink or another high-bandwidth interconnect to minimize tensor parallelism communication overhead.

Usage Recommendations

Do: benchmark throughput with representative workloads to find optimal batch sizes and memory utilization settings (a benchmarking sketch follows these recommendations). Use tensor parallelism for models that do not fit on a single GPU rather than reducing precision unnecessarily. Monitor GPU memory usage during peak load to prevent out-of-memory errors.

Don't: set gpu-memory-utilization to 1.0, since this leaves no headroom for temporary allocations. Don't use more tensor-parallel GPUs than necessary, since communication overhead erodes per-request latency gains. Don't expose the API server directly to the internet without authentication and rate limiting.
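
To benchmark throughput with representative workloads, one rough approach is to fire concurrent requests at the running server and measure generated tokens per second. The sketch below assumes the server and model from Basic Usage; the prompt, request count, and concurrency level are illustrative.

import time
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(base_url='http://localhost:8000/v1', api_key='not-needed')

def one_request(prompt):
    # One chat completion; return the generated-token count reported by the server
    resp = client.chat.completions.create(
        model='meta-llama/Meta-Llama-3-8B-Instruct',
        messages=[{'role': 'user', 'content': prompt}],
        max_tokens=256,
    )
    return resp.usage.completion_tokens

prompts = ['Summarize PagedAttention in two sentences.'] * 32  # illustrative workload

start = time.time()
with ThreadPoolExecutor(max_workers=32) as pool:
    counts = list(pool.map(one_request, prompts))
elapsed = time.time() - start

print(f'{sum(counts) / elapsed:.1f} generated tokens/sec over {len(prompts)} concurrent requests')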

Limitations

vLLM supports specific transformer model architectures and may not work with all Hugging Face models without community contributions. First request latency is higher due to model loading and CUDA kernel compilation warmup. Multi-node serving across separate machines requires additional networking configuration beyond single-node tensor parallelism.