Skip to content

Online Quantization

Online quantization lets you take a BF16/FP16 model and quantize its Linear and MoE weights to lower precision (such as FP8) at load time, without needing a pre-quantized checkpoint or calibration data. Weights are converted during model loading and activations are dynamically scaled during each forward pass.

Quick Start

Pass a scheme name to the quantization parameter:

from vllm import LLM

# Per-tensor FP8 quantization (one scale per weight tensor)
llm = LLM("meta-llama/Llama-3.1-8B", quantization="fp8_per_tensor")

# Per-block FP8 quantization (128x128 block scaling for weights and 1x128 block scaling for activations)
llm = LLM("meta-llama/Llama-3.1-8B", quantization="fp8_per_block")

Or with the CLI:

vllm serve meta-llama/Llama-3.1-8B --quantization fp8_per_tensor
vllm serve meta-llama/Llama-3.1-8B --quantization fp8_per_block

Supported Schemes

Scheme Weight recipe Activation recipe Notes
fp8_per_tensor fp8_e4m3 data, fp32 per-tensor scale fp8_e4m3 data, fp32 per-tensor scale On some GPUs (Ada, Hopper) linear activations use per-token scaling for better performance
fp8_per_block fp8_e4m3 data, fp32 per-128x128-block scale fp8_e4m3 data, fp32 per-1x128-block scale

Support for additional schemes will be added in future versions of vllm.

Advanced Configuration

For fine-grained control, use a quantization_config dictionary.

Separate Schemes for Dense and MoE Layers

You can apply different quantization schemes to dense linear layers and MoE expert layers:

from vllm import LLM

llm = LLM(
    "ibm-granite/granite-3.0-1b-a400m-base",
    quantization="fp8_per_tensor",
    quantization_config={
        "linear_scheme_override": "fp8_per_block",
    },
)

Or,

from vllm import LLM

llm = LLM(
    "ibm-granite/granite-3.0-1b-a400m-base",
    quantization="fp8_per_tensor",
    quantization_config={
        "moe_scheme_override": "fp8_per_block",
    },
)

Excluding Layers from Quantization

Use the ignore parameter to skip specific layers. It accepts exact layer names and regex patterns (prefixed with re:):

from vllm import LLM

llm = LLM(
    "ibm-granite/granite-3.0-1b-a400m-base",
    quantization="fp8_per_tensor",
    quantization_config={
        "ignore": [
            # exact layer name
            "model.layers.1.self_attn.o_proj",
            # regex: skip all QKV projections
            "re:.*[qkv]_proj",
        ],
    },
)

Note

For fused layers (e.g., qkv_proj which fuses q_proj, k_proj, v_proj), the ignore pattern must match the unfused shard names (q_proj, k_proj, v_proj), not the fused name.