Deployment

Precision Strategy

Understand deployment tradeoffs across FP32, TF32, BF16, FP16, FP8, INT8, INT4, GPTQ, AWQ, and GGUF quantization formats. Precision sets the balance between memory, speed, cost, and output quality.

What You Will Learn

  • What each precision means: FP32, TF32, BF16, FP16, FP8, INT8, INT4, GPTQ, AWQ, and GGUF variants.
  • How precision choices map to training, production inference, local laptop deployment, and cloud serving.
  • How to estimate whether a model fits your RAM/VRAM before running it.
  • How to choose a starting precision based on hardware limits and quality priorities.

Precision Families Overview

The precisions covered in this guide fall into four families.

Floating Point

FP32, TF32 (NVIDIA), BF16, FP16, FP8

Integer / Quantized

INT8, INT4 / 4-bit, INT3 / 3-bit, INT2 / 2-bit, 1-bit / Binary

Mixed / Advanced

Mixed Precision (AMP), GPTQ (usually 4-bit), AWQ (usually 4-bit)

Runtime / GGUF Quantization

GGUF Q2_K, GGUF Q3_K_M, GGUF Q4_K_M, GGUF Q5_K_M, GGUF Q6_K, GGUF Q8_0

Precision Comparison

Memory figures assume a 7B-parameter model and include the 1.2x runtime overhead used by the calculator below.

| Precision | Type | Bits | Memory | Best Use | Tag |
|---|---|---|---|---|---|
| FP32 | Floating point | 32 | 33.61 GB | Numerical debugging | Heavy |
| TF32 (NVIDIA) | TensorFloat mode on tensor cores (19-bit format stored in 32 bits) | 19 | 33.61 GB | Training acceleration with minimal code changes | NVIDIA-specific |
| BF16 | Floating point | 16 | 16.81 GB | Large-scale training | Preferred for training |
| Mixed Precision (AMP) | Mixed precision policy (FP32 + FP16/BF16) | 16 | 16.81 GB | Production training | Default for training stacks |
| FP16 | Floating point | 16 | 16.81 GB | GPU inference default | GPU-oriented |
| GGUF Q8_0 | GGUF quantization | 8 | 8.41 GB | CPU inference when quality is the priority and RAM is available | Quality-leaning |
| FP8 | Floating point (E4M3/E5M2 variants) | 8 | 8.41 GB | High-throughput datacenter inference | Specialized |
| GGUF Q6_K | GGUF K-quant | 6.56 | 6.90 GB | Quality-oriented local inference | Quality-leaning |
| INT8 | Integer quantized | 8 | 8.41 GB | Production inference | Balanced |
| AWQ (usually 4-bit) | Activation-aware weight quantization | 4 | 4.21 GB | Quality-sensitive 4-bit inference | Method-sensitive |
| GGUF Q5_K_M | GGUF K-quant | 5.5 | 5.78 GB | Local inference with extra RAM budget for quality | Balanced |
| GPTQ (usually 4-bit) | Post-training quantization | 4 | 4.21 GB | Single-GPU serving of larger models | Method-sensitive |
| INT4 / 4-bit | Integer quantized | 4 | 4.21 GB | Local LLM deployment | Aggressive compression |
| GGUF Q4_K_M | GGUF K-quant | 4.5 | 4.74 GB | Default local LLM deployment on laptops/desktops | Balanced |
| INT3 / 3-bit | Integer quantized | 3 | 3.16 GB | Extreme memory constraints | Experimental |
| GGUF Q3_K_M | GGUF K-quant | 3.44 | 3.62 GB | Low-RAM local inference when Q4 does not fit | Aggressive compression |
| INT2 / 2-bit | Integer quantized | 2 | 2.11 GB | Very constrained edge experiments | Experimental |
| GGUF Q2_K | GGUF K-quant | 2.56 | 2.70 GB | Last-resort fit for very low RAM | Aggressive compression |
| 1-bit / Binary | Binary / ternary-native methods | 1.58 | 1.66 GB | Research | Research only |

Memory Calculator + Runtime Profiles + KV Cache Precision

Formula: runtime memory = weights x 1.2 overhead; KV/context memory = context tokens x 0.000002 GB per token. Active profile: directional default for mixed deployment stacks. KV note: the highest-fidelity KV cache precision carries the largest memory footprint.

Example output (consistent with a 7B-parameter model at FP16):

Weights: 14.00 GB
Runtime: 16.80 GB
KV + Context: 0.01 GB
Total: 16.81 GB
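The calculator's arithmetic can be sketched in a few lines of Python. The 1.2x overhead and 0.000002 GB-per-token context factor come from the formula note above; the 4,096-token context length is an assumption, since the widget's context setting is not shown.

```python
def estimate_memory_gb(params_billion, bits_per_weight, context_tokens=4096,
                       overhead=1.2, context_factor=0.000002):
    """Directional memory estimate mirroring the calculator above."""
    weights = params_billion * bits_per_weight / 8   # raw weight storage, GB
    runtime = weights * overhead                     # 1.2x runtime overhead
    kv = context_tokens * context_factor             # KV cache + context term
    return {"weights": round(weights, 2), "runtime": round(runtime, 2),
            "kv_context": round(kv, 2), "total": round(runtime + kv, 2)}

# A 7B-parameter model at FP16 reproduces the example figures:
print(estimate_memory_gb(7, 16))
# {'weights': 14.0, 'runtime': 16.8, 'kv_context': 0.01, 'total': 16.81}
```

Figures are decimal gigabytes; compare the total against usable (not nominal) RAM/VRAM on your machine.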

Benchmarks + Confidence + Export

| Precision | Workload | Hardware | Tokens/s | Latency / Notes | Confidence |
|---|---|---|---|---|---|
| INT4 / GGUF Q4-class | Text generation (tg 128) | Single CUDA GPU (llama.cpp sample) | 131.42 | Sample benchmark output | locally reproducible recipe |
| INT4 / GGUF Q4-class | Text generation (tg 128) | Single CUDA GPU (llama.cpp sample) | 82.17 | Sample benchmark output | locally reproducible recipe |
| GGUF Q4_K_M | Text generation (tg128 @ context depth 512) | Single CUDA GPU (llama.cpp sample) | 116.71 | Sample benchmark output | locally reproducible recipe |
| FP8 vs FP16 | Mixtral 8x7B throughput vs latency | 2x NVIDIA H100 SXM | 21,000 | Under 0.5 s response limit (plot-based) | source-reported |
| GPTQ (3-4 bit) | End-to-end inference speedup relative to FP16 | A100 and A6000 (paper) | N/A | 3.25x (A100), 4.5x (A6000) speedup bands | source-reported |
| AWQ 4-bit | On-device 4-bit acceleration | Desktop and mobile GPUs (paper) | N/A | More than 3x over HF FP16 implementation | source-reported |
| FP32/FP16/INT8/INT4 | Deployment sizing baseline | Generic planning model | N/A | N/A | example baseline |

Reproducible Benchmark Runner

llama.cpp benchmark (GGUF):

llama-bench -m ./models/model.Q4_K_M.gguf -ngl 99 -p 512 -n 128 -r 5 -o json

Output schema: model, n_prompt_tokens, n_gen_tokens, tps, memory, backend, commit

vLLM throughput run:

python benchmarks/benchmark_throughput.py --model your/model --dtype float16 --output-json benchmark.json

Output schema: model, dtype, request_rate, throughput_tokens_s, latency_p50_ms, latency_p95_ms, gpu

TensorRT-LLM benchmark:

trtllm-bench --engine_dir ./engine_fp8 --input_len 512 --output_len 128 --batch_size 16 --dump_json trt_bench.json

Output schema: engine, precision, batch_size, throughput_tokens_s, latency_ms, gpu_sku
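Results from different runners can be normalized for side-by-side comparison. A minimal sketch, assuming result dicts shaped like the schemas above (`tps` for llama-bench, `throughput_tokens_s` for vLLM and TensorRT-LLM); field names shift between releases, so verify against your runner's actual output.

```python
def normalize_throughput(record):
    """Extract a tokens/sec figure from a runner-specific result dict."""
    for key in ("tps", "throughput_tokens_s"):  # llama-bench vs vLLM/TensorRT-LLM schemas
        if key in record:
            return float(record[key])
    raise KeyError("no throughput field found in record")

# Hypothetical results shaped like the schemas above:
runs = [
    {"model": "model.Q4_K_M.gguf", "backend": "CUDA", "tps": 131.42},
    {"model": "your/model", "dtype": "float16", "throughput_tokens_s": 82.17},
]
best = max(runs, key=normalize_throughput)
print(best["model"])  # model.Q4_K_M.gguf
```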

Accuracy vs Memory Tradeoff

[Chart placeholder: accuracy versus memory footprint across the precisions compared above.]

Deployment Compatibility Matrix

| Stack | Support Summary |
|---|---|
| PyTorch | fp32: native, bf16: native, fp16: native, tf32: nvidia-mode, fp8: partial, int8: via-quantization, int4: via-quantization, gptq: plugin, awq: plugin, gguf_q4_k_m: no-native |
| ONNX Runtime | fp32: native, bf16: native, fp16: native, int8: native, int4: native, gptq: conversion-path, awq: conversion-path |
| TensorRT-LLM | fp16: native, bf16: native, fp8: native, int8: native, int4: native, gptq: conversion-path, awq: conversion-path |
| vLLM | fp16: native, bf16: native, int8: partial, gptq: native, awq: native, fp8: partial |
| llama.cpp | gguf_q2_k: native, gguf_q3_k_m: native, gguf_q4_k_m: native, gguf_q5_k_m: native, gguf_q6_k: native, gguf_q8_0: native, int8: via-gguf, int4: via-gguf, fp16: native |
| Ollama | gguf_q2_k: native, gguf_q3_k_m: native, gguf_q4_k_m: native, gguf_q5_k_m: native, gguf_q6_k: native, gguf_q8_0: native, fp16: model-dependent |
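A matrix like this can be encoded as a lookup table so deployment scripts fail fast on unsupported combinations. A partial sketch transcribing two rows from the matrix above; support levels evolve quickly, so verify against current release notes before relying on it.

```python
# Partial transcription of the compatibility matrix above: stack -> {precision: level}.
SUPPORT = {
    "vllm": {"fp16": "native", "bf16": "native", "int8": "partial",
             "gptq": "native", "awq": "native", "fp8": "partial"},
    "llama.cpp": {"gguf_q4_k_m": "native", "gguf_q8_0": "native", "fp16": "native"},
}

def check_support(stack, precision):
    """Return the support level, or raise if the combination is not confirmed."""
    level = SUPPORT.get(stack, {}).get(precision, "unknown")
    if level in ("no-native", "unknown"):
        raise ValueError(f"{precision} not confirmed for {stack}")
    return level

print(check_support("vllm", "awq"))  # native
```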

Cloud Cost Estimator (Cost per 1M tokens)

Example output: effective throughput 85.42 tokens/sec, $3.577 per 1M tokens, $0.0043 per request.
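The estimator figures are consistent with a simple throughput-to-cost conversion. A sketch, where the $1.10/hour instance rate and ~1,200 tokens per request are hypothetical inputs chosen to roughly reproduce the numbers above:

```python
def cost_per_million_tokens(gpu_usd_per_hour, tokens_per_sec):
    """Convert instance pricing and sustained throughput into $/1M tokens."""
    tokens_per_hour = tokens_per_sec * 3600
    return gpu_usd_per_hour / tokens_per_hour * 1_000_000

per_m = cost_per_million_tokens(1.10, 85.42)   # hypothetical $1.10/hr instance
per_request = per_m / 1_000_000 * 1200         # assume ~1,200 tokens per request
print(round(per_m, 3), round(per_request, 4))  # 3.577 0.0043
```

The same function lets you compare precisions directly: halving memory often allows larger batches and higher tokens/sec, which shows up linearly in cost per token.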

Real-World Scenario Mapping

Laptop CPU User

CPU-first, 16-32GB RAM

Recommended: GGUF Q4_K_M or Q5_K_M

Balances RAM fit and answer quality for local assistants and coding help.

8GB GPU User

Consumer GPU, limited VRAM

Recommended: INT4 / AWQ / GPTQ

Aggressive quantization makes 7B-13B class models feasible.

24GB GPU User

Prosumer GPU or workstation

Recommended: FP16 or INT8, depending on quality target

Enough VRAM to pick quality-first or density-first serving.

Cloud Production Serving

Managed GPU fleet

Recommended: INT8 for density, FP16 for premium quality tiers

Tiered precision policy helps control cost while preserving SLA quality.
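The scenario mapping above can be condensed into a rule-of-thumb picker. A sketch of those heuristics only, not a substitute for measuring on your own workload:

```python
def recommend_precision(vram_gb=0, ram_gb=0, quality_first=False):
    """First-pass pick following the scenario mapping above; validate by measuring."""
    if vram_gb >= 24:                       # prosumer GPU or workstation
        return "FP16" if quality_first else "INT8"
    if vram_gb >= 8:                        # consumer GPU, limited VRAM
        return "INT4 / AWQ / GPTQ"
    if ram_gb >= 16:                        # CPU-first laptop, 16-32 GB RAM
        return "GGUF Q5_K_M" if quality_first else "GGUF Q4_K_M"
    return "GGUF Q3_K_M or a smaller model"

print(recommend_precision(vram_gb=24))                      # INT8
print(recommend_precision(ram_gb=32, quality_first=True))   # GGUF Q5_K_M
```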

Precision Recommender + Copy-Ready Deployment Snippets

Example recommender output:

Recommended precision: INT8. INT8 often gives a strong cost-quality balance in production inference pipelines.

Fit badge: fits 16 GB.

Quality warning: kernel/runtime choice can materially affect realized speed and quality.

Copy-ready snippet (vLLM, FP16 serving example):

python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3-8B-Instruct --dtype float16 --max-model-len 8192

Prompt Regression Pack

Reasoning

Solve this multi-step logic puzzle and show each step clearly before final answer: ...

Pass criteria: No contradictions and final answer matches reference.

Coding

Given this failing unit test and function, produce minimal patch and explain root cause: ...

Pass criteria: Patch is minimal, tests pass, explanation identifies exact bug.

Long Context

Use only the supplied long document and answer with citations from sections.

Pass criteria: No fabricated citations; references map to provided sections.
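A regression pack like this can be run mechanically: each prompt gets a checker, and a precision candidate passes only if every checker does. A minimal sketch with a stub model and trivial checkers; real pass criteria (reference answers, test suites, citation validation) would replace the lambdas.

```python
def run_regression_pack(generate, pack):
    """generate: fn(prompt) -> response; pack: list of (prompt, checker) pairs."""
    failures = [prompt for prompt, check in pack if not check(generate(prompt))]
    return {"passed": len(pack) - len(failures), "failed": failures}

# Toy example: placeholder checkers stand in for the pass criteria above.
pack = [
    ("Solve this multi-step logic puzzle ...", lambda r: "answer" in r.lower()),
    ("Given this failing unit test ...", lambda r: len(r) > 0),
]
stub_model = lambda prompt: "Final answer: 42"
print(run_regression_pack(stub_model, pack))
# {'passed': 2, 'failed': []}
```

Run the same pack once per candidate precision (FP16, INT8, 4-bit) and compare pass counts before promoting a quantized build.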

Response Diff Viewer (Sample)

FP16 sample output

The deployment plan should use canary traffic, monitor p95 latency, and keep an INT8 rollback path with automatic threshold triggers.

INT4 sample output

Deploy canary first. Track latency and quality. If responses get unstable, revert to 8-bit or 16-bit quickly.
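Side-by-side samples like these can also be diffed programmatically to flag drift between precisions. A sketch using the standard library on the two sample outputs above; a real pipeline would diff many prompts and score semantic rather than purely textual change:

```python
import difflib

fp16_out = ("The deployment plan should use canary traffic, monitor p95 latency, "
            "and keep an INT8 rollback path with automatic threshold triggers.")
int4_out = ("Deploy canary first. Track latency and quality. If responses get "
            "unstable, revert to 8-bit or 16-bit quickly.")

# A low character-level similarity ratio flags response pairs worth human review.
similarity = difflib.SequenceMatcher(None, fp16_out, int4_out).ratio()
print(f"textual similarity: {similarity:.2f}")
```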

Sources and Methodology

Last updated: 2026-04-20

Confidence tags distinguish source-reported, example baseline, and locally reproducible recipe rows.

Deployment Checklist

  • Run identical prompt suites across FP16/BF16, INT8, and 4-bit variants.
  • Measure memory, throughput, latency, and quality together for each precision.
  • Validate long-context and tool-calling behavior before rollout.
  • Label benchmark rows by provenance to avoid over-generalizing numbers.
  • Define rollback thresholds and keep a higher-precision fallback ready.

FAQ

Should I always start with INT4 for deployment?

Not always. Start with INT8 or FP16/BF16 when quality risk is high, then downshift only if memory or cost constraints demand it.

Is BF16 better than FP16?

For training, BF16 is often preferred due to numerical range. For inference, FP16 and BF16 are both strong options depending on hardware and runtime support.

Are GGUF quantizations only for CPU?

No. GGUF is popular for CPU-first local inference, but can also run on compatible GPU backends through llama.cpp-based stacks.

How reliable are benchmark numbers across setups?

Throughput and latency vary significantly by hardware, runtime, and prompt shape. Use source tags and reproduce critical tests on your own environment.