Deployment
Precision Strategy
Understand deployment tradeoffs across FP32, TF32, BF16, FP16, INT8, INT4, GPTQ, AWQ, and GGUF quantizations. Precision decides the balance between memory, speed, cost, and output quality.
What You Will Learn
- What each precision means: FP32, TF32, BF16, FP16, FP8, INT8, INT4, GPTQ, AWQ, and GGUF variants.
- How precision choices map to training, production inference, local laptop deployment, and cloud serving.
- How to estimate whether a model fits your RAM/VRAM before running it.
- How to choose a starting precision based on hardware limits and quality priorities.
Precision Families Overview
The formats covered below fall into three broad families:

- Integer / Quantized (INT8, INT4, GPTQ, AWQ)
- Mixed / Advanced (AMP, TF32, FP8)
- Runtime / GGUF Quantization (Q8_0 through Q2_K)
Interactive Precision Comparison
| Precision | Tag | Type | Bits | Est. Memory (7B model) | Best Use |
|---|---|---|---|---|---|
| FP32 | Heavy | Floating point | 32 | 33.61 GB | Numerical debugging |
| TF32 (NVIDIA) | NVIDIA-specific | TensorFloat mode on tensor cores | 19 | 33.61 GB | Training acceleration with minimal code changes |
| BF16 | Preferred for training | Floating point | 16 | 16.81 GB | Large-scale training |
| Mixed Precision (AMP) | Default for training stacks | Mixed precision policy (FP32 + FP16/BF16) | 16 | 16.81 GB | Production training |
| FP16 | GPU-oriented | Floating point | 16 | 16.81 GB | GPU inference default |
| GGUF Q8_0 | Quality-leaning | GGUF quantization | 8 | 8.41 GB | CPU inference when quality is the priority and RAM is available |
| FP8 | Specialized | Floating point (E4M3/E5M2 variants) | 8 | 8.41 GB | High-throughput datacenter inference |
| INT8 | Balanced | Integer quantized | 8 | 8.41 GB | Production inference |
| GGUF Q6_K | Quality-leaning | GGUF K-quant | 6.56 | 6.90 GB | Quality-oriented local inference |
| GGUF Q5_K_M | Balanced | GGUF K-quant | 5.5 | 5.78 GB | Local inference with extra RAM budget for quality |
| GGUF Q4_K_M | Balanced | GGUF K-quant | 4.5 | 4.74 GB | Default local LLM deployment on laptops/desktops |
| AWQ (usually 4-bit) | Method-sensitive | Activation-aware weight quantization | 4 | 4.21 GB | Quality-sensitive 4-bit inference |
| GPTQ (usually 4-bit) | Method-sensitive | Post-training quantization | 4 | 4.21 GB | Single-GPU serving of larger models |
| INT4 / 4-bit | Aggressive compression | Integer quantized | 4 | 4.21 GB | Local LLM deployment |
| GGUF Q3_K_M | Aggressive compression | GGUF K-quant | 3.44 | 3.62 GB | Low-RAM local inference when Q4 does not fit |
| INT3 / 3-bit | Experimental | Integer quantized | 3 | 3.16 GB | Extreme memory constraints |
| GGUF Q2_K | Aggressive compression | GGUF K-quant | 2.56 | 2.70 GB | Last-resort fit for very low RAM |
| INT2 / 2-bit | Experimental | Integer quantized | 2 | 2.11 GB | Very constrained edge experiments |
| 1-bit / Binary | Research only | Binary / ternary-native methods | 1.58 | 1.66 GB | Research |
Memory Calculator + Runtime Profiles + KV Cache Precision
Formula base: raw weight size times a 1.2x runtime overhead, plus roughly 0.000002 GB of KV cache per context token. Active profile: directional default for mixed deployment stacks.
KV note: full-precision KV cache gives the highest fidelity and the largest memory footprint.

Example (7B model, FP16 weights, ~4K context):

| Component | Size |
|---|---|
| Weights | 14.00 GB |
| Runtime (1.2x overhead) | 16.80 GB |
| KV + Context | 0.01 GB |
| Total | 16.81 GB |
Benchmarks + Confidence + Export
| Precision | Workload | Hardware | Tokens/s | Latency / Notes | Confidence | Source |
|---|---|---|---|---|---|---|
| INT4 / GGUF Q4-class | Text generation (tg 128) | Single CUDA GPU (llama.cpp sample) | 131.42 | Sample benchmark output | locally reproducible recipe | Link |
| INT4 / GGUF Q4-class | Text generation (tg 128) | Single CUDA GPU (llama.cpp sample) | 82.17 | Sample benchmark output | locally reproducible recipe | Link |
| GGUF Q4_K_M | Text generation (tg128 @ context depth 512) | Single CUDA GPU (llama.cpp sample) | 116.71 | Sample benchmark output | locally reproducible recipe | Link |
| FP8 vs FP16 | Mixtral 8x7B throughput vs latency | 2x NVIDIA H100 SXM | 21,000 | Under 0.5 s response limit (plot-based) | source-reported | Link |
| GPTQ (3-4 bit) | End-to-end inference speedup relative to FP16 | A100 and A6000 (paper) | N/A | 3.25x (A100), 4.5x (A6000) speedup bands | source-reported | Link |
| AWQ 4-bit | On-device 4-bit acceleration | Desktop and mobile GPUs (paper) | N/A | More than 3x over HF FP16 implementation | source-reported | Link |
| FP32/FP16/INT8/INT4 | Deployment sizing baseline | Generic planning model | N/A | N/A | example baseline | Link |
Reproducible Benchmark Runner
llama.cpp benchmark (GGUF)
```shell
llama-bench -m ./models/model.Q4_K_M.gguf -ngl 99 -p 512 -n 128 -r 5 -o json
```
Output schema: model, n_prompt_tokens, n_gen_tokens, tps, memory, backend, commit
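A benchmark run is only useful once its JSON is aggregated. A small sketch of that step, assuming rows follow the schema quoted above; real llama-bench output field names may differ by version, and the sample values are illustrative:

```python
import json

# Hypothetical rows following the schema above; real output may differ by version.
sample = '''[
  {"model": "model.Q4_K_M.gguf", "n_prompt_tokens": 512, "n_gen_tokens": 128,
   "tps": 131.42, "memory": "4.74 GB", "backend": "CUDA", "commit": "abc1234"},
  {"model": "model.Q4_K_M.gguf", "n_prompt_tokens": 512, "n_gen_tokens": 128,
   "tps": 82.17, "memory": "4.74 GB", "backend": "CUDA", "commit": "abc1234"}
]'''

runs = json.loads(sample)
mean_tps = sum(r["tps"] for r in runs) / len(runs)   # average tokens/s across runs
print(f"mean tokens/s across {len(runs)} runs: {mean_tps:.2f}")
```

Keeping the `backend` and `commit` fields alongside the throughput number is what makes a row reproducible later.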
vLLM throughput run
```shell
python benchmarks/benchmark_throughput.py --model your/model --dtype float16 --output-json benchmark.json
```
Output schema: model, dtype, request_rate, throughput_tokens_s, latency_p50_ms, latency_p95_ms, gpu
TensorRT-LLM benchmark
```shell
trtllm-bench --engine_dir ./engine_fp8 --input_len 512 --output_len 128 --batch_size 16 --dump_json trt_bench.json
```
Output schema: engine, precision, batch_size, throughput_tokens_s, latency_ms, gpu_sku
Accuracy vs Memory Tradeoff
Deployment Compatibility Matrix
| Stack | Support Summary |
|---|---|
| PyTorch | fp32: native; bf16: native; fp16: native; tf32: nvidia-mode; fp8: partial; int8: via-quantization; int4: via-quantization; gptq: plugin; awq: plugin; gguf_q4_k_m: no-native |
| ONNX Runtime | fp32: native; bf16: native; fp16: native; int8: native; int4: native; gptq: conversion-path; awq: conversion-path |
| TensorRT-LLM | fp16: native; bf16: native; fp8: native; int8: native; int4: native; gptq: conversion-path; awq: conversion-path |
| vLLM | fp16: native; bf16: native; int8: partial; gptq: native; awq: native; fp8: partial |
| llama.cpp | gguf_q2_k: native; gguf_q3_k_m: native; gguf_q4_k_m: native; gguf_q5_k_m: native; gguf_q6_k: native; gguf_q8_0: native; int8: via-gguf; int4: via-gguf; fp16: native |
| Ollama | gguf_q2_k: native; gguf_q3_k_m: native; gguf_q4_k_m: native; gguf_q5_k_m: native; gguf_q6_k: native; gguf_q8_0: native; fp16: model-dependent |
Cloud Cost Estimator (Cost per 1M tokens)
Example estimate:

| Metric | Value |
|---|---|
| Effective tokens/sec | 85.42 |
| USD per 1M tokens | $3.577 |
| USD per request | $0.0043 |
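The per-token economics reduce to one division. In the sketch below, the $1.10/hour instance price and 1,200-token request size are hypothetical inputs chosen to reproduce the example figures above, not quoted cloud prices:

```python
def cost_per_million_tokens(gpu_usd_per_hour: float, tokens_per_sec: float) -> float:
    # Tokens produced in one billed hour, then dollars per 1M tokens.
    tokens_per_hour = tokens_per_sec * 3600
    return gpu_usd_per_hour / tokens_per_hour * 1_000_000

per_million = cost_per_million_tokens(1.10, 85.42)   # hypothetical $1.10/hr instance
per_request = per_million / 1_000_000 * 1200         # hypothetical 1,200-token request
print(f"${per_million:.3f} per 1M tokens, ${per_request:.4f} per request")
# → $3.577 per 1M tokens, $0.0043 per request
```

Note that `tokens_per_sec` must be the effective rate under your real batch sizes and request mix, not a single-stream peak.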
Real-World Scenario Mapping
Laptop CPU User
CPU-first, 16-32GB RAM
Recommended: GGUF Q4_K_M or Q5_K_M
Balances RAM fit and answer quality for local assistants and coding help.
8GB GPU User
Consumer GPU, limited VRAM
Recommended: INT4 / AWQ / GPTQ
Aggressive quantization makes 7B-13B class models feasible.
24GB GPU User
Prosumer GPU or workstation
Recommended: FP16 or INT8, depending on quality target
Enough VRAM to pick quality-first or density-first serving.
Cloud Production Serving
Managed GPU fleet
Recommended: INT8 for density, FP16 for premium quality tiers
Tiered precision policy helps control cost while preserving SLA quality.
Precision Recommender + Copy-Ready Deployment Snippets
- Recommended precision: INT8
- Why: INT8 often gives a strong cost-quality balance in production inference pipelines.
- Fit badge: fits 16GB
- Quality warning: kernel/runtime choice can materially affect realized speed and quality.
Example snippet (vLLM serving at FP16):

```shell
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3-8B-Instruct --dtype float16 --max-model-len 8192
```
Prompt Regression Pack
Reasoning
Solve this multi-step logic puzzle and show each step clearly before final answer: ...
Pass criteria: No contradictions and final answer matches reference.
Coding
Given this failing unit test and function, produce minimal patch and explain root cause: ...
Pass criteria: Patch is minimal, tests pass, explanation identifies exact bug.
Long Context
Use only the supplied long document and answer with citations from sections.
Pass criteria: No fabricated citations; references map to provided sections.
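A regression pack like the one above can be driven by a small harness. The sketch below is a minimal shape, not a full evaluator: it assumes each deployment (say an FP16 and an INT4 endpoint) is wrapped as a callable returning text, and the toy pass check stands in for the richer criteria listed above:

```python
from typing import Callable, Dict, List

def run_regression(prompts: List[Dict], models: Dict[str, Callable[[str], str]]):
    """Run every prompt through every precision variant and record pass/fail."""
    results = {}
    for name, generate in models.items():
        results[name] = []
        for case in prompts:
            output = generate(case["prompt"])
            results[name].append({"id": case["id"],
                                  "passed": case["check"](output)})
    return results

# Toy stand-ins for real endpoints; replace with API calls to each deployment.
models = {
    "fp16": lambda p: "final answer: 42",
    "int4": lambda p: "answer 42",
}
prompts = [{"id": "reasoning-1",
            "prompt": "Solve this multi-step logic puzzle ...",
            "check": lambda out: "42" in out}]

report = run_regression(prompts, models)
print(report)
```

Running the identical pack against each precision variant is what makes the pass rates comparable across FP16, INT8, and 4-bit builds.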
Response Diff Viewer (Sample)
FP16 sample output
The deployment plan should use canary traffic, monitor p95 latency, and keep an INT8 rollback path with automatic threshold triggers.
INT4 sample output
Deploy canary first. Track latency and quality. If responses get unstable, revert to 8-bit or 16-bit quickly.
Sources and Methodology
Last updated: 2026-04-20
Confidence tags distinguish source-reported, example baseline, and locally reproducible recipe rows.
- official-doc: PyTorch AMP (mixed precision)
- official-doc: Cloud TPU BF16 guide
- official-doc: ONNX Runtime quantization
- official-doc: Transformers bitsandbytes quantization
- official-repo: llama.cpp tensor encoding schemes
- official-repo: llama.cpp llama-bench README
- vendor-benchmark: TensorRT-LLM Mixtral FP8 vs FP16 blog
- vendor-doc: NVIDIA TF32 explainer
- paper: GPTQ paper
- paper: AWQ paper
- paper: 1-bit LLM paper (BitNet b1.58)
Deployment Checklist
- Run identical prompt suites across FP16/BF16, INT8, and 4-bit variants.
- Measure memory, throughput, latency, and quality together for each precision.
- Validate long-context and tool-calling behavior before rollout.
- Label benchmark rows by provenance to avoid over-generalizing numbers.
- Define rollback thresholds and keep higher-precision fallback ready.
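The last checklist item can be made concrete as a simple guardrail. In this sketch the latency and pass-rate thresholds are placeholder values you would tune against your own SLA, not recommendations:

```python
def should_rollback(p95_latency_ms: float, quality_pass_rate: float,
                    max_p95_ms: float = 800.0, min_pass_rate: float = 0.95) -> bool:
    """Trigger fallback to the higher-precision deployment when either
    p95 latency or regression-pack pass rate crosses its threshold."""
    return p95_latency_ms > max_p95_ms or quality_pass_rate < min_pass_rate

print(should_rollback(640.0, 0.97))  # healthy: stay on the quantized build
print(should_rollback(640.0, 0.91))  # quality regression: roll back
print(should_rollback(900.0, 0.99))  # latency breach: roll back
```

Wiring a check like this into the canary stage turns "keep a fallback ready" from a manual judgment into an automatic trigger.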
FAQ
Should I always start with INT4 for deployment?
Not always. Start with INT8 or FP16/BF16 when quality risk is high, then downshift only if memory or cost constraints demand it.
Is BF16 better than FP16?
For training, BF16 is often preferred due to numerical range. For inference, FP16 and BF16 are both strong options depending on hardware and runtime support.
Are GGUF quantizations only for CPU?
No. GGUF is popular for CPU-first local inference, but can also run on compatible GPU backends through llama.cpp-based stacks.
How reliable are benchmark numbers across setups?
Throughput and latency vary significantly by hardware, runtime, and prompt shape. Use source tags and reproduce critical tests on your own environment.