Why hardware knowledge matters
AI teams often jump straight to model benchmarks, but hardware limits are what decide whether a model can actually run, how much it costs, and where latency problems show up.
Hardware / Architecture
The streaming multiprocessor (SM) is the main architectural unit in modern NVIDIA GPUs. This page summarizes the hardware blocks that matter most for AI workloads and inference performance tuning.
Use this page to build intuition around the blocks that determine throughput, memory pressure, and deployment feasibility before moving into execution or performance tuning.
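As a rough illustration of deployment feasibility, the sketch below estimates whether a model's weights fit in a given amount of VRAM. The function names, the 20% overhead allowance for activations and KV cache, and the example sizes are all assumptions made for illustration; real footprints depend on the runtime, batch size, and context length.

```python
def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


def fits_in_vram(num_params: float, bytes_per_param: float,
                 vram_gib: float, overhead_fraction: float = 0.2) -> bool:
    """Rough feasibility check.

    overhead_fraction is an assumed allowance for activations, KV cache,
    and runtime buffers; it is a placeholder, not a measured value.
    """
    needed = weight_memory_gib(num_params, bytes_per_param) * (1 + overhead_fraction)
    return needed <= vram_gib


if __name__ == "__main__":
    # Hypothetical 7-billion-parameter model on a hypothetical 24 GiB card.
    print(f"FP16 weights: {weight_memory_gib(7e9, 2):.1f} GiB")
    print("Fits in FP16:", fits_in_vram(7e9, 2, 24))
    print("Fits in FP32:", fits_in_vram(7e9, 4, 24))
```

Halving the bytes per parameter halves the weight footprint, which is why precision is usually the first thing to check when a model does not fit.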
After this overview, continue to Execution for warp behavior or Performance for roofline reasoning. If you already have a workload, jump to the VRAM calculator and GPU picker.
Streaming Multiprocessor (SM): the core execution block, where warps are scheduled and tensor pipelines execute.
Tensor Cores: specialized matrix units for BF16/FP16/FP8 workloads and high-throughput AI kernels.
Memory hierarchy: registers, shared memory, L2 cache, and HBM/GDDR bandwidth together shape achievable performance.
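One common way to see how memory bandwidth caps performance is the roofline model: attainable throughput is the smaller of peak compute and bandwidth multiplied by arithmetic intensity (FLOPs per byte moved). The sketch below is a minimal version of that formula; the peak and bandwidth figures are hypothetical round numbers, not the specification of any particular GPU.

```python
def attainable_flops(peak_flops: float, bandwidth_bytes_per_s: float,
                     arithmetic_intensity: float) -> float:
    """Roofline bound: min(peak compute, bandwidth * FLOPs-per-byte)."""
    return min(peak_flops, bandwidth_bytes_per_s * arithmetic_intensity)


def ridge_point(peak_flops: float, bandwidth_bytes_per_s: float) -> float:
    """Arithmetic intensity (FLOP/byte) above which compute becomes the limit."""
    return peak_flops / bandwidth_bytes_per_s


if __name__ == "__main__":
    PEAK = 100e12      # hypothetical: 100 TFLOP/s
    BANDWIDTH = 1e12   # hypothetical: 1 TB/s

    print("Ridge point (FLOP/byte):", ridge_point(PEAK, BANDWIDTH))
    for intensity in (1, 10, 100, 500):
        tflops = attainable_flops(PEAK, BANDWIDTH, intensity) / 1e12
        print(f"intensity {intensity:>3} FLOP/byte -> at most {tflops:.0f} TFLOP/s")
```

Kernels to the left of the ridge point are limited by memory traffic no matter how many math units the chip has.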
Warp Schedulers: 4x
Dispatch Units: 8x
CUDA Cores: 128x
Tensor Cores: 4x
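To relate these unit counts to a whole device, the sketch below scales them by the number of SMs. It assumes the counts above are per SM, and the SM count and clock speed are hypothetical inputs. The peak FP32 estimate further assumes each CUDA core retires one fused multiply-add (2 FLOPs) per clock.

```python
# Unit counts from the list above, assumed here to be per SM.
PER_SM = {
    "warp_schedulers": 4,
    "dispatch_units": 8,
    "cuda_cores": 128,
    "tensor_cores": 4,
}


def device_totals(num_sms: int) -> dict:
    """Scale the per-SM unit counts to a device with num_sms SMs."""
    return {name: count * num_sms for name, count in PER_SM.items()}


def peak_fp32_tflops(num_sms: int, clock_ghz: float) -> float:
    """Theoretical FP32 peak, assuming one FMA (2 FLOPs) per core per clock."""
    gflops = num_sms * PER_SM["cuda_cores"] * 2 * clock_ghz
    return gflops / 1000


if __name__ == "__main__":
    # Hypothetical device: 100 SMs at 1.5 GHz.
    print(device_totals(100))
    print(f"Peak FP32: {peak_fp32_tflops(100, 1.5):.1f} TFLOP/s")
```

This is a theoretical ceiling only; sustained throughput depends on occupancy, memory traffic, and how well the kernel keeps the schedulers fed.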
Start with blocks
Focus first on what each hardware block does: scheduling, matrix math, caches, and memory transport. That foundation makes later performance discussions much easier to understand.
Then connect to workload
The useful question is always: which of these blocks is likely limiting my model, kernel, or serving pattern? This page is meant to build that habit.
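A minimal sketch of that habit, applied to a matrix multiply: estimate the FLOPs per byte of the operation and compare it with the hardware's compute-to-bandwidth ratio. The function names and hardware figures are hypothetical, and the byte count assumes each matrix is read or written exactly once, which is an idealization of real cache behavior.

```python
def gemm_arithmetic_intensity(m: int, n: int, k: int,
                              bytes_per_element: int = 2) -> float:
    """FLOPs per byte for C[m,n] = A[m,k] @ B[k,n].

    Assumes ideal reuse: A and B are each read once and C is written once.
    """
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


def likely_limiter(intensity: float, peak_flops: float,
                   bandwidth_bytes_per_s: float) -> str:
    """Name the block that probably limits a kernel with this intensity."""
    ridge = peak_flops / bandwidth_bytes_per_s
    return "compute" if intensity >= ridge else "memory bandwidth"


if __name__ == "__main__":
    PEAK = 100e12      # hypothetical: 100 TFLOP/s
    BANDWIDTH = 1e12   # hypothetical: 1 TB/s

    # Single-row multiply (m=1) versus a large square multiply.
    for shape in [(1, 4096, 4096), (4096, 4096, 4096)]:
        ai = gemm_arithmetic_intensity(*shape)
        print(shape, f"{ai:.1f} FLOP/byte ->", likely_limiter(ai, PEAK, BANDWIDTH))
```

Under these assumptions the single-row case lands near 1 FLOP/byte and is limited by memory bandwidth, while the large square case is limited by compute.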