Streaming Multiprocessor

The Streaming Multiprocessor (SM) is the basic compute building block of modern NVIDIA GPUs. This page summarizes the hardware blocks that matter most for AI workloads and inference performance tuning.

Why hardware knowledge matters

AI teams often jump straight to model benchmarks, but hardware limits decide whether a model can run at all, how much it costs to serve, and where latency problems show up.

What this page helps decide

Use this page to build intuition around the blocks that determine throughput, memory pressure, and deployment feasibility before moving into execution or performance tuning.

Best next step

After this overview, continue to Execution for warp behavior or Performance for roofline reasoning. If you already have a workload, jump to the VRAM calculator and GPU picker.

Streaming Multiprocessor

Core execution block where warps are scheduled and tensor pipelines execute.

Tensor Cores

Specialized matrix units for BF16/FP16/FP8 workloads and high-throughput AI kernels.
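The formats named above also set how much memory the weights occupy, which is often the first feasibility check. The following is a minimal sketch, assuming weights are stored at exactly the listed width with no quantization metadata; the 7B parameter count and the function name are hypothetical and only for illustration.

```python
# Storage width in bytes for each format (16-bit formats use 2 bytes, FP8 uses 1).
BYTES_PER_ELEMENT = {"FP16": 2, "BF16": 2, "FP8": 1}


def weight_footprint_gib(num_params: int, fmt: str) -> float:
    """Return the memory needed for the weights alone, in GiB.

    Ignores activations, KV cache, and framework overhead (assumption).
    """
    return num_params * BYTES_PER_ELEMENT[fmt] / 2**30


if __name__ == "__main__":
    params = 7_000_000_000  # hypothetical 7B-parameter model
    for fmt in BYTES_PER_ELEMENT:
        print(f"{fmt}: {weight_footprint_gib(params, fmt):.1f} GiB")
```

Real deployments need headroom beyond this figure, so treat it as a lower bound rather than a sizing recommendation.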

Memory Hierarchy

Registers, shared memory, L2, and HBM/GDDR bandwidth shape achievable performance.
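To see why bandwidth shapes achievable performance, consider a rough lower bound on single-stream token generation time. This sketch assumes every weight byte is read from device memory once per generated token and that nothing else limits the kernel; the byte count and bandwidth figure are hypothetical placeholders, not specifications of any particular GPU.

```python
def min_seconds_per_token(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Lower bound on time per token if all weights are streamed once.

    Assumes a purely bandwidth-limited pass with perfect memory efficiency.
    """
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return weight_bytes / bandwidth_bytes_per_s


if __name__ == "__main__":
    weight_bytes = 14e9  # hypothetical: 7B parameters at 2 bytes each
    bandwidth = 1e12     # hypothetical: 1 TB/s of device memory bandwidth
    t = min_seconds_per_token(weight_bytes, bandwidth)
    print(f"at least {t * 1e3:.1f} ms per token, at most {1 / t:.0f} tokens/s")
```

Measured numbers will be worse than this bound, because real kernels rarely reach peak bandwidth.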

SM Diagram Snapshot

Warp Schedulers: 4x
Dispatch Units: 8x
CUDA Cores: 128x
Tensor Cores: 4x
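One way to read the snapshot is to divide each count by the number of warp schedulers. The sketch below takes the counts from the snapshot as given and assumes the units are split evenly across schedulers; the class and method names are hypothetical, and actual counts differ between GPU architectures.

```python
from dataclasses import dataclass

WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


@dataclass(frozen=True)
class SMSnapshot:
    """Unit counts copied from the diagram snapshot on this page."""
    warp_schedulers: int = 4
    dispatch_units: int = 8
    cuda_cores: int = 128
    tensor_cores: int = 4

    def cuda_cores_per_scheduler(self) -> int:
        # Assumes an even split of CUDA cores across schedulers.
        return self.cuda_cores // self.warp_schedulers

    def dispatch_units_per_scheduler(self) -> int:
        return self.dispatch_units // self.warp_schedulers

    def tensor_cores_per_scheduler(self) -> int:
        return self.tensor_cores // self.warp_schedulers


if __name__ == "__main__":
    sm = SMSnapshot()
    lanes = sm.cuda_cores_per_scheduler()
    print(f"CUDA cores per scheduler: {lanes}")
    print(f"Matches warp width: {lanes == WARP_SIZE}")
    print(f"Dispatch units per scheduler: {sm.dispatch_units_per_scheduler()}")
    print(f"Tensor Cores per scheduler: {sm.tensor_cores_per_scheduler()}")
```

With these counts, each scheduler has 32 CUDA cores available, the same as the number of threads in one warp.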

How to read GPU hardware pages

Start with blocks

Focus first on what each hardware block does: scheduling, matrix math, caches, and memory transport. That foundation makes later performance discussions much easier to understand.

Then connect to workload

The useful question is always: which of these blocks is likely limiting my model, kernel, or serving pattern? This page is meant to build that habit.
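A simple way to practice that question is a roofline-style check: compare a kernel's arithmetic intensity (FLOPs per byte moved) with the ratio of peak compute to memory bandwidth. This is a minimal sketch under idealized assumptions; the peak and bandwidth values are hypothetical, and the function names are illustrative only.

```python
def attainable_flops(peak_flops: float, bandwidth: float, intensity: float) -> float:
    """Roofline bound: the lower of the compute roof and the bandwidth roof."""
    return min(peak_flops, bandwidth * intensity)


def limiting_block(peak_flops: float, bandwidth: float, intensity: float) -> str:
    """Name the likely limiter for a kernel with the given FLOPs per byte."""
    ridge = peak_flops / bandwidth  # intensity at which the two roofs meet
    return "compute" if intensity >= ridge else "memory bandwidth"


if __name__ == "__main__":
    peak = 100e12  # hypothetical: 100 TFLOP/s of matrix throughput
    bw = 1e12      # hypothetical: 1 TB/s of memory bandwidth
    for intensity in (2.0, 500.0):
        bound = attainable_flops(peak, bw, intensity)
        print(f"{intensity:>6.1f} FLOP/byte: {limiting_block(peak, bw, intensity)}"
              f" limited, at most {bound / 1e12:.0f} TFLOP/s")
```

Low-intensity kernels such as small-batch matrix-vector products tend to land on the bandwidth side, while large matrix multiplications tend to land on the compute side.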