
Streaming Multiprocessor

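The streaming multiprocessor (SM) is the basic compute unit of a modern GPU: each SM schedules warps and houses the CUDA cores, Tensor Cores, and on-chip memory that kernels run on. This page summarizes the hardware blocks that matter most when tuning AI and inference performance.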

Streaming Multiprocessor

The core execution block of the GPU, where warps are scheduled and issued to the CUDA-core and tensor pipelines.

Tensor Cores

Specialized matrix units for BF16/FP16/FP8 workloads and high-throughput AI kernels.

Memory Hierarchy

Registers, shared memory, the L2 cache, and HBM/GDDR bandwidth together bound achievable performance.
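
One common way to reason about how memory bandwidth limits performance is the roofline model: attainable throughput is the smaller of the compute peak and the memory bandwidth multiplied by the kernel's arithmetic intensity (FLOPs per byte moved). The sketch below is illustrative only. The function name `attainable_tflops` and the device figures `PEAK_TFLOPS` and `BANDWIDTH_TBPS` are hypothetical placeholders, not specifications of any real GPU.

```python
def attainable_tflops(peak_tflops: float,
                      bandwidth_tbps: float,
                      intensity_flop_per_byte: float) -> float:
    """Roofline bound: min(compute roof, bandwidth * arithmetic intensity).

    Units work out because TB/s * FLOP/byte = TFLOP/s.
    """
    return min(peak_tflops, bandwidth_tbps * intensity_flop_per_byte)


# Hypothetical device figures, chosen only to make the arithmetic easy to follow.
PEAK_TFLOPS = 100.0     # assumed peak math throughput, TFLOP/s
BANDWIDTH_TBPS = 2.0    # assumed DRAM (HBM/GDDR) bandwidth, TB/s

# Ridge point: the intensity at which a kernel stops being memory-bound.
ridge = PEAK_TFLOPS / BANDWIDTH_TBPS

for intensity in (1.0, 10.0, 50.0, 200.0):
    bound = attainable_tflops(PEAK_TFLOPS, BANDWIDTH_TBPS, intensity)
    regime = "memory-bound" if intensity < ridge else "compute-bound"
    print(f"{intensity:6.1f} FLOP/byte -> {bound:6.1f} TFLOP/s ({regime})")
```

Under these assumed figures, a kernel below the ridge point gains more from reducing memory traffic (for example, reusing data in registers or shared memory) than from faster math units.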

SM Diagram Snapshot

Warp Schedulers: 4x
Dispatch Units: 8x
CUDA Cores: 128x
Tensor Cores: 4x
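
The snapshot counts above can be turned into per-scheduler figures with simple division. The sketch below assumes the SM's resources are split evenly across its warp schedulers, which is a simplification; the exact partitioning differs between GPU architectures. The `SMSnapshot` class and its method names are hypothetical and exist only for this illustration. The warp size of 32 threads is the standard value on NVIDIA GPUs.

```python
from dataclasses import dataclass

WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


@dataclass(frozen=True)
class SMSnapshot:
    """Unit counts taken from the SM diagram snapshot on this page."""
    warp_schedulers: int = 4
    dispatch_units: int = 8
    cuda_cores: int = 128
    tensor_cores: int = 4

    # The helpers below assume an even split across schedulers (a simplification).
    def cuda_cores_per_scheduler(self) -> int:
        return self.cuda_cores // self.warp_schedulers

    def dispatch_units_per_scheduler(self) -> int:
        return self.dispatch_units // self.warp_schedulers

    def tensor_cores_per_scheduler(self) -> int:
        return self.tensor_cores // self.warp_schedulers

    def warp_wide_lanes_per_scheduler(self) -> float:
        """How many full warps the scheduler's CUDA cores cover at once."""
        return self.cuda_cores_per_scheduler() / WARP_SIZE


sm = SMSnapshot()
print("CUDA cores per scheduler:    ", sm.cuda_cores_per_scheduler())
print("Dispatch units per scheduler:", sm.dispatch_units_per_scheduler())
print("Tensor Cores per scheduler:  ", sm.tensor_cores_per_scheduler())
print("Warps covered per scheduler: ", sm.warp_wide_lanes_per_scheduler())
```

With these counts, each scheduler is paired with 32 CUDA cores, which is exactly one warp's worth of lanes, along with two dispatch units and one Tensor Core.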