Physical Hardware

GPU chip structure, SM internals, and compute building blocks.

Streaming Multiprocessor (SM)

What it is

A Streaming Multiprocessor (SM) is the main execution unit inside an NVIDIA GPU. It is the hardware component that actually runs your CUDA programs and executes instructions. If a GPU is a factory, the SM is one complete department — with its own workers, tools, storage, and manager.

The analogy: an SM is to a GPU what a core is to a CPU. Each SM can keep up to 2,048 threads resident at once, issuing instructions for up to 128 of them per clock cycle.
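To make the SM-as-execution-unit idea concrete, here is a small, deliberately simplified Python sketch of how a CUDA grid of thread blocks maps onto SMs. The `waves` helper and its parameters are hypothetical illustration, not a real CUDA API: blocks are distributed across SMs and executed in "waves", so a launch of 1,000 blocks on a 132-SM GPU needs at least ceil(1000 / 132) = 8 waves if each SM runs one block at a time.

```python
import math

# Toy model: thread blocks are distributed across SMs and run in "waves".
# `waves` is a hypothetical helper for illustration, not a CUDA API call.
def waves(num_blocks, num_sms=132, blocks_per_sm=1):
    """Minimum number of waves to run `num_blocks` blocks on the GPU."""
    return math.ceil(num_blocks / (num_sms * blocks_per_sm))

print(waves(1000))                    # 8 waves on an H100-like GPU (132 SMs)
print(waves(1000, blocks_per_sm=2))   # 4 waves if each SM fits two blocks
```

In practice, how many blocks fit on one SM at a time (the occupancy) depends on each block's register and shared-memory usage, which is why the `blocks_per_sm` knob matters.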

What an SM is responsible for

  • Executing instructions (SASS machine code)
  • Managing threads — assigning them to execution units
  • Scheduling work — picking which warp runs each cycle
  • Storing thread state in the register file
  • Coordinating memory access — shared memory, L1 cache

Internal components of one SM

Compute units

  • CUDA cores → execute scalar arithmetic (add, multiply)
  • Tensor cores → execute matrix multiply-accumulate (MMA)
  • Special Function Units (SFU) → sin, cos, sqrt, exp

Memory units

  • Register file → fastest storage, private per thread
  • Shared memory / L1 cache → fast, shared by block
  • Load/Store Units (LSU) → move data to/from memory

Scheduling units

  • Warp schedulers → decide which warp executes each clock cycle
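The scheduling idea above can be sketched as a toy model in Python. This is a conceptual illustration only (the `issue_cycle` function and the warp records are made up for this sketch, and real hardware scheduling is far more involved): each cycle, every warp scheduler picks one eligible (not stalled) warp and issues an instruction for all 32 of its lanes.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs

# Toy model of one scheduling cycle. Each of the SM's schedulers picks one
# eligible warp; a warp stalled on (say) a memory load is skipped this cycle.
def issue_cycle(warps, num_schedulers=4):
    """Return the ids of warps that issue an instruction this cycle."""
    eligible = [w for w in warps if not w["stalled"]]
    issued = eligible[:num_schedulers]  # one warp per scheduler
    return [w["id"] for w in issued]

# 8 resident warps; warps 0, 3, 6 are stalled waiting on memory.
warps = [{"id": i, "stalled": i % 3 == 0} for i in range(8)]
print(issue_cycle(warps))  # [1, 2, 4, 5] — 4 warps × 32 threads issued
```

This is why GPUs keep many more warps resident than they can issue per cycle: spare eligible warps let the schedulers hide memory latency by always having something ready to run.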

SM specs across GPU generations

| GPU | Architecture | SMs | CUDA cores/SM | Tensor cores/SM | Warps/SM | Shared mem/SM |
|---|---|---|---|---|---|---|
| H100 SXM | Hopper | 132 | 128 | 4 | 64 | 256 KB |
| A100 | Ampere | 108 | 64 | 4 | 64 | 192 KB |
| RTX 4090 | Ada | 128 | 128 | 4 | 48 | 128 KB |
| RTX 3090 | Ampere | 82 | 128 | 4 | 48 | 128 KB |
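The per-SM figures multiply out to each GPU's headline core count. A short sketch (the `specs` dict is just the SM count and FP32 CUDA cores per SM; note the datacenter A100's GA100 SM has 64 FP32 cores, versus 128 on the consumer Ampere/Ada SMs):

```python
# Total CUDA cores = SMs × FP32 CUDA cores per SM.
specs = {
    "H100 SXM": (132, 128),
    "A100":     (108, 64),   # GA100 SM: 64 FP32 cores
    "RTX 4090": (128, 128),
    "RTX 3090": (82, 128),
}
for gpu, (sms, cores_per_sm) in specs.items():
    print(f"{gpu}: {sms * cores_per_sm} CUDA cores")
# H100 SXM: 16896, A100: 6912, RTX 4090: 16384, RTX 3090: 10496
```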

SM vs CPU core — key differences

| Feature | CPU Core | GPU SM |
|---|---|---|
| Main purpose | Fast sequential execution | Massive parallel execution |
| Threads per unit | 1–2 | Up to 2,048 concurrent threads |
| Context switch | ~1,000 cycles (slow) | 1 clock cycle (warp switch) |
| Cache size | Very large (MB) | Smaller (256 KB) |
| Control logic | Complex (branch prediction) | Simpler (no speculation) |

Parallel execution numbers (H100)

| Metric | Value |
|---|---|
| SMs in H100 | 132 |
| Warp schedulers per SM | 4 |
| Threads per warp | 32 |
| Parallel threads per SM per cycle | 128 (4 × 32) |
| Total parallel threads (H100) | 16,896 (132 × 128) |
| Total concurrent threads (H100) | 270,336 (132 × 2,048) |
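The figures above follow directly from the per-SM numbers; a quick arithmetic check in Python:

```python
# H100 parallelism arithmetic, from the per-SM figures above.
sms = 132
schedulers_per_sm = 4
warp_size = 32
max_threads_per_sm = 2048  # 64 resident warps × 32 threads

issued_per_sm_per_cycle = schedulers_per_sm * warp_size
print(issued_per_sm_per_cycle)        # 128 threads issued per SM per cycle
print(sms * issued_per_sm_per_cycle)  # 16896 threads issued per cycle, GPU-wide
print(sms * max_threads_per_sm)       # 270336 resident (concurrent) threads
```

Note the distinction the two totals capture: 16,896 threads actually issue instructions in a given cycle, while 270,336 threads can be resident on the GPU with their state held in the register files.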

Factory analogy

  • GPU = factory
  • SM = department inside the factory
  • CUDA and Tensor cores = workers who do the actual work
  • Warp scheduler = manager who assigns tasks