Physical Hardware
GPU chip structure, SM internals, and compute building blocks.
Streaming Multiprocessor (SM)
What it is
A Streaming Multiprocessor (SM) is the main execution unit inside an NVIDIA GPU: the hardware that actually runs your CUDA programs and executes their instructions. If a GPU is a factory, the SM is one complete department, with its own workers, tools, storage, and manager.
The analogy: a core is to a CPU what an SM is to a GPU. Each SM keeps up to 2,048 threads resident and executes them in parallel.
What an SM is responsible for
- Executing instructions (SASS machine code)
- Managing threads — assigning them to execution units
- Scheduling work — picking which warp runs each cycle
- Storing thread state in the register file
- Coordinating memory access — shared memory, L1 cache
Internal components of one SM
Compute units
- CUDA cores → execute scalar arithmetic (add, multiply)
- Tensor cores → execute matrix multiply-accumulate (MMA)
- Special Function Units (SFU) → sin, cos, sqrt, exp
Memory units
- Register file → fastest storage, private per thread
- Shared memory / L1 cache → fast, shared by block
- Load/Store Units (LSU) → move data to/from memory
Scheduling units
- Warp schedulers → decide which warp executes each clock cycle
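The warp schedulers above can be sketched as a toy round-robin model. This is an illustration only, not a description of the real scheduler hardware; the split of 4 schedulers with 16 warps each mirrors a Hopper/Ampere SM at full occupancy (64 warps), and the "stalled" set stands in for warps waiting on memory.

```python
# Toy model of one SM's warp schedulers (illustration only, not real hardware).
# Assumptions: 4 schedulers, each owning its own pool of warps; every cycle
# each scheduler issues one eligible (non-stalled) warp, round-robin.

def issue_cycle(warp_pools, stalled):
    """Return the warp id each scheduler issues this cycle (None if all its warps stall)."""
    issued = []
    for pool in warp_pools:
        pick = next((w for w in pool if w not in stalled), None)
        issued.append(pick)
        if pick is not None:            # rotate so a different warp goes first next cycle
            pool.remove(pick)
            pool.append(pick)
    return issued

# 4 schedulers, 16 warps each (64 warps/SM, as on Hopper/Ampere at full occupancy)
pools = [list(range(s * 16, (s + 1) * 16)) for s in range(4)]
print(issue_cycle(pools, stalled={0, 16}))   # warps 0 and 16 wait on memory -> [1, 17, 32, 48]
```

The point of the sketch: because every warp's state lives in the register file, "switching" to another warp is just picking a different entry from the pool; nothing is saved or restored.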
SM specs across GPU generations
| GPU | Architecture | SMs | CUDA cores/SM | Tensor cores/SM | Max warps/SM | L1 + shared mem/SM |
|---|---|---|---|---|---|---|
| H100 SXM | Hopper | 132 | 128 | 4 | 64 | 256 KB |
| A100 | Ampere | 108 | 64 | 4 | 64 | 192 KB |
| RTX 4090 | Ada | 128 | 128 | 4 | 48 | 128 KB |
| RTX 3090 | Ampere | 82 | 128 | 4 | 48 | 128 KB |
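The per-GPU totals follow directly from the table: multiply the per-SM figures by the SM count. A quick check for the two GeForce cards:

```python
# Per-GPU totals implied by the table rows (counts only; clock speeds differ per product).
THREADS_PER_WARP = 32

def sm_totals(sms, cores_per_sm, max_warps_per_sm):
    """Total CUDA cores and maximum resident threads for a GPU."""
    return sms * cores_per_sm, sms * max_warps_per_sm * THREADS_PER_WARP

cores_4090, threads_4090 = sm_totals(128, 128, 48)   # RTX 4090 row
cores_3090, threads_3090 = sm_totals(82, 128, 48)    # RTX 3090 row
print(cores_4090, threads_4090)   # 16384 196608
print(cores_3090, threads_3090)   # 10496 125952
```

These match the advertised totals for both cards (16,384 and 10,496 CUDA cores respectively).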
SM vs CPU core — key differences
| Feature | CPU Core | GPU SM |
|---|---|---|
| Main purpose | Fast sequential execution | Massive parallel execution |
| Threads per unit | 1–2 (with SMT) | Up to 2,048 resident threads |
| Context switch | ~1,000+ cycles (OS-managed, slow) | Effectively free (warp state stays resident in the register file) |
| Cache size | Very large (tens of MB across levels) | Small (256 KB L1/shared per SM) |
| Control logic | Complex (branch predict) | Simpler (no speculation) |
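The table's trade-off has a consequence worth spelling out: the SM tolerates slow memory not by avoiding stalls but by holding enough warps that some warp is always ready. A back-of-envelope model (the stall length and issue rate below are illustrative assumptions, not Hopper specifications):

```python
# Back-of-envelope latency hiding: if each warp can issue only once every
# `stall_cycles` (waiting on a dependency), a scheduler that issues one warp
# per cycle needs `stall_cycles` resident warps to never sit idle.
# Illustrative numbers only, not measured hardware values.

def warps_needed(stall_cycles, issue_per_cycle=1):
    """Resident warps per scheduler so one is always ready to issue."""
    return stall_cycles * issue_per_cycle

per_scheduler = warps_needed(16)      # assume a ~16-cycle dependency stall
per_sm = per_scheduler * 4            # 4 schedulers per SM
print(per_sm * 32)                    # 2048 threads: the resident-thread figure above
```

Under these assumed numbers, 64 resident warps per SM (2,048 threads) is exactly what keeps all four schedulers busy, which is why free context switching, not large caches, is the GPU's answer to latency.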
Parallel execution numbers (H100)
| Quantity | Value |
|---|---|
| SMs | 132 |
| Warp schedulers per SM | 4 |
| Threads per warp | 32 |
| Threads issued per SM per cycle | 128 (4 × 32) |
| Threads issued per GPU per cycle | 16,896 (132 × 128) |
| Resident (concurrent) threads per GPU | 270,336 (132 × 64 × 32) |
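The arithmetic behind these figures can be checked directly from the per-SM specs:

```python
# Deriving the H100 throughput and occupancy numbers quoted above.
sms = 132
schedulers_per_sm = 4
threads_per_warp = 32
max_warps_per_sm = 64

issued_per_sm = schedulers_per_sm * threads_per_warp      # per clock cycle
issued_per_gpu = sms * issued_per_sm
resident_threads = sms * max_warps_per_sm * threads_per_warp

print(issued_per_sm, issued_per_gpu, resident_threads)    # 128 16896 270336
```

Note the distinction the two totals encode: 16,896 threads can *issue an instruction* in a single cycle, while 270,336 threads can be *resident* (loaded and ready to run) at once; the 16× gap is the pool the schedulers draw from to hide latency.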
Factory analogy
- GPU = factory
- SM = department inside the factory
- GPU cores = workers who do actual work
- Warp scheduler = manager who assigns tasks