GPU Performance Analysis

Use this page to understand whether a workload is limited by memory bandwidth or compute throughput. The roofline model is useful because it turns low-level measurements into a decision you can act on: optimize memory movement, optimize arithmetic intensity, or accept that the kernel is already near a practical limit.
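The decision described above can be sketched in a few lines, assuming the standard roofline formulation: attainable throughput is the smaller of the peak compute rate and arithmetic intensity multiplied by peak memory bandwidth. The function names and the hardware figures below are illustrative assumptions, not values for any particular GPU.

```python
# Hypothetical hardware peaks, chosen only to make the arithmetic easy to follow.
PEAK_GFLOPS = 10_000.0   # assumed peak compute throughput, GFLOP/s
PEAK_BW_GBS = 1_000.0    # assumed peak memory bandwidth, GB/s


def attainable_gflops(arithmetic_intensity, peak_gflops, peak_bw_gbs):
    """Roofline ceiling for a kernel with the given FLOP/byte ratio."""
    memory_roof = arithmetic_intensity * peak_bw_gbs
    return min(peak_gflops, memory_roof)


def bound_type(arithmetic_intensity, peak_gflops, peak_bw_gbs):
    """Classify a kernel by which roof it reaches first."""
    # The ridge point is the intensity at which the two roofs meet.
    ridge = peak_gflops / peak_bw_gbs
    return "memory-bound" if arithmetic_intensity < ridge else "compute-bound"


if __name__ == "__main__":
    for ai in (2.0, 50.0):
        print(ai,
              attainable_gflops(ai, PEAK_GFLOPS, PEAK_BW_GBS),
              bound_type(ai, PEAK_GFLOPS, PEAK_BW_GBS))
```

With these assumed peaks the ridge point is 10 FLOP/byte, so a kernel at 2 FLOP/byte is capped by bandwidth and a kernel at 50 FLOP/byte is capped by compute.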

How to use this page

  1. Start with the workload or kernel you actually care about, not a synthetic example.
  2. Use the roofline output to identify whether memory or compute is the first bottleneck.
  3. Only then decide whether to change kernel structure, precision, memory access, or hardware.
  4. Re-run after each change so performance work stays evidence-based.
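One way to keep the re-run step evidence-based is to reduce every run to the same three numbers. The sketch below assumes you already have a FLOP count, a count of bytes moved to and from device memory, and a measured kernel time from your profiler. The helper name `roofline_point` and the sample figures are hypothetical.

```python
def roofline_point(flops, bytes_moved, seconds):
    """Turn raw counts and a measured time into roofline coordinates."""
    return {
        "arithmetic_intensity": flops / bytes_moved,    # FLOP per byte
        "achieved_gflops": flops / seconds / 1e9,       # GFLOP/s
        "achieved_gbs": bytes_moved / seconds / 1e9,    # GB/s
    }


if __name__ == "__main__":
    # Hypothetical measurements of the same kernel before and after a change.
    before = roofline_point(flops=2e9, bytes_moved=12e9, seconds=0.020)
    after = roofline_point(flops=2e9, bytes_moved=12e9, seconds=0.015)
    for label, point in (("before", before), ("after", after)):
        print(label, point)
```

Comparing the two dictionaries after each change shows whether the kernel moved toward the roof or only shifted along it.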

Why this matters

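Many optimization attempts fail because teams start by changing the wrong thing. If a kernel is memory bound, tuning its arithmetic will not shorten the runtime much, because the math units are already waiting on data. If it is compute bound, cache tweaks have limited impact, because data already arrives faster than it can be consumed.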

After using this page, go to the execution page for warp behavior, or back to the GPU hub for the wider toolchain.

Memory-bound clue

If arithmetic intensity is low and bandwidth is saturated early, your biggest wins usually come from data movement, coalescing, caching, or tiling.
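To illustrate why tiling helps a memory-bound kernel, here is a rough sketch of how arithmetic intensity changes for a square FP32 matrix multiply. It uses an idealized model: each input element is fetched from device memory `n / tile` times, and output writes and cache effects are ignored. The function name and the model itself are simplifying assumptions, not a description of any specific kernel.

```python
def matmul_intensity(n, tile=1, bytes_per_element=4):
    """Idealized FLOP/byte for an n x n matmul with square tiles of side `tile`.

    tile=1 models a naive kernel with no reuse of loaded values.
    """
    flops = 2 * n**3                        # one multiply and one add per term
    elements_loaded = 2 * n**3 / tile       # A and B, each reused `tile` times
    bytes_moved = elements_loaded * bytes_per_element
    return flops / bytes_moved


if __name__ == "__main__":
    for tile in (1, 16, 32):
        print(tile, matmul_intensity(1024, tile))
```

Under this model the intensity is `tile / 4` FLOP/byte for FP32, so larger tiles move the kernel to the right on the roofline plot, away from the bandwidth roof. Real kernels will see smaller gains once on-chip memory capacity and occupancy limit the tile size.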

Compute-bound clue

If arithmetic intensity is high and the kernel sits under the compute roof rather than the bandwidth roof, look at tensor core usage, instruction mix, occupancy, and whether your kernel is already close to hardware limits.
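A quick check for "already close to hardware limits" is the fraction of the applicable roof that the kernel reaches. The sketch below assumes the same roofline formulation as the rest of this page. The function name and the sample numbers are hypothetical, and what counts as "close enough" is a judgment call for your workload.

```python
def roof_fraction(achieved_gflops, arithmetic_intensity, peak_gflops, peak_bw_gbs):
    """Fraction of the roofline ceiling reached at this arithmetic intensity."""
    roof = min(peak_gflops, arithmetic_intensity * peak_bw_gbs)
    return achieved_gflops / roof


if __name__ == "__main__":
    # Hypothetical compute-bound kernel on hypothetical hardware.
    fraction = roof_fraction(
        achieved_gflops=8_500.0,
        arithmetic_intensity=40.0,
        peak_gflops=10_000.0,
        peak_bw_gbs=1_000.0,
    )
    print(f"{fraction:.0%} of the compute roof")
```

A kernel that already reaches most of its roof leaves little room for further tuning at the same precision, which is when a change of precision or hardware becomes the more realistic option.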