GPU Performance Analysis

Use this page to understand whether a workload is limited by memory bandwidth or compute throughput. The roofline model is useful because it turns low-level measurements into a decision you can act on: reduce memory traffic, raise arithmetic intensity (FLOPs performed per byte moved), or accept that the kernel is already near a practical limit.
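
As a minimal sketch of the roofline model, attainable throughput is the smaller of the compute peak and arithmetic intensity multiplied by peak memory bandwidth. The function names and the hardware peaks below are illustrative assumptions, not figures for any specific GPU; substitute the numbers for your own device.

```python
# Hypothetical hardware peaks, chosen only to make the arithmetic easy to follow.
PEAK_GFLOPS = 10_000.0   # assumed compute roof, in GFLOP/s
PEAK_BW_GBS = 1_000.0    # assumed memory bandwidth roof, in GB/s


def ridge_point(peak_gflops: float, peak_bw_gbs: float) -> float:
    """Arithmetic intensity (FLOP/byte) where the bandwidth roof meets the compute roof."""
    return peak_gflops / peak_bw_gbs


def attainable_gflops(ai: float, peak_gflops: float, peak_bw_gbs: float) -> float:
    """Roofline bound: min(compute roof, bandwidth roof at this intensity)."""
    return min(peak_gflops, ai * peak_bw_gbs)


def classify(ai: float, peak_gflops: float, peak_bw_gbs: float) -> str:
    """Label a kernel by which roof limits it at the given arithmetic intensity."""
    if ai < ridge_point(peak_gflops, peak_bw_gbs):
        return "memory-bound"
    return "compute-bound"


if __name__ == "__main__":
    for ai in (0.25, 4.0, 64.0):
        bound = attainable_gflops(ai, PEAK_GFLOPS, PEAK_BW_GBS)
        label = classify(ai, PEAK_GFLOPS, PEAK_BW_GBS)
        print(f"AI={ai:6.2f} FLOP/byte  bound={bound:8.1f} GFLOP/s  {label}")
```

With these assumed peaks the ridge point is 10 FLOP/byte, so a kernel at 0.25 FLOP/byte can reach at most 250 GFLOP/s no matter how well its arithmetic is tuned.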

Memory-bound clue

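If arithmetic intensity is low (below the ridge point, where the bandwidth roof meets the compute roof) and measured bandwidth is already close to peak, your biggest wins usually come from reducing or restructuring data movement: coalescing accesses, making better use of caches, or tiling to reuse data on chip.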
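
To make the memory-bound clue concrete, the sketch below estimates arithmetic intensity and bandwidth utilization for a single-precision SAXPY (y = a*x + y), which performs 2 FLOPs per element while moving 12 bytes (read x, read y, write y). The helper names, the kernel time, and the peak bandwidth are hypothetical values chosen for illustration, not measurements.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte moved to or from memory."""
    return flops / bytes_moved


def bandwidth_utilization(bytes_moved: float, seconds: float,
                          peak_bw_bytes_per_s: float) -> float:
    """Fraction of peak memory bandwidth that the kernel achieved."""
    return (bytes_moved / seconds) / peak_bw_bytes_per_s


# SAXPY on float32 data: one multiply and one add per element,
# two 4-byte loads and one 4-byte store per element.
n = 1 << 24
flops = 2 * n
bytes_moved = 3 * 4 * n

# Hypothetical measurement and hardware peak (assumptions, not real data).
kernel_seconds = 0.00025
peak_bw = 1.0e12  # bytes per second

ai = arithmetic_intensity(flops, bytes_moved)
util = bandwidth_utilization(bytes_moved, kernel_seconds, peak_bw)

if __name__ == "__main__":
    print(f"arithmetic intensity: {ai:.3f} FLOP/byte")
    print(f"bandwidth utilization: {util:.1%}")
```

An intensity of about 0.17 FLOP/byte combined with utilization near 80% of peak is the pattern described above: the kernel is limited by memory traffic, so tuning its arithmetic is unlikely to help.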

Compute-bound clue

If arithmetic intensity is high enough that the kernel sits under the flat compute roof, look at tensor core usage, instruction mix, occupancy, and whether your kernel is already close to hardware limits.
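
One way to judge how close a compute-bound kernel is to hardware limits is to compare achieved throughput against the compute peak. The sketch below uses a square float32 matrix multiply as the example; the helper names, the timing, and the peak are hypothetical, and the byte count assumes each matrix crosses the memory bus exactly once, which is an optimistic lower bound on real traffic.

```python
def matmul_flops(n: int) -> int:
    """FLOPs for an n x n by n x n matrix multiply (one multiply and one add per term)."""
    return 2 * n ** 3


def matmul_min_bytes(n: int, bytes_per_element: int = 4) -> int:
    """Lower bound on traffic: read A and B once, write C once."""
    return 3 * n * n * bytes_per_element


def compute_efficiency(flops: float, seconds: float, peak_flops_per_s: float) -> float:
    """Fraction of the compute roof that the kernel achieved."""
    return (flops / seconds) / peak_flops_per_s


n = 4096
best_case_ai = matmul_flops(n) / matmul_min_bytes(n)  # simplifies to n / 6 for float32

# Hypothetical measurement and hardware peak (assumptions, not real data).
kernel_seconds = 0.02
peak_flops = 1.0e13  # FLOP/s

efficiency = compute_efficiency(matmul_flops(n), kernel_seconds, peak_flops)

if __name__ == "__main__":
    print(f"best-case arithmetic intensity: {best_case_ai:.1f} FLOP/byte")
    print(f"fraction of compute peak: {efficiency:.1%}")
```

If the fraction of peak is already high, further gains are likely to be small. If it is low despite high arithmetic intensity, instruction mix, occupancy, or unused tensor cores are the more likely culprits.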

Related workflows

Pair this page with the occupancy estimator and warp divergence visualizer to connect bottleneck analysis to kernel behavior.