NVIDIA's Blackwell GPUs doubled tensor core throughput. FlashAttention-4 shows why that created a new problem — and how to work around it.

The Asymmetry Problem

Blackwell B200 GPUs deliver 2.25 petaFLOPS of tensor core throughput for BF16, doubling the Hopper H100's 1 petaFLOPS. On paper, that should translate to proportionally faster attention computation. In practice, it does not.

The reason is what the FlashAttention-4 authors — Ted Zadouri and Tri Dao (Princeton, Together AI), Markus Hoehnerbach and Vijay Thakkar (NVIDIA), Jay Shah (Colfax Research), and Timmy Liu (NVIDIA) — call asymmetric hardware scaling. Tensor cores got dramatically faster, but the other units that attention depends on did not. Shared memory bandwidth stays at 128 bytes per clock per streaming multiprocessor. The multi-function unit (MUFU), which computes the exponentials in softmax, remains at 16 operations per clock per SM. Both are unchanged from Hopper.

The result: on Blackwell, shared memory traffic and exponential operations now dominate execution time, exceeding matrix-multiply compute by 25 to 60 percent. The bottleneck has shifted from arithmetic to plumbing.

What FlashAttention-4 Does About It

Tensor cores doubled in speed. Shared memory didn't. That gap is now the real bottleneck in attention computation.

The preprint introduces four interlocking techniques to rebalance this asymmetry.

First, software-emulated exponentials. Instead of waiting for the slow MUFU hardware unit, FA-4 approximates the exponential function using a degree-3 polynomial on the faster FMA (fused multiply-add) units. The maximum relative error is 8.8 × 10^-5 in FP32, and after rounding to BF16, results match the hardware unit to within 1 ULP on 99 percent of inputs. The math is effectively identical, but the compute path is faster.
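The idea can be sketched in a few lines of NumPy. This is an illustration of the general technique, not the paper's kernel: the coefficients below come from a simple least-squares fit rather than FA-4's minimax fit, so the error here is in the 10^-4 range instead of 8.8 × 10^-5, and on a GPU the Horner evaluation would map to three FMA instructions.

```python
import numpy as np

# Software exp2: split x into integer part n and fraction f in [0, 1),
# approximate 2^f with a degree-3 polynomial (three fused multiply-adds
# via Horner's rule), then scale the result by 2^n exactly with ldexp.
# Coefficients from a least-squares fit -- illustrative, not the paper's.
f_grid = np.linspace(0.0, 1.0, 4097)
c3, c2, c1, c0 = np.polyfit(f_grid, np.exp2(f_grid), 3)

def exp2_poly(x):
    n = np.floor(x)
    f = x - n                                 # fractional part in [0, 1)
    p = ((c3 * f + c2) * f + c1) * f + c0     # 3 FMA-shaped steps
    return np.ldexp(p, n.astype(np.int32))    # exact multiply by 2^n

x = np.linspace(-10.0, 10.0, 100_001)
rel_err = np.max(np.abs(exp2_poly(x) - np.exp2(x)) / np.exp2(x))
```

The range reduction is the key trick: the polynomial only ever has to be accurate on a unit interval, and the power-of-two scaling is exact in floating point.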

Second, conditional softmax rescaling. Standard FlashAttention rescales partial softmax results at every tile boundary. FA-4 skips this rescaling when the running maximum shifts by less than a threshold (log2(256) = 8.0), relying on the final normalization pass to correct small deviations. This eliminates unnecessary shared memory round-trips.
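A minimal sketch of that logic, using the standard online-softmax recurrence (running maximum m, running denominator l, running output accumulator): rescaling only adjusts the shared reference point, so as long as every tile's exponentials use the current reference consistently, the final division by l yields the exact softmax. Skipping the rescale when the maximum barely moves is therefore safe as long as exp(score - m) stays in range — here bounded by exp(log 256) = 256, comfortably inside FP32. Function names and the tile size are illustrative, not from the paper.

```python
import numpy as np

def softmax_attn_ref(q, K, V):
    # Plain softmax attention for one query vector, as a reference.
    s = K @ q
    p = np.exp(s - s.max())
    return (p / p.sum()) @ V

def softmax_attn_cond(q, K, V, tile=32, delta=np.log(256.0)):
    m, l = -np.inf, 0.0            # running reference max and denominator
    acc = np.zeros(V.shape[1])
    for i in range(0, len(K), tile):
        s = K[i:i + tile] @ q
        t_max = s.max()
        if t_max - m > delta:      # reference moved a lot: rescale partials
            scale = np.exp(m - t_max)   # exp(-inf) == 0.0 on the first tile
            l *= scale
            acc *= scale
            m = t_max
        # Otherwise skip the rescale: exp(s - m) <= 256, safe in FP32.
        p = np.exp(s - m)
        l += p.sum()
        acc += p @ V[i:i + tile]
    return acc / l                 # final normalization corrects everything
```

In the real kernel the payoff is not the skipped FLOPs but the skipped shared memory round-trips that the rescale would otherwise force.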

Third, the backward pass uses a 2-CTA (cooperative thread array) mode that halves global atomic reductions and reduces the shared memory overhead from 30 percent above MMA compute to just 5 percent above. The roofline analysis shows this brings shared memory cycles from 3,328 down to 2,688 while keeping MMA compute at 2,560 — finally making the backward pass compute-bound rather than memory-bound.
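The percentages and cycle counts quoted above are consistent with each other, which is worth checking, since the whole claim of "compute-bound" rests on the shared memory cycle count dropping below a threshold near the MMA cycle count:

```python
# Sanity-check the roofline figures quoted above: shared-memory cycles
# relative to MMA cycles, before and after the 2-CTA backward change.
mma_cycles = 2560
smem_before, smem_2cta = 3328, 2688   # cycle counts from the paper's analysis

overhead_before = (smem_before - mma_cycles) / mma_cycles  # 0.30 -> 30% above MMA
overhead_2cta = (smem_2cta - mma_cycles) / mma_cycles      # 0.05 -> 5% above MMA
```

With only 5 percent headroom, shared memory traffic largely hides behind the tensor core pipeline instead of stalling it.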

Fourth, the entire implementation is written in CuTe-DSL, a Python-embedded domain-specific language, rather than the C++ templates used by FlashAttention-3. Compile times drop from 55 seconds to 2.5 seconds for the forward kernel and from 45 seconds to 1.4 seconds for the backward kernel — a 20 to 30x improvement that matters for rapid iteration.

The Numbers

Benchmarked on B200 GPUs across sequence lengths from 1K to 32K tokens with BF16 precision and head dimension 128, FlashAttention-4 achieves up to 1.3x speedup over cuDNN 9.13 and 2.7x over Triton. Peak throughput reaches 1,613 TFLOP/s, or 71 percent hardware utilization.
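The utilization figure follows directly from the two throughput numbers already quoted in this article:

```python
# Achieved attention throughput over B200 peak dense BF16 tensor throughput
# (2.25 petaFLOPS = 2250 TFLOP/s), both as quoted in the text.
peak_tflops = 2250.0
achieved_tflops = 1613.0
utilization = achieved_tflops / peak_tflops   # ~0.717, the ~71 percent quoted
```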

Notably, FlashAttention-3 does not run on B200 at all. The Hopper-specific kernel design means it cannot simply be recompiled for Blackwell. This is not a minor version bump — it is a ground-up rearchitecture forced by the hardware's changed balance of resources.

The deterministic backward mode, useful for reproducible training, runs at roughly 75 percent the speed of its nondeterministic counterpart. That is a meaningful tax, but substantially better than prior deterministic attention implementations.

Why This Pattern Matters

When hardware units scale unevenly, the fastest path forward is sometimes to emulate the slow ones in software.

The asymmetric scaling problem is not unique to attention. As tensor cores continue to outpace everything else on the die, any operation that touches shared memory, performs non-matmul arithmetic, or requires synchronization will become the relative bottleneck. The techniques here — emulating slow hardware units in software, skipping conditional operations, reducing memory traffic through cooperative execution — are likely to recur across GPU kernel design.

For practitioners working on inference scaling or attention-heavy architectures, the takeaway is concrete: raw FLOPS numbers on spec sheets increasingly misrepresent real attention performance. The gap between peak tensor throughput and achievable attention throughput will keep widening unless kernel design adapts to each generation's specific imbalances.

FlashAttention-4 is that adaptation for Blackwell. The next GPU generation will need its own.