"Hell I can also go on the anthropic API right now and get verbatim static results."
How?
Setting temperature to 0 won't guarantee the exact same output for the exact same input, because - as the previous commenter said - floating point arithmetic is non-commutative, which becomes important when you are running parallel operations on GPUs.
Shouldn't it be the fact that they're non-associative? Floating-point addition is commutative (a + b == b + a always holds), but it is not associative: the reduction kernels combine partial results (like the dot products in a GEMM or the sum across attention heads) in an order that can change between runs, which can cause the individual floats to be rounded differently.
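A minimal illustration of the distinction (pure Python, IEEE 754 doubles): swapping operands never changes the result, but regrouping the same sum can.

```python
# Floating-point addition is commutative but NOT associative.
a, b, c = 0.1, 0.2, 0.3

# Commutative: operand order alone never changes the result.
print(a + b == b + a)            # True

# Non-associative: grouping changes which intermediate results get rounded.
left = (a + b) + c               # 0.1 + 0.2 rounds to 0.30000000000000004 first
right = a + (b + c)              # 0.2 + 0.3 rounds to 0.5 first
print(left, right)               # 0.6000000000000001 0.6
print(left == right)             # False
```

A parallel reduction on a GPU is effectively picking one of many possible groupings of the same sum, and that grouping can differ from run to run depending on how work gets scheduled, which is why temperature 0 alone doesn't guarantee bit-identical outputs.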