Hacker News | PhilipTrettner's comments

That's roughly what the "repeated" scenario (about the middle of the post) measures. It's not in work order, but it is the same order every time, so caches can work. And there you see that the working set size matters.

Note that the base setup has zero cache reuse because each run touches a completely different, cold part of memory. (That makes the result more of an upper bound on the needed chunk size.)


It definitely worked on me :)

Do have a look; I've tried to keep it small and readable. It's effectively ~250 LOC.

Also, this is CPU-only. I'm not sure what a good GPU version of my benchmark would look like, though... maybe measuring a "map" rather than the "reduction" I do on the CPU? We should probably look at common chunking patterns there.


I looked into this because part of our pipeline is forced to be chunked. Most advice I've seen boils down to "more contiguity = better", but without numbers, or at least not generalizable ones.

My concrete tasks already reach peak performance below 128 kB, and I couldn't find pure processing workloads that benefit significantly beyond 1 MB chunk sizes. The code is linked in the post; it would be nice to see results from more systems.


Your results match similar analyses of database systems I’ve seen.

64KB-128KB seems like the sweet spot.


Doesn't it depend on what you're doing? xz data compression, or some video codecs? Retrograde chess analysis (endgame tablebases)? Number Field Sieve factorization in the linear algebra phase?

It's actually not a typo. Our "real" internal code starts with integer bounds on the inputs (say 2^26) and then computes, for each subexpression, how many bits are needed to represent it exactly. That can even lead to fractional bit counts (as in "a + b + c"). The generated code then rounds up to the next multiple of 64 bits.


See https://godbolt.org/z/bYb7a38dG

It's basically this: long* and long long* (the pointer types) are not compatible, and uint64_t is the "wrong" typedef on Linux, or at least inconsistent with the way the intrinsics are declared.


We use them for exact predicates in our mesh boolean library. To really handle every degenerate case, we even have to go quite a bit higher than 128 bits in 3D.


Sorry I'm a bit late to the party.

long and long long are convertible; that's not the issue. They are distinct types, though, so long* and long long* are NOT implicitly convertible. And uint64_t is not consistently the correct typedef.

See: https://godbolt.org/z/bYb7a38dG

I'd prefer it if the intrinsics used uint64_t as well, but they don't.

