
> We selected the AMD EPYC 7642 processor in a single-socket configuration for Gen X. This CPU has 48-cores (96 threads), a base clock speed of 2.4 GHz, and an L3 cache of 256 MB. While the rated power (225W) may seem high, it is lower than the combined TDP in our Gen 9 servers and we preferred the performance of this CPU over lower power variants. Despite AMD offering a higher core count option with 64-cores, the performance gains for our software stack and usage weren’t compelling enough.

I find this a bit puzzling for density reasons. I can definitely appreciate the clock speed benefits. One 64-core part (AMD EPYC 7742) has the same TDP of 225W, so power should be in the same ballpark. There are also lower-clocked 64-core SKUs with 200W TDPs. I can't imagine price would be a major factor for a company of Cloudflare's size, but it's definitely true that the 48-core part is much cheaper. There's also the 7H12 with a higher base clock than the 48-core part, but its TDP is 280W.

All of these EPYC chips have the same monstrous 256MB of L3, so maybe part of Cloudflare's workloads maxes out the cache before being able to feed all 64 cores, but that's a bit wishy-washy. Maybe, since they also all have the same PCIe lane capacity, 48 cores is the sweet spot.

The 64-core parts still seem like a no-brainer.



At least in my experience, for highly optimized code (which I imagine theirs is), above certain core densities you run out of memory bandwidth, PCIe lanes, etc. before you run out of cores, unless your workload is particularly compute-intensive. Power consumption also needs to be modeled into the hardware cost, so higher TDP does matter. It is a cost optimization problem, and those 64-core parts may offer little marginal gain for the added cost if the bottlenecks are elsewhere in the system.

My experience buying CPUs for data-intensive servers has typically been that optimizing the performance-density-cost curve often points to a mid-range number of cores at a lower-middle clock rate. These CPUs are inexpensive relative to their product families while still having enough horses to drive your memory, PCIe, etc. to saturation with highly optimized code. Just enough resources, but no more.


hey, this is Rami ... I lead the HW team at Cloudflare. Great Q! In our case, there was a sweet spot between the number of cores, the L3 cache per core, NUMA latencies, memory bandwidth per core, cost & power. The 48c gave us the best requests per second per $, and it was likely due to a combination of all these things.

We have two more blog posts coming out, probably today, which will shed more light on why AMD worked better for us.


Hi, this is Drew from Netflix. Just curious, are you running single-socket Rome in NUMA or non-NUMA mode? I've seen better performance in NPS=4 mode myself. We seem to see about a 7% increase when moving from NPS=1 to NPS=2 and another 7% going to NPS=4.


Yes, we got the most perf when we used NPS=4.

Stay tuned for 3 more blog posts this week ... they will give a deeper analysis of the perf gains we saw and why.


If you look at how the chiplets are organized, you technically have 4 cores sharing a bank of L3 (2 of these 4-core groups per chiplet). In the 48-core model, 1 core from each 4-core group is disabled, so you have 3 cores sharing the same quantity of L3. So you now have 25% more L3 cache per core. You also have 25% more per-core memory and PCIe bandwidth.

If your workload is cache- or memory-bandwidth-sensitive, you might recover some performance despite having 25% fewer cores. You can probably also run fewer cores at a higher sustained clock speed. This may reduce a 25% deficit to something more modest like 5-10%, at which point the 64-core parts are harder to justify.


Not to mention that web workloads are frequently memory-bandwidth-sensitive. I remember Google published a paper where they measured CPU usage in production environments and at least one of their real-world applications spent like 30% of its time in memcpy/strcpy. (The paper examined ways to optimize those copies by carefully applying non-temporal hints in the event that the destination buffer wasn't going to be used for a while).

Given that, having more memory bandwidth per-core seems like it could easily improve CF's performance a lot.


33% more per core?


Ah right, I'm bad at math.
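
For anyone checking the arithmetic, here's a quick sketch of the per-core figure (assuming the 256MB of L3 both SKUs ship with):

```python
# Quick check of the per-core L3 math for the 64- vs 48-core Rome parts.
# Both SKUs carry the full 256 MB of L3 (8 chiplets x 32 MB).
l3_total_mb = 256

l3_per_core_64 = l3_total_mb / 64  # 4.0 MB per core
l3_per_core_48 = l3_total_mb / 48  # ~5.33 MB per core

gain = l3_per_core_48 / l3_per_core_64 - 1
print(f"{gain:.0%} more L3 per core on the 48-core part")  # 33%, not 25%
```

(Disabling 1 of 4 cores cuts core count by 25%, but the survivors' share of cache grows by 1/3.)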


Does anyone know, is the 48-core SKU just a lower-binned version of the 64-core SKU, i.e. where one or more cores on each chiplet have flaws, and so AMD decides to build parts entirely out of e.g. "7 out of 8 cores enabled" chiplets to create the lower SKUs? (I know the chiplet-oriented manufacturing process makes it cheaper than ever before to get chips right the first time, but that doesn't mean that they don't have flawed chiplets sitting around that they could intentionally make use of.)

If so, maybe AMD doesn’t have high-enough yield on their 64-core part (i.e. 8-core chiplet sub-part) to satisfy huge bulk orders for them, without also generating huge numbers of the 48-core-binned SKU (i.e. 6-core chiplets, really 6-out-of-8-enabled-core chiplets) in the process.

And I would suspect that their production process is such that they do have a real, explicit 6-core chiplet part as well, which can be mixed-and-matched within a single CPU with the flawed, re-binned 6-of-8-core chiplets, giving them a powerful hedge on their own logistics (in about the same way that SPAM has flexibility in their ratio of chicken to ham that lets them ride out turbulence in either market, making the end-product cheaper than either input), but requiring even further that people consume the SKUs containing “6”-core chiplets.

I would bet that AMD very much wants to sell large buyers the lower-core-count CPUs, since their yield guarantees that—at least for now—they have so very much more of them, and attempting to make more of the highest-end part ensures that they end up with even more of the less-than-highest-end chiplets laying around.

AMD probably ideally wants order-flow of CPUs in a ratio, e.g. “1x 7742 : 8x 7642”, and offers both better deals monetarily, and far faster delivery (/less contention on orders with other clients) when you take them up on it; or when you buy huge numbers of 7642s alone, such that you’re consuming the cast-off from bullheaded clients who wanted pure 7742s.


It's binned. Everything public indicates they do not manufacture a 6-core chiplet and are turning stuff off on 8-core ones.

Curiously, TSMC seemingly published their N7 defect densities and they're low enough that most chiplets would not have outright dead cores. Specifically, they said 0.09 defects per square cm in a slide you can see at https://fuse.wikichip.org/news/2879/tsmc-5-nanometer-update/ . If that's saying what it appears to, lower SKUs must use a lot of chiplets where all the cores turn on, but (say) might not hit the top chip's performance spec within its TDP.

The 7742 needs all cores to run at 2.25GHz averaging ~3.1W apiece if you leave 25W of the 225W for the I/O die. The 7642 is looser: 2.3GHz averaging ~4.2W apiece, and that's after dropping the "worst" core from any CCX where they all work. (For non-obsessives, a CCX is a four-core group connected to a 16MB chunk of L3 cache.)
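
Those per-core watt figures fall out of simple division (a back-of-envelope sketch; the ~25W I/O-die figure is the rough assumption from above, not an AMD spec):

```python
# Back-of-envelope per-core power budgets for the two SKUs,
# assuming ~25 W of the 225 W TDP goes to the I/O die (rough guess).
tdp_w = 225
io_die_w = 25
budget_w = tdp_w - io_die_w  # ~200 W left for the compute chiplets

for sku, cores in (("EPYC 7742", 64), ("EPYC 7642", 48)):
    print(f"{sku}: ~{budget_w / cores:.2f} W per core")
# ~3.12 W per core for the 7742, ~4.17 W for the 7642:
# roughly a third more thermal headroom per core on the 48-core part.
```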

Note lower SKUs like the 3700X/3800X and 7232P use 8-core chiplets. You can figure the chiplet count for an SKU by dividing its L3 capacity by 32MB, and from there you can figure how many cores each chiplet has enabled.
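
That rule of thumb can be sketched like so (core and L3 figures are the public spec-sheet numbers; treat the divide-by-32MB rule as a heuristic, per the comment above):

```python
# Heuristic from above: chiplet count = total L3 / 32 MB,
# then enabled cores per chiplet = total cores / chiplet count.
L3_PER_CHIPLET_MB = 32  # a full Zen 2 compute die (CCD)

skus = {
    # name: (total cores, total L3 in MB)
    "EPYC 7742":   (64, 256),
    "EPYC 7642":   (48, 256),
    "Ryzen 3700X": (8, 32),
}

for name, (cores, l3_mb) in skus.items():
    chiplets = l3_mb // L3_PER_CHIPLET_MB
    print(f"{name}: {chiplets} chiplet(s), "
          f"{cores // chiplets} core(s) enabled per chiplet")
```

So the 7742 works out to 8 chiplets with all 8 cores on, the 7642 to 8 chiplets with 6 of 8 on, and the 3700X to a single fully-enabled chiplet.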

There's also plain market segmentation, i.e. enabling/disabling stuff on identical chips to sell at different prices. In this gen I doubt it's good strategy for AMD to hold back much performance like that, though, since they really want to get some market share right now.

(If turned-off cores generally work but below spec, that suggests there could be some way to make them useful for extremely-threaded workloads. Split hardware threads onto two lower-clocked physical cores when it looks like a net win, say. Can see enough potential thorns not to bother trying, but thinking of possibly-useful silicon sitting there turned off makes it just so tempting, heh.)


Yes, it is binned: 8 chiplets with 6 cores (out of 8) enabled on each. The 64-core part is 8 x 8.


"I can't imagine price would be a major factor for a company of Cloudflare's size, but it's definitely true that the 48-core part is much cheaper."

I have worked at companies with 5 employees struggling day to day to keep afloat, highly profitable companies with employee counts in the 6 figures and multiple billions in revenue and just about everything in between. I have never worked at one where cost was not a major factor.

In Cloudflare's case, revenues for 2019 were $287 million and they had a net loss of $105.8 million. They are competing with the market leader, Akamai, whose 2019 revenues were 10 times theirs and whose profit was a few hundred million dollars, so I don't suspect Cloudflare is an exception to the rule of cost being a major factor.


It's likely the L3-cache-to-core ratio: both the 64- and 48-core parts have the same 256MB of L3.


Yeah, I suspect it might be a memory-throughput-per-core issue, or one of the oldest reasons in the book: they got a better deal on 48 cores since not all chiplet cores need to be operational...



