Very interesting; of course I expect there are many refinements to be made. As a physicist my first reaction is always to mash numbers together based on dimension. If I can get within one or two orders of magnitude in a problem I know nothing about, I'm pretty happy ;)
I agree completely with the thought process, but I think there might be better 'dimensions' to use. The one I'd recommend here is based on Little's Law, which gives a limit on throughput based on the number and latency of outstanding requests: Concurrency = Latency * Bandwidth.
It turns out that each core can only have 10 memory requests outstanding (line fill buffers). Since these requests are ultimately served from RAM, each one has a latency of about 100 cycles. Since each request is for a 64B cache line, this gives us a maximum throughput of about:
Bandwidth = 10 * 64B in flight / 100 cycles
Bandwidth = about 6B per cycle
At a 3.5 GHz clock frequency, this suggests a hard cap of about 3.5 billion cycles/s * 6 bytes/cycle = 21 GB/s, which is mighty close to the actual limit! The numbers are fudged a little because it's not actually a flat 100-cycle latency to RAM, but I think this limit is more relevant and indicative here than the instruction count.
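For anyone who wants to play with the numbers, the back-of-the-envelope arithmetic above is just a few lines (the exact division gives ~6.4 B/cycle rather than the rounded 6, so ~22 GB/s rather than 21 — the 10 buffers, 64B lines, 100 cycles, and 3.5 GHz are the same figures used above):

```python
# Little's Law bound on single-core memory bandwidth:
# Bandwidth = Concurrency / Latency
outstanding = 10        # line fill buffers per core
line_bytes = 64         # bytes per cache-line request
latency_cycles = 100    # approximate round-trip latency to RAM
clock_hz = 3.5e9        # core clock frequency

bytes_per_cycle = outstanding * line_bytes / latency_cycles
bandwidth_gb_s = bytes_per_cycle * clock_hz / 1e9

print(f"{bytes_per_cycle:.1f} B/cycle")   # 6.4 B/cycle
print(f"{bandwidth_gb_s:.1f} GB/s")       # 22.4 GB/s
```

Swapping in a more realistic (higher) RAM latency only lowers the bound, which is why prefetching and more outstanding requests matter so much.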