Very interesting; of course I expect there are many refinements to be made. As a physicist my first reaction is always to mash numbers together based on dimension. If I can get within one or two orders of magnitude in a problem I know nothing about, I'm pretty happy ;)
I agree completely with the thought process, but I think there might be better 'dimensions' to use. The one I'd recommend here is based on Little's Law, which gives a limit on throughput based on the number and latency of outstanding requests: Concurrency = Latency * Bandwidth.
It turns out that each core can only have 10 memory requests outstanding (line fill buffers). Since these requests are ultimately served from RAM, each one has a latency of about 100 cycles. Since each request is for a 64B cache line, this gives us a maximum throughput of about:
Bandwidth = 10 * 64B in flight / 100 cycles
Bandwidth = about 6B per cycle
At a 3.5 GHz clock frequency, this suggests a hard cap of about 3.5 billion cycles/s * 6 bytes/cycle = 21 GB/s, which is mighty close to the actual limit! The numbers are fudged a little because it's not actually a flat 100-cycle latency to RAM, but I think this limit is more relevant and indicative here than the instruction count.
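For anyone who wants to play with the numbers, the back-of-the-envelope arithmetic above is just a few lines (the exact division gives ~6.4 B/cycle rather than the rounded 6, so ~22 GB/s rather than 21 — the 10 buffers, 64B lines, 100 cycles, and 3.5 GHz are the same figures used above):

```python
# Little's Law bound on single-core memory bandwidth:
# Bandwidth = Concurrency / Latency
outstanding = 10        # line fill buffers per core
line_bytes = 64         # bytes per cache-line request
latency_cycles = 100    # approximate round-trip latency to RAM
clock_hz = 3.5e9        # core clock frequency

bytes_per_cycle = outstanding * line_bytes / latency_cycles
bandwidth_gb_s = bytes_per_cycle * clock_hz / 1e9

print(f"{bytes_per_cycle:.1f} B/cycle")   # 6.4 B/cycle
print(f"{bandwidth_gb_s:.1f} GB/s")       # 22.4 GB/s
```

Swapping in a more realistic (higher) RAM latency only lowers the bound, which is why prefetching and more outstanding requests matter so much.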