The bottleneck hasn't been arithmetic for a long time; it's data movement. Arithmetic is practically free nowadays. See the presentation "No exascale for you!" by Horst Simon (Deputy Director of Lawrence Berkeley National Laboratory) [0].
Moving a single data word 5 mm on-chip costs more energy than a single FLOP (20 pico-Joules/bit). 5 mm ≈ the distance to the L2 cache or to another CPU core. Moving data off-chip (3D chip and/or RAM) costs orders of magnitude more; see the graph in the presentation.
[0] http://iwcse.phys.ntu.edu.tw/plenary/HorstSimon_IWCSE2013.pd...