Wow, you sent me down a serious rabbit hole. That core looks awesome, and the ring bus blew my mind [1].
What is the practical difference between a very wide SIMD processor and very many single-instruction processors? Is there any? They say this processor is composed of 16 "slices", each 256B SIMD + 1 MiB cache. Is that different at all from having 16 256B processors?
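A huge SIMD core should have a die area advantage (which in turn gives it other advantages) over a bunch of little processors of the same total "width" executing instructions independently. There is also less overhead, since instruction fetch, decode, and control logic are shared across all the lanes instead of being duplicated per processor.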
In exchange, it can't process different instruction streams simultaneously. It has to perform the same operations over a giant chunk of data, or otherwise "waste" the huge SIMD width.
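To put rough numbers on the "waste", here is a toy cost model in Python. It is only a sketch under my own assumptions, not a model of NCore: `simd_cost`, `independent_cost`, and the 10-step branch costs are all made up for illustration. Each lane needs either branch "A" or branch "B"; the wide SIMD unit has to step through every branch that any lane needs (with the other lanes masked off), while independent processors each run only their own branch.

```python
# Toy cost model (hypothetical, not NCore): compare one wide SIMD unit
# against independent processors on a workload where lanes diverge.

def simd_cost(branch_per_lane, cost_a, cost_b):
    """One instruction stream for all lanes: if any lane needs a branch,
    every lane steps through it, masked off where it is not wanted."""
    lanes = len(branch_per_lane)
    steps = 0
    if any(b == "A" for b in branch_per_lane):
        steps += cost_a
    if any(b == "B" for b in branch_per_lane):
        steps += cost_b
    useful = sum(cost_a if b == "A" else cost_b for b in branch_per_lane)
    return steps, useful / (steps * lanes)  # (time, lane utilization)

def independent_cost(branch_per_lane, cost_a, cost_b):
    """One instruction stream per lane: each runs only its own branch,
    and the whole job finishes when the slowest one does."""
    lanes = len(branch_per_lane)
    per_lane = [cost_a if b == "A" else cost_b for b in branch_per_lane]
    steps = max(per_lane)
    return steps, sum(per_lane) / (steps * lanes)

uniform = ["A"] * 16        # every lane does the same thing
mixed = ["A", "B"] * 8      # lanes alternate between two branches

print(simd_cost(uniform, 10, 10))        # (10, 1.0)
print(simd_cost(mixed, 10, 10))          # (20, 0.5)
print(independent_cost(mixed, 10, 10))   # (10, 1.0)
```

On uniform work the two designs tie, so the SIMD unit's area savings are pure win. Once the lanes diverge, the SIMD unit takes twice as long with half its lanes idle at any moment, which is the trade described above.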
Another highlight of this thing:
> There is also extensive support for predication with 8 predication registers. The unit is optimized for 8-bit integers (9-bit calculations)
From everything I read, NCore would have been a low-price LLM monster. Centaur would probably be alive and selling them like hotcakes if they came out with it now, instead of back then.
[1] https://en.wikichip.org/wiki/centaur/microarchitectures/cha#...