Hacker News

Itanium was a bet on compilers being able to extract parallelism from source code. Maybe JIT optimization, where code is rewritten on the fly in much the same way a reorder buffer reorders it, would have done the trick, but, in the end, Itanium fell short of its performance promises. I'm not sure compilers alone can do it.


The trouble with Itanium was that memory latency is the limiting factor. A CPU knows what is in the cache, but the compiler doesn't. In fact, a major use of parallelism in CPUs is hiding latency, and that requires more scheduling flexibility than bundling groups of instructions together. (Think of 3 instructions hanging because 1 of them needs to load something, as opposed to a system that can barrel on and run a few non-dependent instructions.)


> A CPU knows what is in the cache but the compiler doesn’t.

You could kind of sidestep that by inserting preload instructions ahead of the instructions that use the values, but I agree that doing it at runtime would probably work better. Maybe a JIT runtime could help, but that is still beyond the control of the compiler.


https://www.researchgate.net/publication/3044999_Beating_in-...

This one was never realized AFAIK, but it addresses a lot of the latency issues the IPF designs had due to their relatively static nature, while still not taking on the massive cost of reservation stations. I think being open will be a major requirement going forward. It is hard to build trust in a black box, and RISC-V is about to become a real alternative. Being modular, open, and growing up from smaller systems seems like a winning strategy to me.


I can't wait to have a completely Windows-proof desktop again (even though my last SPARC still has an x86 coprocessor board, and my IBM PPC machine had a Windows NT port available for it).


I did some research on

https://en.wikipedia.org/wiki/Transport_triggered_architectu...

which is an approach to CPU design that puts specialized CPU design within the reach of quite a few people (say, on an FPGA), but its main weakness is that it is not smart at all about fetching. Although it is not hard at all to make that kind of CPU N-wide, you are certainly going to have all N lanes wait on a fetch.

Seems to me, though, that that kind of system could be built with a custom fetcher that would let you work around some of those challenges.


Another major issue was the big-iron shared-memory systems that Itanium targeted. They collectively fell out of favor with the trend towards virtualized x64 commodity clusters and, later, the cloud. I'd love to see an EPIC strategy for a CPU targeted at smaller systems. Perhaps implementing WebAssembly-based actors in hardware, with support for messaging and pmem. Wasmcloud in hardware.


The way I see it, disaggregation will be a thing, and we may be back to large shared-memory racks in cloud datacenters. While an individual cloud machine can range from small to very large (effectively one fully dedicated host), a rack-sized server enables much larger (and more profitable) offerings. Scaling up has limits, but it's very comfortable to have the option when needed.




