
I'm really curious about the benefits of their implementation. It's beyond my grasp to make any serious criticism, and I don't really want to doubt them; it just seems like a pretty radical departure even from the direction the rest of the industry's innovation is heading.

The way they paint it, it sounds like they're putting in redundant cores to account for failures in what I'd call the 'first line' cores, i.e. there are cores that are only used if some primary ones aren't working?

But intuitively that doesn't make a whole lot of sense given how parallel the design is. Maybe they're just putting in ~101% of the specified core count, and if there's a roughly uniform ~1% core failure rate then it's all gucci?
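Back-of-the-envelope version of that intuition (purely illustrative numbers, not Cerebras's actual defect model): if core defects were independent and roughly uniform, the spare pool only has to cover the expected defect count plus a few standard deviations, which is why a ~1.5% reserve can go a long way.

    # Toy yield math, assuming independent per-core defects (real wafer
    # defects cluster, so this is optimistic). All figures are made up.
    from math import sqrt, erf

    total_cores = 400_000      # hypothetical wafer-scale core count
    spare_fraction = 0.015     # ~1.5% held aside, per the CEO quote
    defect_rate = 0.01         # assumed per-core fatal defect probability

    spares = total_cores * spare_fraction
    mean = total_cores * defect_rate
    sigma = sqrt(total_cores * defect_rate * (1 - defect_rate))

    # Normal approximation to the binomial: P(dead cores <= spares)
    z = (spares - mean) / sigma
    print("P(spares cover all defects) ~", 0.5 * (1 + erf(z / sqrt(2))))

With those made-up numbers the spare pool sits roughly 30 sigma above the expected defect count, so uniform random defects basically never exhaust it; defect clustering is what would break this simple picture.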

I guess my question is probably similar to yours: what do you give up with yield-enhancing redundancy on a behemoth die vs. integrating a bunch of confirmed-working chiplets together?



The CEO says 1-1.5%.

"Cerebras approached the problem using redundancy by adding extra cores throughout the chip that would be used as backup in the event that an error appeared in that core’s neighborhood on the wafer. “You have to hold only 1%, 1.5% of these guys aside,” Feldman explained to me. Leaving extra cores allows the chip to essentially self-heal, routing around the lithography error and making a whole-wafer silicon chip viable."

https://techcrunch.com/2019/08/19/the-five-technical-challen...
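To make the "route around the defect" idea concrete, here's a toy remap (my own sketch, not their actual mechanism, which presumably happens in the hardware routing fabric): give each row a couple of spare cores and map logical core i to the i-th physical core in that row that passed test.

    # Toy defect remapping: logical core i in a row maps to the i-th
    # physical core that tested good. Purely illustrative, not how
    # Cerebras actually implements it.
    def build_row_map(physical_per_row, logical_per_row, defective):
        good = [c for c in range(physical_per_row) if c not in defective]
        if len(good) < logical_per_row:
            raise ValueError("not enough spares in this row")
        return {i: good[i] for i in range(logical_per_row)}

    # Row with 102 physical cores, 100 logical cores, two known defects:
    remap = build_row_map(102, 100, defective={7, 55})
    print(remap[6], remap[7], remap[55])   # -> 6 8 57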


The article claims that keeping everything on one die raises interconnect bandwidth and lowers latency compared to what's possible in a conventional supercomputing setup. Connections are made across the scribe lines, the silicon that is normally left aside for cutting the chips apart. Apparently that required a special process that they had to develop with a partner to get working.



