Hacker News

From the perspective of an outsider, I can't see how a company like this could survive. On the one hand they claim to have done something really amazing and are at the stage where they're looking for customers. Normally you'd expect them to be touting performance figures to win those customers. Instead, they've decided to keep the performance secret. And they've managed to find some "expert" who says this is normal.

Does anyone here have expertise in this area? Is this the model for a successful company in this area?



As someone who works for another startup in this area, building the chip is only half the battle. The other half is tooling for compiling benchmark networks onto the chip in a performant manner. With 400k cores and their 'duplicate and re-route' defect strategy, this might literally be the most challenging compilation target ever made. It probably stacks up absolutely terribly in every metric right now. That's not to say it will necessarily get better, most of the people I've talked to don't think the megachip will ultimately amount to much more than a clever marketing ploy.


A bit baffled by this because on every axis I look this seems like a dream of a compilation target.

* No DRAM or caches, everything is in SRAM, and all local SRAM loads are 1 cycle.

* Model parallel alone is full performance, no need for data parallel if you size to fit.

* Defects are handled in hardware; any latency differences are hidden & not in load path anyway.

* Fully asynchronous/dataflow by default, only need minimal synchronization between forward/backward passes.

I genuinely don't know how you'd build a simpler system than this.
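For intuition, the bullet points above can be sketched as queue-driven dataflow: each core fires whenever input is waiting, works only on local data, and streams results to the next core. A toy Python illustration with invented names, not Cerebras' actual programming model:

```python
from collections import deque

class Core:
    """Toy 'core': one local op, one input queue, one output target."""
    def __init__(self, op, out):
        self.op = op
        self.inbox = deque()
        self.out = out  # the next core's inbox, or a plain result list

    def step(self):
        # Pure dataflow: fire whenever input is available, with no
        # clocked synchronization against any other core.
        while self.inbox:
            self.out.append(self.op(self.inbox.popleft()))

results = []
second = Core(lambda x: x + 1, results)      # downstream core
first = Core(lambda x: 2 * x, second.inbox)  # upstream core
first.inbox.extend([1, 2, 3])
first.step()
second.step()
# results is now [3, 5, 7]
```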


Having worked on compilers for pretty weird architectures, it's generally the case that the less your architecture looks like a regular CPU, the harder it is to compile for.

In particular, when you change the system from having to worry about how to optimally schedule a single state machine to having to place operations on a fixed routing grid (à la FPGA), the problem becomes radically different, and any looping control flow becomes an absolute nail-biter of an issue.


Remember that you aren't compiling arbitrary programs. Neural nets don't really have any local looping control flow, in the sense that data goes in one end and comes out the other. You'll have large-scale loops over the whole network, and each core might have a loop over small, local arrays of data, but you shouldn't have any sort of internal looping involving different parts of the model.


It's pretty common to have neural networks that combine recurrent nets processing text input with convolutional layers. A classic example would be visual question answering (is there a duck in this picture?). That's a simple example involving looping over one part of the model. Ideally you want that looping to be done as locally as possible, to avoid wasting time having a program on a CPU dispatching, waiting for results, and controlling data flow.

Having talked to someone at Cerebras, I also know that they don't just want to do inference with this; they want to accelerate training as well. That can involve much more complex control flow than you'd think. Start reading about automatic differentiation and you'll soon realize it's complex enough to basically be its own subfield of compiler design. Multiple entire books have been written on the topic, and I can guarantee there are control-flow-driven optimizations in there (e.g., if x == 0 then don't compute this large subgraph).
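To make the control-flow point concrete, here's a toy reverse-mode autodiff sketch (minimal and invented; not how any real AD framework or Cerebras' stack is built) where the backward pass skips a subgraph whose multiplier is exactly zero:

```python
class Node:
    """Toy value in a computation graph, tracking parents and a gradient."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents
        self.backward_fn = None  # propagates this node's grad to parents
        self.grad = 0.0

def mul(a, b):
    out = Node(a.value * b.value, parents=(a, b))
    def backward(grad):
        # Control-flow-driven optimization: skip gradient work for a
        # subgraph whose contribution is scaled by an exact zero.
        if b.value != 0.0:
            a.grad += grad * b.value
        if a.value != 0.0:
            b.grad += grad * a.value
    out.backward_fn = backward
    return out

def backprop(out):
    out.grad = 1.0
    stack = [out]  # naive traversal; fine for this tiny graph
    while stack:
        node = stack.pop()
        if node.backward_fn is not None:
            node.backward_fn(node.grad)
        stack.extend(node.parents)

x = Node(0.0)  # exact zero: no gradient flows into y's subgraph
y = Node(3.0)
z = mul(x, y)
backprop(z)
# x.grad == 3.0, while y.grad stays 0.0 (skipped)
```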


I would be surprised if Cerebras was trying to handle any recurrence inside the overall forward/backward passes. It seems like a lot of difficulty (as mentioned) for peanuts.

I don't get your point about training. Yes, it's backwards rather than forwards, and yes it often has fancy stuff intermixed (dropout, Adam, ...), but these are CPUs, they can do that as long as it fits the memory model.


I'm afraid recursivecaveat is right. This is an insanely difficult compilation target. I think you're possibly talking about a different kind of "compilation" - i.e. the Clang/GCC bit that converts C++ to machine code. That is indeed trivial. But "compilation" for these chips includes much more than that.

The really complicated bit is converting the TensorFlow model into some kind of computation plan. Where do you put all the tensor data? How do you move it around the chip? It's insanely complicated. If anything kills Cerebras, it will be the software.


It's model parallel, so the first thing you do is lay out your floorplan for the model, which looks like this.

https://secureservercdn.net/198.12.145.239/a7b.fcb.myftpuplo...

Then you put your data next to the core that uses it. Simples.

(Optimal placement is tricky, but approximate techniques work fine.)
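As a concrete (and heavily simplified) sketch of "approximate techniques work fine": greedy swap-based placement of a few layers on a tiny core grid, minimizing the Manhattan distance between communicating layers. The layer names, grid size, and cost model are all invented for illustration:

```python
import random

random.seed(0)  # deterministic for the example

layers = ["conv1", "conv2", "conv3", "fc"]
edges = [("conv1", "conv2"), ("conv2", "conv3"), ("conv3", "fc")]
slots = [(x, y) for x in range(2) for y in range(2)]  # 2x2 core grid

placement = dict(zip(layers, slots))  # arbitrary initial layout

def cost(p):
    # Total Manhattan distance between layers that exchange data.
    return sum(abs(p[a][0] - p[b][0]) + abs(p[a][1] - p[b][1])
               for a, b in edges)

# Random pairwise swaps; keep a swap only if it doesn't make things worse.
for _ in range(200):
    a, b = random.sample(layers, 2)
    old = cost(placement)
    placement[a], placement[b] = placement[b], placement[a]
    if cost(placement) > old:
        placement[a], placement[b] = placement[b], placement[a]  # revert
```

The swap loop never increases the cost, so the result is at least as good as the initial layout; real placers use fancier moves, but the shape of the problem is the same.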


When you consider the things that diagram doesn't show, it doesn't look simple at all. Does that graph even include training? It'll have to be pipelined too, and will probably need recomputation given the shortage of memory. What about within the boxes? You can't nicely separate a matmul into pieces like that.

I work on something similar but less ambitious, trust me it is crazy complicated.


Could you be more explicit? What about the naïve approach to training (same graph but backwards, computing gradients) is going to fail?

Wrt. matmul, if you couldn't split them up, today's AI accelerators wouldn't work full stop. But regardless, even if it was much more complex on CS-1 than on all the other sea-of-multipliers accelerators, it's obviously a problem they've solved and so irrelevant to the compilation issue.
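For what it's worth, the standard way sea-of-multipliers accelerators split a matmul is into independent output tiles: block (i, j) of C needs only row-block i of A and column-block j of B, so each tile can sit next to its own core. A pure-Python toy (shapes and block size are arbitrary):

```python
def matmul_blocked(A, B, bs):
    """Blocked matmul: each (i0, j0) output tile is an independent
    unit of work that could be mapped to its own core."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i0 in range(0, n, bs):
        for j0 in range(0, m, bs):
            for k0 in range(0, k, bs):  # accumulate partial sums per tile
                for i in range(i0, min(i0 + bs, n)):
                    for j in range(j0, min(j0 + bs, m)):
                        C[i][j] += sum(A[i][kk] * B[kk][j]
                                       for kk in range(k0, min(k0 + bs, k)))
    return C

A = [[i + j for j in range(5)] for i in range(4)]        # 4x5
B = [[(i * j) % 4 for j in range(3)] for i in range(5)]  # 5x3
C = matmul_blocked(A, B, bs=2)
```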


It's not like there is one SRAM, there are many SRAMs, so you get the same problem as NUMA but a thousand fold. Some computations you can map to a regular grid/hypercube/whatever quite easily, but it is unclear what the interconnect between the PEs is here, or what this thing has for a NOC or NOCs, how routing is handled, etc., and further complicating the issue is compensating for any damaged PEs or damaged routes.


No, you don't have all the issues with traditional NUMA because you aren't doing the same sort of heterogeneous workloads. You're always working on local data, and streaming your outputs to the next layer. This isn't a request-response architecture; such a thing wouldn't scale.


It is more or less the same, it's just that in NUMA you have a limited number of localities, except here it is in the thousands. The issue is one of scheduling that locality. Some process still needs to determine what data is actually local and where it should "flow". Because it can't all fit in one place, the computation needs to be tiled (potentially in multiple ways) and the tiles need to be scheduled to move around in an efficient manner.


Is it not the case that the defect identification and rerouting happen at the hardware level, in a QA phase post-production? If not, I'm even less bullish on Cerebras.


Yes, that's what their web site says.


With 400k cores and their 'duplicate and re-route' defect strategy, this might literally be the most challenging compilation target ever made.

While I'd be generally skeptical, it seems like the compilation for the rerouting could be done at a single low level, below whatever their assembler is, so the chip could just look like a regular array of cores. A single table that translates from i to the ith "real" core, plus similar structures, seems like it could be enough.
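A minimal sketch of that indirection (the defect list and array size are invented): one table, built once after post-production test, mapping each logical core index to the i-th working physical core, so everything above it sees a dense array:

```python
def build_core_map(num_physical, defective):
    """Logical core i -> i-th non-defective physical core."""
    return [p for p in range(num_physical) if p not in defective]

# Pretend QA found physical cores 2 and 7 are bad on a 10-core array.
core_map = build_core_map(10, defective={2, 7})
# Logical core 2 now resolves to physical core 3, and so on.
```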

Edit: I mean, if they're smart, it seems like they'd make the thing look as much as possible like a generic GPU capable of OpenCL. I have no idea if they'll do that, but since they have the scale, they wouldn't have to sell their stuff via an otherwise fully custom approach.


They have customers already, one (Argonne National Labs) is given explicitly.

The issue with using ‘industry standard’ benchmarks is that it's like measuring a bus' efficiency by shuttling around a single person at a time. The CS-1 is just bigger than that; the workloads that it provides the most value on are ones that are sized to fit, and specifically built for the device.

This does make it hard to evaluate as outsiders (certainly for similar reasons I never liked Graphcore), but I don't think it means anything as grim as you say. The recipe fits.


They could always release figures for larger networks; they don't have to target ResNet-50 (the MLPerf standard). I don't think anyone would hold it against them if they showed massive improvements in something like GPT-2 training time (a vastly larger network than ResNet-50).


GPT-2 uses attention, which is very memory hungry to train, so probably won't work well. But I agree with your overall point.


That sounds like horseshit to me. Very large public datasets and models are available to test training on a chip or system of any size. ImageNet is large enough for this, but if that's not sufficient, OpenImages is also available.

To me as a practitioner a meaningful metric would be "it trains an ImageNet classifier to e.g. 80% top1 in a minute". If it's not suitable for CNNs, do BERT or something else non-convolutional. Even better if I can replicate this result in a public cloud somewhere. They know this, and yet all we have is a single mention of a customer under an NDA and no public benchmarks of any kind, let alone any verifiable ones. If it did excel at those, we'd already know.


> Cerebras hasn’t released MLPerf results or any other independently verifiable apples-to-apples comparisons. Instead the company prefers to let customers try out the CS-1 using their own neural networks and data.

> This approach is not unusual, according to analysts. “Everybody runs their own models that they developed for their own business,” says Karl Freund, an AI analyst at Moor Insights. “That’s the only thing that matters to buyers.”

Sounds like instead of benchmarks, prospective customers get a chance to run a workload of their choice on the machine before purchase. Assuming support is good, that's way better than looking at benchmarks, because you're guaranteed that the performance you're comparing is for workloads you care about.


The appropriately large models with public recognition I know of use attention, which is too memory-hungry to work effectively on the CS-1. The datasets aren't the issue.

I'm fine with skepticism. It's certainly plausible that they don't actually do all that well.


There are probably only a few hundred prospective customers. (Some may buy several units). Each unit will cost millions. They can discuss the expected workloads/performance with each prospective customer individually.


Keeping the performance figures a secret is a red flag on the level of "run, don't walk, away from this company".

At best their solution is on par with GPUs in a performance per watt/dollar sense. At worst they're scammers looking for a sucker.


I'm also very curious about their performance per watt / dollar for the standard ML datasets out there (facial recognition etc). We have a reasonable sense of both training time and runtime for these in the cloud (and it is falling FAST).


I got a demo of this two years ago, and honestly I don't think it matters that they aren't sharing these numbers. Any company that is going to consider this is going to want to benchmark it on their own models and systems, and as long as Cerebras allows that they aren't going to have trouble finding customers (assuming their claims line up with reality).

Even if that doesn't work out, most of the people on these teams have previously built companies that were acquired by AMD or another chip maker.


Mass-market customers are just going to skip it without benchmarks.

Although, at this stage, Cerebras doesn't care about the mass market yet.



