NVIDIA Develops NVLink Switch: NVSwitch, 18 Ports For DGX-2 (anandtech.com)
118 points by jsheard on March 27, 2018 | 80 comments


I'll just put the follow-up articles here rather than spamming tons of similar stories:

More DGX-2 information - https://www.anandtech.com/show/12587/nvidias-dgx2-sixteen-v1...

Quadro GV100 announced - https://www.anandtech.com/show/12579/big-volta-comes-to-quad...

Tesla V100 memory bumped to 32GB - https://www.anandtech.com/show/12576/nvidia-bumps-all-tesla-...


Five and a half years ago, one DGX-2 would be in the top 10 supercomputers in the world[1], and you'll probably be able to rent one on EC2 for under twenty bucks an hour before the year's out. You can already get the DGX-1 for under ten bucks an hour right now.

[1] 1920 teraflops of 4x4+4 matrix multiply/add, and see https://www.top500.org/lists/2012/11/


No it would not. You are comparing FP16 to FP64 performance.


It falls around 220th in the November 2012 list, and would make it onto the top 10 for the November 2007 list (comparing Rpeak) and the June 2008 list (comparing Rmax).

Systems of comparable performance in the top 10 used between 300-500 kW (30-50x the DGX2).

Red Storm, which has a listed Rpeak of 127 TFLOPs and placed sixth in the November 2007 TOP500, was "relatively inexpensive," costing only in the ballpark of $75M (~180x the price of the DGX2).
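A quick back-of-envelope, assuming the commonly cited ~7.8 TFLOPS FP64 peak per Tesla V100, supports this comparison:

```python
# Rough FP64 check of the comparison above, assuming ~7.8 TFLOPS FP64
# peak per Tesla V100 and Red Storm's listed Rpeak of 127 TFLOPS.
v100_fp64_tflops = 7.8
dgx2_fp64_tflops = 16 * v100_fp64_tflops  # 16 GPUs in a DGX-2
red_storm_rpeak_tflops = 127

print(dgx2_fp64_tflops)  # ~124.8 TFLOPS, right around Red Storm's Rpeak
```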


What are the differences between the Tesla V100 and the Quadro GV100? Mostly just the DisplayPort? So is the GV100 more expensive than the V100?

Have they published a price for the GV100?


There is no published price and the difference is the new Quadro supports Nvidia's OptiX ray-tracing API, as well as a very moderate performance boost over the Tesla V100 card.

While it is eye-popping, you would be correct in saying this is pretty much just a relaunch under the Quadro brand.


OptiX runs on all NVIDIA cards, not only Volta.

The difference is NVLINK.


Ah, I am unable to edit my comment. You are correct.


Does anyone know when graphics cards will be available at sane prices for people who actually want to use them one (or two max) at a time to render graphics?


The bad news is that a lot of system builders like us (we build pre-built systems pre-installed with Deep Learning software http://lambdalabs.com) are now used to paying more than MSRP for our GPUs. MSRP for a 1080Ti is supposed to be $699. They haven't been available even in large bulk purchases at that price for a while now. I don't see it going back down any time soon.


It's really hard to come up with a timeline, but I'd say the worst has passed. It will take time to recover inventory, but the alt-coin market (basically, not-Bitcoin cryptocurrencies) has died down a lot and the rush to acquire mining capital has likely diminished.

6 months to a year, maybe?


If Ethereum does move over to proof-of-stake, then the market will be flooded with 1080Tis. I expect the DeepLearning11 server will then be a goldmine for DL researchers - cheap as chips.


Can't wait. Fingers crossed.


Yeah, it should be getting better soon. Especially since ETH is now 1/3 what it was in Jan ($458.42 at the moment, $1350+ in Jan).

As long as the prices don't rally again!


Keep an eye on the used graphics card market. If we see the bubble burst / die down, the market should be flooded with excess high-end cards.


I am not sure I would pay too much for card that had been run flat-out for months in a rusting hangar with standing water, though.


Run flat out but undervolted because you get better electrical efficiency.

I'd take that.


But with a used card, you have no way of knowing if it was indeed run that way.


Let's hope the deep learning craze doesn't pick up pace.


I can't say when, but you might be happy to know that graphics card availability has been improving throughout March, albeit at expensive, though no longer insane, prices. I have been monitoring the availability of various AMD and Nvidia cards since January (using nowinstock.com): previously, cards would be available for a few hours at a time at an outlet; now we're up to weeks, and prices have been going down a bit, though they are still above MSRP. If the ongoing cryptocoin downward spiral persists for a few more months, GPU prices ought to come down.


Almost all of the Nvidia GTX cards are back in stock on Amazon (for Prime shipping) and are relatively close to MSRP. (Edit: You will still have to sort through overpriced ones)


Yeah, depends on what you mean by close. I see the cheapest 1080 Ti at Newegg right now for $909. I bought mine a year ago (Founders Edition) for $699. It is getting better, but we are still far away from sanity. IMHO this (year-old) card should sell today for ~$500 if things were normal.


I bought a GTX 1060 3GB in November 2016 for $165. The cheapest right now is $275. There is a long way to go for full recovery.


People are saying that prebuilt PCs include graphics cards at MSRP.

If you mean Volta at a sane price, they didn't announce it today so it may be a while.


You'll probably be able to pick up a 2080 for a reasonable price when it comes out, which should be soon.


Can someone explain the technology edge Nvidia has?

AMD, Intel etc. have not been able to compete in the high-performance GPU market, so Nvidia must have an edge. How big and sustainable is it?


They understood really early on that you need to invest in software just as much as in hardware, if not more.

Open standards tend to be a horse designed by committee: it can take years for them to evolve and reach any consensus, and they could never match the speed at which hardware can evolve and adapt to market requirements.

So NVIDIA essentially made their own software ecosystem which can be just as flexible as their hardware and more importantly it allows NVIDIA to be proactive rather than reactive.


They also realized non-CS researchers don't have the time, expertise, budget, or interest in writing optimized ML libraries.

And that if a graphics card company wrote easier-to-use / higher-level libraries, this would be a competitive advantage.


To repeat the other comments, right now cuDNN is the advantage - that is manifested as TensorFlow/Keras/PyTorch. AMD has ROCm, and in these benchmarks training CNNs was something like 10x or more slower than on P100s - https://www.pcper.com/reviews/Graphics-Cards/NVIDIA-TITAN-V-...

How sustainable is the advantage? Not that big: you don't need CUDA compatibility (which hipTensorFlow tried, in a classic shortcuts-don't-work way), just an alternative cuDNN for Vega that is integrated into the TensorFlow distributed binaries.


In other words Nvidia has no meaningful edge in hardware microarchitecture, patents and innovation. Is this correct?

If you remove the software and compare the raw cost/flops/bandwidth/memory/energy to others the difference is small?


Yeah, pretty much. They are claiming the future is specialization in compute. Tensor cores are an attempt at this - they should more accurately be called "4x4 matrix multiply in half precision with a full-precision add". That is the instruction they support. Unless you are training a very deep CNN with a lot of 4x4 kernels, they won't give you much of a speedup. You're left then with the memory-GPU bandwidth. For the 1080 Ti, that's 484 GB/s, and it's about 950 GB/s for the V100. So about twice the training performance for 10 times the price. Not a good deal, IMO. When you compare that with the top Vega cards, it's a complete rip-off - but Vega cards are currently useless for deep learning. I'm not convinced AMD have put the money in to build the libraries, so we're stuck with NVIDIA for a while yet (unless Intel by some miracle pulls one out with its new platform).
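A minimal numpy sketch (an illustration of the semantics, not NVIDIA's implementation) of the operation described above, with FP16 multiply inputs and FP32 accumulation:

```python
import numpy as np

# Sketch of one tensor-core-style operation: D = A @ B + C, where A and B
# are 4x4 FP16 matrices and the products are accumulated in FP32 into C.
def tensor_core_op(a_fp16, b_fp16, c_fp32):
    # Inputs are FP16, but the multiply-accumulate is carried out in FP32.
    return a_fp16.astype(np.float32) @ b_fp16.astype(np.float32) + c_fp32

a = np.ones((4, 4), dtype=np.float16)
b = np.ones((4, 4), dtype=np.float16)
c = np.zeros((4, 4), dtype=np.float32)
print(tensor_core_op(a, b, c))  # 4.0 everywhere: each entry sums four 1*1 products
```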


What about NVLink? I'm pretty sure there is no equivalent for AMD right? That would be a pretty big deal.


NVLink is good if you are doing distributed training and sharing large amounts of gradients (very large models). If you're doing parallel experiments or your models are not huge, then PCIe (16 GB/s) is generally good enough. AMD don't have anything close that I am aware of, but their new Ryzen board will have double the number of PCIe lanes, which makes PCIe competitive again - https://news.ycombinator.com/item?id=14450924
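A rough back-of-envelope for that tradeoff, using the nominal link figures from this thread and a made-up 100M-parameter model:

```python
# Time to move one full set of FP32 gradients across a link, ignoring
# latency and protocol overhead. 16 GB/s is the nominal PCIe 3.0 x16
# figure from the comment above; 300 GB/s is the V100 NVLink aggregate.
def transfer_seconds(num_params, bytes_per_param=4, link_gb_per_s=16):
    return num_params * bytes_per_param / (link_gb_per_s * 1e9)

params = 100e6  # hypothetical 100M-parameter model
pcie = transfer_seconds(params, link_gb_per_s=16)
nvlink = transfer_seconds(params, link_gb_per_s=300)
print(f"PCIe:   {pcie * 1000:.1f} ms")   # 25.0 ms
print(f"NVLink: {nvlink * 1000:.2f} ms") # 1.33 ms
```

For small models that few milliseconds per sync is negligible, which is why PCIe is "generally good enough" unless the model (and hence the gradient set) is very large.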


NVidia definitely does have an edge in GPU hardware. AMD Vega is roughly tied with NVidia Pascal in performance, but Vega consumes more power and came out two years later.


NVIDIA may have an edge in GPU hardware, but probably not in GPU hardware architecture. AMD is contractually required to use the GlobalFoundries process, which is particularly bad in the current generation but will be competitive in the next one. My understanding is that the process difference accounts for most of the edge NVIDIA has.


They were clever to understand that developers wanted the freedom to code GPGPU in C, C++, Fortran, plus any other language able to target their bytecode (PTX), instead of being bound to programming in crufty C.

Then they created nice numeric libraries and graphical debuggers for GPU programming.

Their new Volta GPUs were explicitly designed to be developed in C++ (there are a few talks about it).

When Khronos woke up to the idea that maybe they should support something other than C and invoking compilers and linkers at runtime, with OpenCL 2.0, most developers were already deeply invested in CUDA.


One note on this. Their graphical debuggers aren't user-friendly at all. And the Eclipse-based "Nsight" is a pile of garbage. I haven't used the version in Visual Studio, but if it isn't way, way better, then people might as well stick to vim and cuda-gdb... or just put printf statements everywhere like most of us do.

They really need to develop a purpose-built IDE or work on their integration way more.


Their graphical debugger is extremely user-friendly. I use nvvp all the time, and it's very easy to use.


True, but they are way better than whatever used to be available for OpenCL.

Have the OpenCL debuggers improved at all in the meantime?


They don't really have an edge today. They just achieved big lock-in, and inertia of those who depend on CUDA now prevents them from using other hardware.


What prevents other vendors from making CUDA compliant hardware? I thought the result of Oracle v. Google is that APIs can’t be copyrighted?


Even if you assume APIs aren't copyrightable, I don't think anyone cared to implement CUDA itself except for Nvidia. There is really no point in proliferating such APIs, since there are OpenCL / Vulkan already which are open to begin with.

What someone could try implementing though, is translating CUDA into OpenCL (if that's possible). That would be useful to break lock-in.


A CUDA-to-OpenCL translator is possible, and it actually already exists: https://github.com/hughperkins/coriander


You can implement your own cuda no problem: https://research.google.com/pubs/pub45226.html

or write your own assembly https://github.com/NervanaSystems/maxas

What you're missing is that there is a reason OpenCL didn't get traction, and the lack of tools translating CUDA to OpenCL is not one of them. OpenCL 2.0+ is a lot better than previous versions, but it is too little, too late.


> lack of tools translating CUDA to OpenCL is not one of them.

And why not? If you can translate between them, it will help support legacy CUDA software. And new software can use the new OpenCL to begin with.


CUDA software would only become "legacy" if the OpenCL ecosystem were better, and it isn't. Code translation from CUDA to OpenCL is a solution looking for a problem.


The ecosystem depends on CUDA; it doesn't care what it's translated to, no? So it would work with translation until it's properly rewritten to use open APIs. It's a solution for lock-in that limits your hardware choices, which is a problem. You don't need to look for the problem; it's pretty obvious.


> I thought the result of Oracle v. Google is that APIs can’t be copyrighted?

The result—not final yet—is that the Federal Circuit ruled APIs are eligible for copyright, but that ruling didn't create binding precedent that applies outside the Oracle v Google case. So future cases are still likely to produce the result that APIs can't be copyrighted, unless those cases also include the patent claims necessary to get them into the Federal Circuit for appeals.


How can you say that? Have you compared their cards to AMD or Intel? It's not even a fair comparison.


I think people are underestimating the difficulty of developing high performance microarchitecture for GPU or CPU.

A new clean-sheet architecture design takes 5-7 years, even for teams that have been doing it constantly for decades in places like Intel, AMD, ARM or Nvidia. This includes optimizing the design for the process technology, yield, etc. Then there are economies of scale and price points.

Recent examples:

* Nvidia's Volta microarchitecture design started 2013, launch was December 2017

* AMD's Zen CPU architecture design started in 2012, and the CPU was out in 2017.


I think microarchitecture design for CPUs is orders of magnitude harder than for GPUs.


In deep learning, they presented high-quality hand-optimized building blocks way before anybody else did (cuDNN). An effect of that is that the libraries were built around CUDA and cuDNN, and now AMD is still trying to catch up. Intel just hasn't delivered a fast enough, flexible enough, cheap enough GPU or GPU alternative, AFAIK.


Without reading the article first, let me guess, $200k?



Sheeeit. I’d love some of the stuff they’re smoking. You can build a 100 GPU rig for half as much. Just scatter it around the office so it’s not “in the data center”.


Well, those who were willing to pay the ~$150k for the DGX-1 (Volta 16 GB upgrade, IIRC) won't necessarily find it too much -- and NVIDIA is after quick money before the markets start slipping down the hype curve.

You also do get quite some meat compared to the DGX-1: the equivalent of 2x DGX-1 in terms of GPUs and NICs, with 4x the HBM2 size, plus the NVLink fabric. Plus more SSD storage and Xeon Platinums to round it off.

Oh, and 350 lbs, let's not forget. :D

Expensive, yes! Will it sell? I'd be very surprised if it didn't!


A 100 GPU rig would not have 512 GB in one address space, accessible by 16 GPUs. Each GPU can directly address the memory on any of the 16 GPUs.


Not only that, but the switch has the exact same throughput as the actual memory of the V100 until you get to fairly large block sizes. It was shown in a presentation.


For the $300k you'll save, just hire a full-time dev to write you a distributed implementation of your solver.


Haha, good one. Just that by the time that one dev implements everything, you'll have been overtaken by the competition -- at least that's what everyone's fear is, and it's partly warranted.


Many problems are bandwidth-limited. More cores may indeed actually hurt performance as contention for the limiting resource makes it increasingly hard for any core to run efficiently.

https://superuser.com/questions/2489/cpu-cores-the-more-the-...


You can't spend your way out of the latency issues associated with such a distributed architecture, though.

That's one of things NVLink helps with, actually.


A 100 GPU rig of 1080 Tis would have 1.1 TB of VRAM overall, though. And most jobs don't scale well beyond 4 GPUs anyway.


That is the problem the DGX-2 interconnect solves.


Not for currently available deep learning frameworks.


You're right but only because DGX-2 is not shipping yet. ;P


100 Titan Vs = $300k. And the big deal about this system is the NvSwitch which allows the GPUs to appear to software as a single huge GPU.


So you can run a single neural net model that can use more than 12 GB of memory?


You can already do that with Pascal: since it supports unified memory, you have a 49-bit address space.

Now it's even faster, since you won't have to fetch pages from system memory, or worse.

But this isn't new: NVLink has supported memory unification of this sort since its introduction, just never at this scale.

I'm pretty sure that, if their numbers are to be believed, they have just made the world's "fastest" single switched fabric.
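Quick arithmetic on why a 49-bit address space is more than enough for the DGX-2's pooled HBM2:

```python
# Size of a 49-bit virtual address space, as mentioned above.
address_bits = 49
space_bytes = 2 ** address_bits
pooled_hbm2_gib = 16 * 32  # 16 GPUs x 32 GB HBM2 each in a DGX-2

print(space_bytes / 2 ** 40)  # 512.0 TiB -- three orders of magnitude
print(pooled_hbm2_gib)        # beyond the 512 GB of pooled GPU memory
```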


And what are those numbers for fabric speed?..


900GB/s (GByte) bidirectional.


Do you think they actually mean 900 Gb/s (Gbit) here? 18 ports of today's ~50 Gb/s serdes would give 900 Gb/s. And that'd be twice the bandwidth of their NVLink 2.0 stuff (~25 Gb/s), which would seem a reasonable evolution.

Edit: Turns out that a single Nvlink 3.0 port is 8-bits wide and since 144 50Gb/s serdes on a (big) chip is perfectly doable - 900GB/s must be correct.
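The arithmetic from that edit, spelled out:

```python
# 18 NVLink ports, each 8 serdes lanes wide, at ~50 Gb/s per lane.
ports, lanes_per_port, gbit_per_lane = 18, 8, 50

total_gbit = ports * lanes_per_port * gbit_per_lane
print(total_gbit)       # 7200 Gb/s across 144 lanes
print(total_gbit // 8)  # 900 GB/s -- matching the quoted figure
```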


PCIe 3.0 x16 is already capable of more than 100 Gb/s per port.

900 Gbit of total bandwidth for 18 NVLink ports would be abysmally slow.


The article notes it's a 2B transistor switch chip... which boggles the mind.


Oh, that's very impressive!


You didn't work back in the day when, for high-end products, if you had to ask how much, you couldn't afford it.

In 1980 or so my office mate was working on a way to measure the efficiency of different toilet designs - to try and save water.

I looked at using the then very new image recognition hardware (and, I think, neural nets) but sadly realised that with a base hardware price of £250,000 (around a million in today's money) it wasn't practical.

Which was a pity, as the organisation was into ML/AI back then and even hired one of the first knowledge engineers (as they were called back then).


The DGX-2 server (16x V100) costs the same as about 26 DeepLearning11 servers (10x 1080Ti) - https://www.servethehome.com/deeplearning11-10x-nvidia-gtx-1... With 260 1080Ti GPUs, you can do neural architecture search that competes with some published work by Google.

The DL11 also pays for itself in about 90 days compared to renting a P100/V100 on AWS: http://www.logicalclocks.com/price-calculator/
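The payback-period reasoning can be sketched generically; the prices below are hypothetical placeholders, not the real DL11 or AWS figures (those are behind the links above):

```python
# Generic buy-vs-rent payback estimate, with hypothetical numbers.
def payback_days(purchase_price, cloud_rate_per_hour, utilization=1.0):
    """Days of round-the-clock rental that equal the purchase price."""
    return purchase_price / (cloud_rate_per_hour * 24 * utilization)

# e.g. a $16,000 server vs renting equivalent capacity at $7.50/hour
print(f"{payback_days(16000, 7.50):.1f} days")  # ~88.9 days
```

At full utilization, any buy-vs-rent ratio in that ballpark pays off in roughly a quarter; lower utilization stretches the payback proportionally.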


You are again missing the benefits of the switch, which factors into the cost at some point. Adding an equivalent number of GPUs spread across separate systems will not give you the same scaling as the same number of GPUs with a switch in between.


10,000 watts...wow!


Too bad this came from one of the worst companies in the world, judging by its policy towards open source. Too bad Google plays on their side by not merging OpenCL support into TensorFlow.



