
I am planning to buy a new GPU.

If the GPU has 16GB of VRAM, and the model is 70GB, can it still run well? Also, does it run considerably better than on a GPU with 12GB of VRAM?

I run Ollama locally, mixtral works well (7B, 3.4GB) on a 1080ti, but the 24.6GB version is a bit slow (still usable, but has a noticeable start-up time).



While GPUs are still the kings of speed, if you are worried about VRAM I do recommend a maxed out Mac Studio.

Llama.cpp + quantized models on Apple Silicon is an incredible experience, and having 192 GB of unified memory to work with means you can run models that just aren't feasible on a home GPU setup.

It really boils down to what type of local development you want to do. I'm mostly experimenting with things where response time isn't that big of a deal, and not fine-tuning models locally (which I also believe GPUs are still superior for). But if your concern is "how big of a model can I run" rather than "can I have close to real-time chat", the unified memory approach is superior.
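
For anyone curious, the workflow is basically just llama.cpp plus a quantized GGUF; a rough sketch (the model filename here is a placeholder, not a recommendation):

    # clone and build llama.cpp (Metal support is on by default on Apple Silicon)
    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp && make
    # run a quantized GGUF model, offloading all layers to the GPU
    ./main -m models/some-model.Q4_K_M.gguf -ngl 99 -p "Hello"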


I had initially gone down the Mac Studio route, but I ended up getting an A6000 for about the same price as a Mac and putting it in a Linux server under my desk. Ollama makes it dead simple to serve it over my local network, so I can be on my M1 Air and use it no differently than if it were running on my laptop. The difference is that the A6000 absolutely smokes the Mac.
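
For anyone wanting to copy the setup, it's roughly this (the hostname is a placeholder; 11434 is Ollama's default port):

    # on the Linux server: bind to all interfaces instead of localhost
    OLLAMA_HOST=0.0.0.0 ollama serve
    # on the laptop: point the ollama CLI at the server
    OLLAMA_HOST=linux-box:11434 ollama run mistral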


Wow, that is a lot of money ($4,400 on Amazon) to throw at this problem. I'm curious: what purpose compelled you to spend that amount of money (for a home network, I assume)?


Large scale document classification tasks in very ambiguous contexts. A lot of my work goes into using big models to generate training data for smaller models.

I have multiple millions of documents, so GPT is cost-prohibitive and too slow. My tools of choice tend to be a first pass with Mistral to check task performance, and if that's lacking, switching to Mixtral.

Often I find that with a good prompt, Mistral works as well as Mixtral and is about 10x faster.

I’m on my “home” network, but it’s a “home office” for my startup.


Interesting, I have the same task; can you share your tools? My goal is to detect whether documents contain GDPR-sensitive parts or are copies of official documents like IDs and driving licenses, etc. It would be great to reuse your work!


Working in the same sector, we’ll license it out soon.


this. if you can afford m3 level of money, the a6000 is definitely worth it and gives you long-term access to a level of compute that's hard to find in the cloud (for the price and waiting period).

it is only dwarfed by other options if your workload can use multi-gpu, which is not a given in most cases.


> The difference is that the A6000 absolutely smokes the Mac.

Memory bandwidth: Mac Studio wins, though it's about the same (~800 GB/s each)

VRAM: Mac Studio wins (4x more)

TFLOPS: A6000 wins (38 vs 32)


VRAM in excess of the model one is using isn’t useful per se. My use cases require high throughput, and on many tasks the A6000 executes inference at 2x speed.


I know the M?-Pro and Ultra variants are multiple standard M?'s in a single package. But do the CPUs and GPUs share a die (i.e., a single die comes with something like a 4 P-core CPU and 10 GPU cores, and the more exotic variants are just the result of LEGO-ing those dies together and disabling some cores for market segmentation or because they had defects)?

I guess I'm wondering if they could technically throw down the gauntlet and compete with Nvidia by doing something like a 4 CPU/80 GPU/256 GB chip, if they wanted to. Seems like it'd be a really appealing ML machine. (I could also see it being technically possible but Apple just deciding that's pointlessly niche for them.)


Ultra is the only one that's made from two smaller SoCs.


I already have 128GB of RAM (DDR4), and was wondering if upgrading from a 1080 Ti (11GB) to a 4070 Ti Super (16GB) would make a big difference.

I assume the FP32 and FP16 throughput is already a huge improvement, but the ~45% more VRAM might also lead to fewer swaps between VRAM and RAM.


I have an RTX 3080 with 10GB of VRAM. I'm able to run models larger than 10GB using llama.cpp and offloading to the GPU as much as can fit into VRAM. The remainder of the model runs on CPU + regular RAM.

The `nvtop` command displays a nice graph of how much GPU processing and VRAM is being consumed. When I run a model that fits entirely into VRAM, say Mistral 7B, nvtop shows the GPU processing running at full tilt. When I run a model bigger than 10GB, say Mixtral or Llama 70B with GPU offloading, my CPU will run full tilt and the VRAM is full, but the GPU processor itself will operate far below full capacity.

I think what is happening here is that the model layers that are offloaded to the GPU do their processing, then the GPU spends most of the time waiting for the much slower CPU to do its thing. So in my case, I think upgrading to a faster GPU would make little to no difference when running the bigger models, so long as the VRAM is capped at the same level. But upgrading to a GPU with more VRAM, even a slower GPU, should make the overall speed faster for bigger models because the GPU would spend less time waiting for the CPU. (Of course, models that fit entirely into VRAM will run faster on a faster GPU).
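
For reference, the offloading is just a flag on the llama.cpp side; roughly like this (the layer count is whatever fits in your VRAM, and the model path here is a placeholder):

    # offload 12 of the model's layers to the GPU, run the rest on the CPU
    ./main -m models/mixtral-8x7b.Q4_K_M.gguf -ngl 12 -p "..."
    # in another terminal, watch VRAM and GPU/CPU utilization
    nvtop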

In my case, the amount of VRAM absolutely seems to be the performance bottleneck. If I do upgrade, it will be for a GPU with more VRAM, not necessarily a GPU with more processing power. That has been my experience running llama.cpp. YMMV.


How's your performance on the 70b parameter llama series?

Any good writeups of the offloading that you found?


Performance of 70b models is like 1 token every few seconds. And that's fitting the whole model into system RAM, not swap. It's interesting because some of the larger models are quite good, but too annoyingly slow to be practical for most use cases.

The Mixtral models run surprisingly well. They can run better than 1 token per second, depending on quantization. Still slow, but approaching a more practical level of usefulness.

Though if you're planning on accomplishing real work with LLMs, the practical solution for most people is probably to rent a GPU in the cloud.


That's system memory, not unified memory. Unified means that all or most of it is going to be directly available to the Apple Silicon GPU.


This is the key factor here. I have a 3080, with 16GB of Memory, but still have to run some models on CPU since the memory is not unified at all.


Wait for the M3 Ultra and it will be 256GB and markedly faster.


Aren't quantized models outright different models, requiring a new evaluation to know how much the performance deviates? Or are they "good enough", in that the benefits outweigh the deviation?

I'm on the fence about whether to spend 5 digits or 4 digits. Do I go the Mac Studio route or GPUs? What are the pros and cons?


Aren't the Macs good for inference but not for training or fine tuning?


>If the GPU has 16GB of VRAM, and the model is 70GB, can it still run well? Also, does it run considerably better than on a GPU with 12GB of VRAM?

No, it can't run at all.

>I run Ollama locally, mixtral works well (7B, 3.4GB) on a 1080ti, but the 24.6GB version is a bit slow (still usable, but has a noticeable start-up time).

That is not Mixtral, that is Mistral 7B. The 1080 Ti is slower than running inference on current-generation Threadripper CPUs.


> No, it can't run at all.

https://s3.amazonaws.com/i.snag.gy/ae82Ym.jpg

EDIT: This was run on a 1080 Ti + 5900X. Initial generation takes around 10-30 seconds (as if it has to upload the model to the GPU), but then it starts answering immediately, at around 3 words per second.


Did you check your GPU utilization?

Typically when it runs that way it runs on the CPU, not the GPU.

Are you sure you're actually offloading any work to the GPU?

At least with llama.cpp, there is no 'partially put a layer' into the GPU. Either you do, or you don't. You pick the number of layers. If the model is too big, the layers won't fit and it can't run at all.

The llama.cpp `main` executable will tell you in its debug information when you use the -ngl flag; see https://github.com/ggerganov/llama.cpp/blob/master/examples/...

It's also possible you're running (e.g. if you're using ollama) a quantized version of the model, which reduces the memory requirements and the quality of the model outputs.
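
On the utilization question, a quick way to check while a generation is running (assuming an NVIDIA card with the standard drivers):

    # if the model is actually on the GPU, you should see the ollama/llama.cpp
    # process in the list and several GB of VRAM in use
    nvidia-smi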


I have to check; something does indeed seem weird, especially with the PC freezing like that. Maybe it runs on the CPU.

> quantized version

Yes, it is 4-bit quantized, but it's still 24.6GB


this is some new flex for debating online: copying and pasting the other side's argument and waiting for your local LLM to explain why they are wrong.

how much is your hardware worth at today's value? what are the specs? that is impressive even though it's 3 words per second. if you want to bump it up to 30, do you then 10x your current hardware cost?


That question was just an example (lorem ipsum); it was easy to copy-paste to demo the local LLM. I didn't intend to add more context to the discussion.

I ordered a 2nd 3090, which has 24GB VRAM. Funny how it was $2.6k 3 years ago and now is $600.

You can probably build a decent local AI machine for around $1000.


https://howmuch.one/product/average-nvidia-geforce-rtx-3090-... you are right, there has been a huge drop in price


New, they're hard to find, but the second-hand market is filled with them.


Where are you seeing 24GB 3090s for $600?


2nd hand market


Congratulations on using CPU inference.


I have these:

dolphin-mixtral:latest (24.6GB)
mistral:latest (3.8GB)

The CPU is a 5900x.


Get 2 pre-owned 3090s. You will easily be able to run 70b or even 120b quantized models.


> mixtral works well

Do you mean mistral?

mixtral is 8x7B and requires like 100GB of RAM

Edit: (without quant, as others have pointed out) it can definitely be lower, but I haven't heard of a 3.4GB version
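
For what it's worth, unquantized fp16 weights come out to roughly 47B params × 2 bytes ≈ 94GB, which lines up with the ~100GB figure; 4-bit quants bring that down to the mid-20GB range.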


I have two 3090s and it runs fine with `ollama run mixtral`. Although OP definitely meant mistral with the 7B note


ollama run mixtral will default to the quantized version (4bit IIRC). I'd guess this is why it can fit with two 3090s.


I'm using mixtral-8x7b-v0.1.Q4_K_M.gguf with llama.cpp and it only requires 25GB.
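
That lines up with the back-of-envelope math: Mixtral 8x7B is roughly 47B total parameters, and Q4_K_M averages somewhere around 4.5 bits per weight, so 47e9 × 4.5 / 8 ≈ 26GB on disk, plus a bit more at runtime for the KV cache.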


I have 128GB, but something is weird with Ollama. Even though I only allow the Ollama Docker container 90GB, it ends up using 128GB/128GB, so the system becomes very slow (the mouse freezes).


What docker flags are you running?


None? The default ones from their docs.

The Docker also shows minimal usage for the ollama server which is also strange.
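
For reference, the documented run command with an explicit memory cap added would look roughly like this (the 90g value is just an example; note that on Windows, Docker Desktop's WSL2 backend has its own memory ceiling configured in .wslconfig, which this flag doesn't touch):

    # Ollama's documented docker run, plus a container memory limit
    docker run -d --gpus=all --memory=90g \
      -v ollama:/root/.ollama -p 11434:11434 \
      --name ollama ollama/ollama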


I run the Mixtral 6-bit quant very happily on my MacBook with 64 GB.


The smaller quants still require a 24GB card. 16GB might work, but I doubt it.


Sorry, it was from memory.

I have these models in Ollama:

dolphin-mixtral:latest (24.6GB)
mistral:latest (3.8GB)


The quantized one works fine on my 24GB 3090.


I genuinely recommend considering AMD options. I went with a 7900 XTX because it has the most VRAM of any $1000 card (24 GB). NVIDIA cards at that price point are only 16 GB. Ollama and other inference software work on ROCm, generally with at most setting an environment variable now. I've even run Ollama on my Steam Deck with GPU inferencing :)
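
For cards that aren't on ROCm's official support list, the usual trick is overriding the reported gfx version via an environment variable; a sketch (the version string depends on your GPU generation, so treat it as an example):

    # tell ROCm to treat the card as a supported RDNA3 target
    HSA_OVERRIDE_GFX_VERSION=11.0.0 ollama serve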


I ended up getting a second-hand 3090 for 680€.

Funnily enough, I think the card is new (it smells new) and unused; most likely a scalper bought it and couldn't sell it.


Nice, that's definitely a sweet deal


Thanks, I chose a 3090 instead of a 4070 Ti Super; it was around $200 cheaper, has 24GB vs 16GB of VRAM, and similar performance. The only drawback is the 350W TDP.

I still struggle with the RAM issue on Ollama, where it uses 128GB/128GB of RAM for the 24.6GB Mixtral, even though the Docker limit is set to 90GB.

Docker seems pretty buggy on Windows...


Quantized models will run well; otherwise inference might be really slow, or the client crashes altogether with a CUDA out-of-memory error.



