Ask HN: Cheapest hardware to run Llama 2 70B

orost · on Aug 10, 2023

Anything with 64GB of memory will run a quantized 70B model. What else you need depends on what is acceptable speed for you. With a decent CPU but without any GPU assistance, expect output on the order of 1 token per second, and excruciatingly slow prompt ingestion. Any decent Nvidia GPU will dramatically speed up ingestion, but for fast generation, you need 48GB VRAM to fit the entire model. That means 2x RTX 3090 or better. That should generate faster than you can read.

Edit: the above is about PC. Macs are much faster at CPU generation, but not nearly as fast as big GPUs, and their ingestion is still slow.

quickthrower2 · on Aug 10, 2023

Vastai would rent you those for about $.50 an hour so gives you an idea of what it costs. Assuming the GPUs memory can be stacked

tstrimple · on Aug 10, 2023

Do these large models need the equivalent of SLI to take advantage of multiple GPU? Nvidia removed SLI from consumer cards a few years ago so I’m curious whether it’s even an option these days.

sterlind · on Aug 10, 2023

SLI isn't used at all for CUDA. if you meant NVLink, it's apparently not useful at small scales - I think the PCIe lanes are enough.

ipsum2 · on Aug 18, 2023

This is wrong, NVLink is crucial for tensor parallelism in models for training and in large (>40B param) models for inference.

Tepix · on Aug 10, 2023

I built a DIY PC with used GPUs (2x RTX 3090) for around 2300€ earlier this year. You can probably do it for slightly less now (i also added 128GB RAM and NVLink). You can generate text with >10 tok/s with that setup. Make sure to get a PSU with more than 1000W. Air cooling is a challenge, but it's possible.

Recommended reading: Tim Dettmer's guide https://timdettmers.com/2023/01/30/which-gpu-for-deep-learni...

byteknight · on Aug 11, 2023

You can't even buy the graphics cards for that price where the heck did you find that?

Tepix · on Aug 13, 2023

Almost everything was used, the GPUs were around 720€ each. You can now buy them as low as 600€. Make sure to get two identical ones if you plan to connect them with NVLink.

spikedoanz · on Aug 9, 2023

If you have a lot of money (but not H100/A100 money), get 4090s as they're currently the best bang for your buck on the CUDA side (according to George Hotz). If broke, get multiple second hand 3090s. https://timdettmers.com/2023/01/30/which-gpu-for-deep-learni.... If unwilling to spend any money at all and just want to play around with llama70b, look into petals https://github.com/bigscience-workshop/petals

XTXinverseXTY · on Aug 11, 2023

> according to George Hotz

In this video from June, George Hotz says to go with "3090s over 4090s. 3090s have NVLink... 4090s are $1600 and 3090s are 750. RAM bandwidth is about the same." Has he changed his recommendations since then?

https://youtu.be/Mr0rWJhv9jU?t=1157

flangola7 · on Aug 9, 2023

4090 has the same amount of VRAM, which is the important factor.

spikedoanz · on Aug 10, 2023

right. I opted for a used 3090 myself and plan to get a 2nd one soon. At current market prices 2x 3090s is cheaper than a single 4090 and provides double vram with more performance. If fine-tuning/lora-ing and energy efficiency is a concern though, I would opt for a 4090 since it is both far faster and far more efficient

leonletto · on Aug 10, 2023

We bought an A6000 48GB ( as mentioned by someone else ) and it’s works great for $3800. The power requirements are modest as well compared to consumer GPU’s. We looked at the ADA version but even used they are a lot more and your buying speed not usability. I would rather buy another A6000 and have 96GB of ram to fine tune with. That’s just me though and everyone needs to rank their needs against what they can afford.

gorbypark · on Aug 9, 2023

A 192gb Mac Studio should be able to run an unquantized 70B and I think would cost less than running a multi gpu setup made up of nvidia cards. I haven’t actually done the math, though. If you factor in electricity costs over a certain time period it might make the Mac even cheaper!

oceanplexian · on Aug 9, 2023

A Mac studio will "run" the model as a glorified chat bot but it'll be unusable for anything interesting at 5-6t/s. With a couple of high end consumer GPUs you're going to get closer to 20t/s. You also be able to realistically fine tune models, and run other interesting things besides an LLM.

vczf · on Aug 10, 2023

I have a $5000 128GB M2 Ultra Mac Studio that I got for LLMs due to speculation like GP here on HN. I get 7.7 tok/s with LLaMA2 70B q6_K ggml (llama.cpp).

It has some upsides in that I can run quantizations larger than 48GB with extended context, or run multiple models at once, but overall I wouldn't strongly recommend it for LLMs over an Intel+2x4090 setup.

It's competitive, but has significant tradeoffs.

kolleykibber · on Aug 10, 2023

Could you give some examples of these interesting things?

smoldesu · on Aug 10, 2023

Inferencing would probably be ~10x slower than tiling the model across equivalently priced Nvidia hardware. The highest-end M2 Mac chip you can purchase today struggles to compete with last-gen laptop cards from Nvidia. Once you factor in the value of CUDA in this space and the quality of ML driver Nvidia offers, I don't see why Macs are even considered in the "cheapest hardware" discussion.

> If you factor in electricity costs over a certain time period it might make the Mac even cheaper!

I dunno about that. The M2 Max will happily pull over 200w in GPU-heavy tasks, if we're comparing a 40-series card with CUDA optimizations to Pytorch with Metal Performance Shaders, my performance-per-watt money is on Nvidia's hardware.

throwawayadvsec · on Aug 9, 2023

your idea of cheapest is using a mac to run a LLM? are you out of your mind?

gorbypark · on Aug 11, 2023

Well, to be fair, to run an unquantized 70B model is going to take somewhere in the area of 160gb of VRAM (if my quick back of the napkin math is ok). I'm not quite sure of the state of GPUs these days, but getting a 2x a100 80gb (or 4x 40gb) setup is probably going to cost more than a Mac Studio with maxed out RAM.

If we are talking quantized, I am currently running LLaMA v1 30B at 4 bits on a MacBook Air 24GB ram, which is only a little bit more expensive than what a 24GB 4090 retails for. The 4090 would crush the MacBook Air in tokens/sec, I am sure. It is however completely usable on my MacBook (4 tokens/second, IIRC? I might be off on that).

A 4 bit 70B model should take about 36GB-40GB of RAM so a 64GB MacStudio might still be price competitive with a dual 4090 or 4090 / 3090 split setup. The cheapest Studio with 64GB of RAM is 2,399.00 (USD).

Ms-J · on Aug 9, 2023

The only info I can provide is the table I've seen on: https://github.com/jmorganca/ollama where it states one needs "32 GB to run the 13B models." I would assume you may need a GPU for this.

Related, could someone please point me in the right direction on how to run Wizard Vicuna Uncensored or Llama2 13B locally in Linux? I've been searching for a guide and have not found what I need for a beginner like myself. In the Github I referenced the download is only for Mac at the time. I have a Macbook Pro M1 I can use though it's running Debian.

Thank you.

Patrick_Devine · on Aug 11, 2023

You can run `ollama run wizard-vicuna-uncensored:13b` and it should pull and run it. For llama2 13b, it's `ollama run llama2:13b`. I haven't seen the 13b uncensored version yet.

There's a complete list of models at https://gist.github.com/mchiang0610/b959e3c189ec1e948e4f6a1f...

We'll have a better way to browse these soon.

hdjfkfbfbr · on Aug 9, 2023

Hmmm I ran llama2 ggml q4 in 6gb ram with llama.CPP on my laptop.

Ms-J · on Aug 9, 2023

I very much do appreciate your comment and will look into into llama.cpp. Was it from here: https://github.com/ggerganov/llama.cpp

Do you have a guide that you followed and could link it to me or was it just from prior knowledge? Also, do you know if I could run the Wizard Vicuna on it? That model isn't listed on the above page.

hdjfkfbfbr · on Aug 9, 2023

Glad to be of help. Yea that is the repo.

https://replicate.com/blog/run-llama-locally

I found that guide here on hn.

I run it cpu only with 16 threads but yeah perf is good enough.

BTw my 6gb figure is me.measuring from htop so llama2 is likely less.

Ms-J · on Aug 10, 2023

Thanks for the starting point. I'll give an update if I'm able to successfully run the other models. I hope it could help the community.

singhrac · on Aug 12, 2023

This code runs Llama2 quantized and unquantized in a roughly minimal way: https://github.com/srush/llama2.rs (though extracting the quantized 70B weights takes a lot of RAM). I'm running the 13B quantized model on ~10-11GB of CPU memory.

Ms-J · on Aug 17, 2023

From what I gather, this is a Rust implementation that runs Llama2. Can it run any other models like the ones I'm having trouble finding info about?

hdjfkfbfbr · on Aug 9, 2023

Not sure about vicuna myself

mechagodzilla · on Aug 9, 2023

I've been able to run it fine using llama.cpp on my 2019 iMac with 128GB of RAM. It's not super fast, but it works fine for "send it a prompt, look at the reply a few minutes later", and all it cost me was a few extra sticks of RAM.

more_corn · on Aug 10, 2023

You can run on cpu and regular ram, but gpu is quite a bit faster.

You need about a gig of RAM/nvram per billion parameters (plus some headroom for a context window). Lower precision doesn’t really affect quality.

When Ethereum flipped from proof of work to proof of stake, a lot of used high-end cards hit the market.

4 of them in a cheap server would do the trick. Would be a great business model for some cheap colo to stand up a crap-ton of those and rent while servers to everyone here.

In the meantime if you’re interested in a cheap server as described above, post in this thread.

tuxpenguine · on Aug 9, 2023

I don't think it is the cheapest, but the tiny box is an option:

https://tinygrad.org/

Tepix · on Aug 10, 2023

Tiny box is for running it at fp16, most other comments here talk about running it with 4bit quantization

PenguinRevolver · on Aug 10, 2023

I feel as if the cheapest way of running these kinds of models would be to have the whole cache/memory take space on the hard drive rather than the RAM. Then, you could just use CPU power instead of splurging out thousands for RAM & a GPU with enough VRAM.

It might or might not be reasonable speeds, but I would reason that it could avoid "sunk cost irony"; if you decide, that any point, Chat-GPT would have sufficed in your task. It's rare, but it can happen.

If you want to take this silly logic further, you can theoretically run any sized model on any computer. You could even attempt this dumb idea on a computer running Windows 95. I don't care how long it would take; if it takes seven and a half million years for 42 tokens, I would still call it a success!

pocketarc · on Aug 10, 2023

You are right about that being the cheapest, of course, in the sense that 64gb of HDD space is always going to be cheaper than RAM. But when you say

> thousands for RAM

I wonder if your perspective might be a little off - you can get 64GB DDR4 RAM for ~$100, it’s really not a big deal these days.

It’s a big deal on Mac, of course, where 64GB means big kitted out high-end model that costs thousands, but RAM really is that cheap.

PenguinRevolver · on Aug 10, 2023

Understandable; the reason I said "thousands for RAM" was because when I made that sentence, I put the theoretical RAM and GPU prices together. Oh well.

pocketarc · on Aug 14, 2023

My apologies, I think the bit of context missing from my response is you don't need a GPU at all; 64GB of RAM will suffice to run a 70B model with your CPU, and it won't even be -that- slow, you'll get a few tokens per second.

So while a lot of us think that you need to splurge in order to get into LLMs, the reality is you don't, not really, and pretty much any computer will run any model, thanks to the efforts of projects like llama.cpp. Even using the disk like you mentioned! That's a thing, too. It's slower, but it's entirely possible.

If you're willing to drop down to the 7B/13B models, you'll need even less RAM (you can run 7B models with less than 8GB of RAM), and they'll run radically faster.

People have been working really hard to make it possible to run all these models on all sorts of different hardware, and I wouldn't be surprised if Llama 3 comes out in much bigger sizes than even the 70B, since hardware isn't as much of a limitation anymore.

1letterunixname · on Aug 9, 2023

If it's only for a short time, use a price calculator to decide if it's worth renting GPUs on a cloud provider. You can get immediate temporary access for far more computing power than you can ever hope to buy outright.

amerine · on Aug 9, 2023

You would need at least a RTX A6000 for the 70b. You’re looking at maybe $4k? Plus whatever you spend on the rest of the machine? Maybe $6k all-in?

ano88888 · on Aug 10, 2023

another question I would like to know: which cloud provider provides the cheapest GPU?TQ

quickthrower2 · on Aug 10, 2023

you need to qualify that with “that is actually available”. :-). A100s I hear are harder to get in bulk. But I have modest needs!

I have been using modal and vast. Vast is cheaper. Modal has some a free inclusions of $30 but that is probably $8 to get the same power in Vast. Modal resell AWS/GCP at the moment. GCP direct seems cheap enough. As does Lambda labs.

With vast some machines don’t start, so you just need to bin them and try another. For learning and non private this is acceptable. For serious stuff I think Vast lets you filter for data-centre GPUs. Modal tends to just work and lets you store the model for later more easily.

Overall: just go with vast. You boot it up and run SSH. It is a familiar experience. Very little time needed on RTFM stuff!

Tepix · on Aug 10, 2023

Vast.ai is dirt cheap if you want 1-8 RTX 3090s

mromanuk · on Aug 9, 2023

one 4090 + 3090-Ti

https://github.com/turboderp/exllama

trifurcate · on Aug 9, 2023

That's not a minimum, just what the author had on hand. I can get 65B and 70B running just fine with exllama on 2 3090s.

cypress66 · on Aug 9, 2023

2 used 3090s

phas0ruk · on Aug 9, 2023

SageMaker

jrflowers · on Aug 9, 2023

How much can I purchase one (1) SageMaker for?

shreezus · on Aug 9, 2023

Fair warning - tools like SageMaker are good for simple use cases, but SageMaker tends to abstract away a lot of functionality you might find yourself digging through the framework for. Not to mention - it's easy to rack up a hefty AWS bill

vasili111 · on Aug 9, 2023

How much approximately will be price per hour?

astrodust · on Aug 9, 2023

Very yes.