Hmmm, I ran a Llama 2 GGML q4 model in 6 GB of RAM with llama.cpp on my laptop.


I very much appreciate your comment and will look into llama.cpp. Was it from here: https://github.com/ggerganov/llama.cpp

Do you have a guide that you followed and could link me to, or was it just from prior knowledge? Also, do you know if I could run Wizard Vicuna on it? That model isn't listed on the above page.


Glad to be of help. Yeah, that's the repo.

https://replicate.com/blog/run-llama-locally

I found that guide here on HN.

I run it CPU-only with 16 threads, but yeah, perf is good enough.

BTW, my 6 GB figure is me measuring from htop, so Llama 2 itself likely uses less.
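
If you'd rather script it than call the CLI directly, the llama-cpp-python bindings wrap the same engine. A minimal sketch (the model path is just whatever quantized GGML file you downloaded; 16 threads matches my setup):

  from llama_cpp import Llama

  # Load a 4-bit quantized GGML model; the filename here is an example,
  # point it at whatever quantized file you actually downloaded.
  llm = Llama(
      model_path="./models/llama-2-7b-chat.ggmlv3.q4_0.bin",
      n_ctx=2048,     # context window
      n_threads=16,   # CPU-only, 16 threads
  )

  out = llm("Q: What is the capital of France? A:", max_tokens=32)
  print(out["choices"][0]["text"])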


Thanks for the starting point. I'll post an update if I'm able to successfully run the other models; I hope it helps the community.


This code runs Llama 2, quantized and unquantized, in a roughly minimal way: https://github.com/srush/llama2.rs (though extracting the quantized 70B weights takes a lot of RAM). I'm running the 13B quantized model in ~10-11 GB of CPU memory.


From what I gather, this is a Rust implementation that runs Llama 2. Can it run any other models, like the ones I'm having trouble finding info about?


Not sure about Vicuna myself.



