The best way for you to run the model is probably through https://github.com/ggerganov/llama.cpp. This is a plain C/C++ implementation that can run LLMs pretty efficiently. It can run the LLaMA 13B variants at a pretty quick pace (~100 ms/token) on my M1 Pro MacBook.
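For a sense of scale, here is a quick sketch of what ~100 ms/token means in practice. The 300-token reply length is just an assumed example, not a measurement, and the function names are made up for illustration.

```python
def tokens_per_second(ms_per_token: float) -> float:
    """Convert a per-token latency in milliseconds to throughput."""
    return 1000.0 / ms_per_token


def seconds_for_reply(ms_per_token: float, n_tokens: int) -> float:
    """Estimate wall-clock time to generate n_tokens at a fixed latency."""
    return ms_per_token * n_tokens / 1000.0


if __name__ == "__main__":
    latency_ms = 100.0   # the rough figure quoted above
    reply_tokens = 300   # assumed reply length, for illustration only
    print(f"{tokens_per_second(latency_ms):.1f} tokens/s")
    print(f"{seconds_for_reply(latency_ms, reply_tokens):.1f} s for {reply_tokens} tokens")
```

So roughly ten tokens a second, which is about reading speed for most people.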
Thank you. As I'm new to this field, I have been grappling with some concepts like GGML vs GPTQ models, etc. It seems the GPTQ models only run on Linux, so your suggestion is right to use llama.cpp as that works with GGML and is compatible with M1 Macs as well.
I had trouble running oobabooga on an M1 Mac, something about Python arm64 vs x86_64...
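In case it helps anyone hitting the same arm64 vs x86_64 problem: a minimal sketch for checking which architecture your Python interpreter was built for, using only the standard library. I'm assuming the usual cause here, an x86_64 Python running under Rosetta, which would then pull in x86_64 wheels. The helper names are my own.

```python
import platform
import sys


def describe_interpreter() -> dict:
    """Collect the basics about the running Python interpreter."""
    return {
        "machine": platform.machine(),      # e.g. "arm64" or "x86_64"
        "system": platform.system(),        # e.g. "Darwin" on macOS
        "python": sys.version.split()[0],   # interpreter version string
    }


def is_native_arm64(machine: str) -> bool:
    """True if the reported machine type is an ARM 64-bit architecture."""
    return machine.lower() in {"arm64", "aarch64"}


if __name__ == "__main__":
    info = describe_interpreter()
    print(info)
    if info["system"] == "Darwin" and not is_native_arm64(info["machine"]):
        # Likely an Intel build of Python; on Apple Silicon that means Rosetta.
        print("This Python reports a non-arm64 architecture.")
```

If it reports x86_64 on an M1, installing an arm64 build of Python and recreating the virtual environment is the usual thing to try first.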
I didn't know about GPT4-x-Vicuna that you mentioned. Interesting to see that it's uncensored. I knew about Wizard-Vicuna-uncensored though. It seems like GPT4-x-Vicuna is similar.
Not OP, but while I was working on network code it suddenly refused, asserting that it could not help me make 'hacking tools', which was not what I was doing. Furthermore, even if it were, those tools are legal and useful. Unwanted censorship abounds.
I was thinking more about it helping people do something bad, like writing malware.
In fact I expect the phishers will have a field day with this. I expect phishing to become a lot more targeted and accurate in the near future. And much harder to detect.
I personally had the best experience with GPT4-x-vicuna: https://huggingface.co/NousResearch/gpt4-x-vicuna-13b.
There are more variants, and you can find information on them at https://www.reddit.com/r/LocalLLaMA.
I'd be happy to answer more questions.