
i don't think such a guide exists. this space is moving pretty fast. a short rundown:

quantized model formats:

- GGML: the older file format used with llama.cpp. outdated; support has been dropped or will be soon. cpu+gpu inference

- GGUF: "new version" of the GGML file format, used with llama.cpp. cpu+gpu inference. offers 2-8bit quantization

- GPTQ: pure gpu inference, used with AutoGPTQ, exllama, exllamav2, offers only 4 bit quantization

- EXL2: pure gpu inference, used with exllamav2, offers 2-8bit quantization

here[1] is a nice overview of VRAM usage vs perplexity of different quant levels (with the example of a 70b model in exl2 format)

[1] https://old.reddit.com/r/LocalLLaMA/comments/178tzps/updated...
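
to make the formats a bit less abstract: here's a minimal sketch of loading a GGUF quant with llama-cpp-python (the python bindings for llama.cpp). the model path is just an example; any gguf quant from hf would do:

    # load a 4-bit GGUF quant and run a prompt (path is hypothetical)
    from llama_cpp import Llama

    llm = Llama(model_path="./mistral-7b-instruct.Q4_K_M.gguf")
    out = llm("Q: what is quantization? A:", max_tokens=64)
    print(out["choices"][0]["text"])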



Worth clarifying that GGML the library is very much active. GGML as a file format will be superseded by GGUF.


is everything (for the most part) a Llama model? does everything fork llama? is GGML part of llama? what is the relation between llama and model formats? Is there an analogy? is GGML to llama what React is to JavaScript? What is the difference between GPT4All models vs llama.cpp vs ollama?

Thanks!


Everything (most LLMs and modern embedding models) is a transformer, so the architectures are very similar. Llama(2) is a Meta (Facebook) developed transformer plus the training they did on it.

Ggml is a "framework" like pytorch etc (for the purposes of this discussion) that lets you code up the architecture of a model, load in the weights that were trained, and run inference with it. Llama.cpp is a project that I'd describe as using ggml to implement some specific AI model architectures.


i am only dabbling in this space myself, so can't answer everything. all the formats i mentioned are for a quantized version of the original model. basically a lower resolution version, with the associated precision loss. e.g. original model weights are in f16, the gptq version is in int4. a big difference in size but often an acceptable loss of quality. using quants is basically a tradeoff between quality and "can i run it?".
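
a toy illustration of that precision loss (this is not the actual gptq algorithm, which is much smarter about where it spends the error budget):

    # round-trip f16 weights through a naive 4-bit quantization
    import numpy as np

    w = np.random.randn(8).astype(np.float16)                  # original "f16" weights
    scale = float(np.abs(w).max()) / 7                         # map into the signed 4-bit range
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)    # the int4 values (stored in int8 here)
    w_restored = (q * scale).astype(np.float16)                # what inference actually sees

    print(w)
    print(w_restored)   # close, but not identical -- that's the precision loss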

examples of original models are llama(2), mistral, xwin. they are not directly related to any quantized versions. quants are mostly done by third parties (e.g. thebloke[1]).

using a full model for inference requires pretty beefy hardware. most inference on consumer hardware is done with quantized versions for that reason.

[1] https://huggingface.co/TheBloke
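
re: "pretty beefy hardware", the back-of-the-envelope math (weights only, ignoring the kv cache and other overhead):

    # rough weight-only memory footprint of a 70b model
    params = 70e9
    print(params * 2 / 1e9, "GB at f16")      # ~140 GB
    print(params * 0.5 / 1e9, "GB at 4-bit")  # ~35 GB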


GGML is a framework for running deep neural networks, mostly for inference. It's at the same level as PyTorch or TensorFlow. So I would say GGML is the browser in your JavaScript/React analogy.

llama.cpp is a project that uses GGML the framework under the hood (same authors). Some features were even developed in llama.cpp before being ported to GGML. Ollama provides a user-friendly way to use llama models. No idea what it uses under the hood.
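
For what it's worth, Ollama exposes a local HTTP API (port 11434 by default), so calling it from Python can be as simple as something like this (model name and prompt are just examples, and I'm going from its docs):

    # minimal sketch of calling a locally running Ollama server;
    # assumes the model has already been pulled (e.g. `ollama run llama2`)
    import json
    import urllib.request

    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({"model": "llama2",
                         "prompt": "Why is the sky blue?",
                         "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["response"])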


The Llama name is pretty confusing at this point.

LLaMA was the model Facebook released under a non-commercial license back in February, which was the first really capable openly available model. It drove a huge wave of research, and various projects were named after it (llama.cpp for example).

Llama 2 came out in July and allowed commercial usage.

But... there is an increasing number of models now that aren't actually related to Llama at all. Projects like llama.cpp and Ollama can often be used to run those too.

So "Llama" no longer reliably means "related to Facebook's LLaMA architecture".


- GPTQ: pure gpu inference, used with AutoGPTQ, exllama, exllamav2, offers only 4 bit quantization

what are AutoGPTQ and exllama, and what does it mean that it only works with AutoGPTQ and exllama? Are those frameworks like TensorFlow?


Ollama seems to use a lot of the same pieces, wrapped up as a really nice, easy-to-use replacement for the glue a lot of us would wind up writing anyway. It's quickly become my personal preference.

It looks like it includes submodules for GGML and GGUF from llama.cpp:

https://github.com/jmorganca/ollama/tree/main/llm


The model discussed in the article is MiniLM-L6-v2, which you can run via PyTorch from the sentence-transformers project[1].

That model is based on BERT, not LLaMA [2].

[1]: https://www.sbert.net/docs/pretrained_models.html

[2]: https://huggingface.co/microsoft/MiniLM-L12-H384-uncased
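
If anyone wants to try it, a minimal sentence-transformers snippet along these lines should work (model name per the sbert docs linked above):

    # embed a couple of sentences with the MiniLM-based model
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(["How do I run a local LLM?", "What is GGUF?"])
    print(embeddings.shape)  # (2, 384) -- 384-dimensional sentence embeddings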


I think you're still missing AWQ, which is sort of like GPTQ but activation-aware: it chooses which weights to protect based on weight importance / activation statistics, iirc?



