In both videos I linked to, you can see ms/token in a tiny font at the top of the screen once text generation completes. Performance will vary by machine. On my 64GB M2 Max Mac Studio, I get ~47 tokens/s (21.06 ms/token) with Mistral Instruct v0.2 and ~33 tokens/s (30.14 ms/token) with Mixtral Instruct v0.1.
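(For reference, tokens/s is just 1000 divided by ms/token; a quick sanity check of those numbers:)

    # tokens/s = 1000 / (ms/token)
    for name, ms_per_token in [("Mistral Instruct v0.2", 21.06),
                               ("Mixtral Instruct v0.1", 30.14)]:
        print(f"{name}: {1000 / ms_per_token:.1f} tokens/s")
    # Mistral Instruct v0.2: 47.5 tokens/s
    # Mixtral Instruct v0.1: 33.2 tokens/s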
I haven't run any specific low-level benchmarks lately, but chunked prefill and TVM auto-tuned Metal kernels from mlc-llm seemed to make a big difference the last time I checked. Also, compared to stock mlc-llm, I target a newer version of Metal (3.0) and carry a few modifications that give models a slightly smaller memory and disk footprint and slightly faster execution. I can get away with this because, unlike the mlc-llm folks, I only care about compatibility with Apple platforms; their upstream project supports much more than that.
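If it helps to picture what chunked prefill does: instead of pushing the whole prompt through the prefill pass in one go (where intermediate activation memory grows with prompt length), the prompt is processed in fixed-size chunks, so peak memory stays bounded while the KV cache still ends up covering the full prompt. A minimal sketch, assuming a hypothetical model.prefill(tokens, kv_cache) API rather than mlc-llm's actual interface:

    CHUNK_SIZE = 512  # illustrative; real chunk sizes are tuned per device

    def chunked_prefill(model, prompt_tokens, kv_cache):
        """Feed the prompt through prefill in fixed-size chunks."""
        logits = None
        for start in range(0, len(prompt_tokens), CHUNK_SIZE):
            chunk = prompt_tokens[start:start + CHUNK_SIZE]
            # Each call appends this chunk's keys/values to the cache;
            # logits from all but the last chunk are discarded.
            logits = model.prefill(chunk, kv_cache)
        return logits  # logits for the final prompt token seed decoding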