I have a $5000 128GB M2 Ultra Mac Studio that I bought for LLMs, partly due to speculation like the GP's here on HN. I get 7.7 tok/s with LLaMA 2 70B q6_K ggml (llama.cpp).
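If anyone wants to compare, a number like that can be measured with llama.cpp's bundled llama-bench tool (the model path below is illustrative, not my exact file):

```shell
# Measure prompt-processing and token-generation speed with llama.cpp.
# -m: path to a local quantized model file (filename here is an example)
# -p: prompt length in tokens, -n: number of tokens to generate
./llama-bench -m models/llama-2-70b.Q6_K.gguf -p 512 -n 128
```

It prints a small table with tok/s for both prompt processing and generation, which is the figure I'm quoting.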
It has some upsides: I can run quantizations larger than 48GB with extended context, or run multiple models at once. But overall I wouldn't strongly recommend it for LLMs over an Intel + 2x4090 setup.
It's competitive, but has significant tradeoffs.