> My RTX 5090 is about 10x faster (measured by FP32 TFLOPS) and I still don't find it to be fast enough. I can't imagine using something so slow for AI/ML. Only 2.2 tokens/sec on an 8B parameter Llama model? That's slower than someone typing.
It's also orders of magnitude slower than the figures I normally see cited by people using 5090s; heck, it's even much slower than what I see on my own 3080 Ti laptop card for 8B models, though I usually won't use more than an 8bpw quant for a model that size.
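For a rough sanity check on why 2.2 tok/s looks wrong: single-stream decode on a dense model is usually memory-bandwidth bound, so the ceiling is roughly bandwidth divided by bytes read per token (about the model's size in memory). A minimal sketch, where the ~1792 GB/s bandwidth figure for the 5090 and the 8-bit weight assumption are mine, not from the post:

```python
def est_tokens_per_sec(bandwidth_gb_s: float, params_b: float, bytes_per_param: float) -> float:
    """Upper-bound decode throughput for a dense model: each generated token
    requires streaming (approximately) all weights through memory once."""
    model_bytes = params_b * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / model_bytes

# Assumptions: ~1792 GB/s for an RTX 5090, 8B params at 8 bits/weight (1 byte).
print(round(est_tokens_per_sec(1792, 8, 1)))  # ~224 tokens/sec ceiling
```

Real numbers land well below that ceiling, but nowhere near 2.2 tok/s, which is why results that slow usually point to the model spilling out of VRAM or running on CPU.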