> My RTX 5090 is about 10x faster (measured by FP32 TFLOPS) and I still don't find it to be fast enough. I can't imagine using something so slow for AI/ML. Only 2.2 tokens/sec on an 8B parameter Llama model? That's slower than someone typing.
It's also orders of magnitude slower than the figures I normally see cited by people using 5090s; heck, it's even much slower than what I see on my own 3080 Ti laptop card for 8B models, though I usually won't use more than an 8bpw quant for a model that size.
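For a rough sanity check on why 2.2 tok/s looks wrong: single-stream decode on a dense model is usually memory-bandwidth bound, so the ceiling is roughly bandwidth divided by bytes read per token (about the model's size in memory). A minimal sketch, where the ~1792 GB/s bandwidth figure for the 5090 and the 8-bit weight assumption are mine, not from the post:

```python
def est_tokens_per_sec(bandwidth_gb_s: float, params_b: float, bytes_per_param: float) -> float:
    """Upper-bound decode throughput for a dense model: each generated token
    requires streaming (approximately) all weights through memory once."""
    model_bytes = params_b * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / model_bytes

# Assumptions: ~1792 GB/s for an RTX 5090, 8B params at 8 bits/weight (1 byte).
print(round(est_tokens_per_sec(1792, 8, 1)))  # ~224 tokens/sec ceiling
```

Real numbers land well below that ceiling, but nowhere near 2.2 tok/s, which is why results that slow usually point to the model spilling out of VRAM or running on CPU.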