I also have a benchmark that I'm using for my nanoagent[1] controllers.
Qwen3 is impressive in some aspects but it thinks too much!
Qwen3-0.6B shows even better benchmark results than Llama 3.2 3B... but it is 6x slower.
The results are similar to Gemma3 4B, but the latter is 5x faster on Apple M3 hardware. So maybe the utility is running better models where memory, rather than compute, is the limiting factor, such as on Nvidia GPUs?
What's cool with those models is that you can tweak the thinking process, all the way down to "no thinking". It may not be available in your inference engine, though.
FWIW, their readme states /nothink - and that's what works for me.
>/think and /nothink instructions: Use those words in the system or user message to signify whether Qwen3 should think. In multi-turn conversations, the latest instruction is followed.
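For example, against an OpenAI-compatible endpoint the soft switch is just text appended to a message. A minimal sketch (the URL and model name are placeholders, not nanoagent's API):

```typescript
// Sketch: toggling Qwen3's thinking via the in-message soft switch.
// Endpoint and model name are assumptions (e.g. a local Ollama server).
const res = await fetch("http://localhost:11434/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "qwen3:0.6b",
    messages: [
      { role: "system", content: "You are a terse assistant." },
      // The switch is plain text: "/no_think" (or "/nothink",
      // depending on the build) should disable the thinking block.
      { role: "user", content: "List three prime numbers. /no_think" },
    ],
  }),
});
const data = await res.json();
console.log(data.choices[0].message.content);
```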
Turns out "just" is not the word here. My benchmark is built from conversations, with a SystemMessage and some structured content in a UserMessage.
But Qwen3 seems to ignore /no_think when appended to the SystemMessage. I could add it to the structured content, but that would be a bit weird. It would have been better to have a "think" parameter, like temperature. A workaround is sketched below.
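Since the readme says the latest instruction is followed, one workaround (a hypothetical helper, untested here) is to tag the most recent UserMessage instead of the SystemMessage:

```typescript
// Hypothetical helper: append /no_think to the last user message,
// on the assumption that the latest instruction wins.
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

function withNoThink(messages: ChatMessage[]): ChatMessage[] {
  const out = messages.map((m) => ({ ...m }));
  for (let i = out.length - 1; i >= 0; i--) {
    if (out[i].role === "user") {
      out[i].content += " /no_think";
      break;
    }
  }
  return out;
}
```

That keeps the structured content untouched and only decorates the final turn, so the tag stays out of the system prompt.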
[1] github.com/hbbio/nanoagent