
A 0.6B LLM with a 32k context window is interesting, even if it was trained using only distillation (which is not ideal as it misses nuance). That would be a fun base model for fine-tuning.

Out of all the Qwen3 models on Hugging Face, it's the most downloaded/hearted. https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2...



These 0.5B and 0.6B models are _fantastic_ as draft models for speculative decoding. LM Studio makes this super easy to set up; I have a draft model enabled on pretty much every model I play with now.

My concern with these models, though, is that the architectures vary a bit between families, so I'm not sure how well this will carry over.
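
For reference, here's a minimal sketch of the same idea outside LM Studio, using Hugging Face transformers' assisted generation (which implements speculative decoding with a smaller "assistant" model). The repo IDs and generation settings are just illustrative assumptions, not a recommendation:

  # Sketch: speculative decoding via Hugging Face assisted generation.
  # Repo IDs and settings below are assumptions for illustration.
  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  target_id = "Qwen/Qwen3-32B"   # large target model (assumed repo ID)
  draft_id  = "Qwen/Qwen3-0.6B"  # small draft model (assumed repo ID)

  tokenizer = AutoTokenizer.from_pretrained(target_id)
  target = AutoModelForCausalLM.from_pretrained(target_id, torch_dtype="auto", device_map="auto")
  draft  = AutoModelForCausalLM.from_pretrained(draft_id,  torch_dtype="auto", device_map="auto")

  inputs = tokenizer("Explain speculative decoding in one paragraph.", return_tensors="pt").to(target.device)

  # Passing `assistant_model` turns on assisted/speculative generation:
  # the draft model proposes several tokens, the target verifies them
  # in a single forward pass.
  out = target.generate(**inputs, assistant_model=draft, max_new_tokens=128)
  print(tokenizer.decode(out[0], skip_special_tokens=True))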


Speculative decoding only depends on the tokenizer used. It's transferring either the draft token sequence or, at most, the draft logits to the main model.
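
To make that concrete, here's a simplified greedy draft-and-verify sketch (not any real library's API; `draft_model` and `target_model` are hypothetical callables returning next-token logits for every position). The point is that only token IDs cross between the two models, which is why a shared tokenizer/vocabulary is the main compatibility requirement:

  import torch

  def speculative_step(target_model, draft_model, input_ids, k=4):
      # 1) Draft model proposes k tokens autoregressively (greedy, for simplicity).
      draft_ids = input_ids
      for _ in range(k):
          next_id = draft_model(draft_ids)[:, -1, :].argmax(dim=-1, keepdim=True)
          draft_ids = torch.cat([draft_ids, next_id], dim=-1)
      proposed = draft_ids[:, input_ids.shape[1]:]      # the k proposed token IDs

      # 2) Target model scores prompt + proposal in ONE forward pass.
      logits = target_model(draft_ids)                  # [batch, seq, vocab]
      # Target's greedy choice at each position preceding a proposed token.
      verify = logits[:, input_ids.shape[1] - 1 : -1, :].argmax(dim=-1)

      # 3) Accept the longest prefix on which draft and target agree.
      matches = (verify == proposed)[0].int()
      n_accept = int(matches.cumprod(dim=0).sum())
      accepted = proposed[:, :n_accept]

      # Only token IDs moved between the two models above.
      return torch.cat([input_ids, accepted], dim=-1), n_accept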


Could be an LM Studio thing, but the qwen3-0.6B model works as a draft model for qwen3-32B and qwen3-30B-A3B, though not for the qwen3-235B-A22B model.


I suppose that makes sense. For some reason I was under the impression that the models need to be aligned / have the same tuning, or they'd have different probability distributions and the main model would reject the draft tokens really often.
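
A mismatched distribution only hurts speed, not correctness. The standard acceptance rule from the 2023 speculative decoding papers accepts a draft token with probability min(1, p/q), where p and q are the target's and draft's probabilities for that token, and resamples from the leftover distribution on rejection, so the output distribution stays exactly the target's. A small illustrative sketch (distributions assumed to be over a shared vocabulary):

  import torch

  def accept_or_resample(token_id, p_target, q_draft, generator=None):
      # Accept with probability min(1, p/q): the closer the two distributions,
      # the higher the acceptance rate, so a mismatched draft model mostly
      # just wastes compute rather than changing the output distribution.
      ratio = p_target[token_id] / q_draft[token_id]
      if torch.rand((), generator=generator) < ratio.clamp(max=1.0):
          return token_id, True
      # On rejection, resample from the residual max(0, p - q), renormalized.
      residual = (p_target - q_draft).clamp(min=0.0)
      residual = residual / residual.sum()
      return torch.multinomial(residual, 1).item(), False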


Have you had any luck getting actual speedups? All the combinations I've tried (the smallest 0.6B as draft + the largest model I can fit into 24 GB) got me slowdowns despite a decent acceptance rate.
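
Whether it pays off depends a lot on hardware, batch size, and acceptance rate, so it's worth measuring directly. A quick-and-dirty timing comparison, assuming `target`, `draft`, `tokenizer`, and `inputs` are set up as in the transformers sketch further up the thread (numbers will vary widely):

  import time

  def tok_per_sec(**gen_kwargs):
      start = time.perf_counter()
      out = target.generate(**inputs, max_new_tokens=256, do_sample=False, **gen_kwargs)
      elapsed = time.perf_counter() - start
      new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
      return new_tokens / elapsed

  print("baseline   :", tok_per_sec())
  print("speculative:", tok_per_sec(assistant_model=draft))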



