I've been trying both deepseek-r1:8b and deepseek-r1:32b using ollama on a local desktop machine.
Trying to get it to generate some pretty simple verilog code with extensive prompting.
It seems really bad?
Like, I specify what the module interface should be in the prompt, and it ignores it and makes up something bad. Utterly rubbish code beyond that. I specify a calculation to be performed, yet it calculates something very different.
What am I missing? Why is everyone so excited? It seems significantly worse to me than Llama. Both o1-mini and Claude Haiku are imperfect, sure, but way ahead: both follow the same prompt and get the interface and calculation as specified. Am I doing it all wrong somehow (more than likely)?
After fixing up my open-webui install I tried "testing 1 2 3, testing. Respond with ok if you see this." Deepseek-r1:8b started trying to prove a number theory result.
Is there a chance this thing is heavily optimised for benchmarking not actual use?
Just to confirm, Ollama's naming is very confusing on this. Only the `deepseek-r1:671b` model on Ollama is actually DeepSeek-R1. The other, smaller tags are distilled models based on Llama and Qwen.
Which, according to the Ollama team, seems to be on purpose, to avoid people accidentally downloading the proper version. Verbatim quote from Ollama:
> Probably better for them to misunderstand and run 7b than run 671b. [...] if you don't like how things are done on Ollama, you can run your own object registry, like HF does.
It’s definitely on purpose - but if the purpose were to help users make good choices, they could actually give information and explain what is what, instead of hiding it.
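For what it's worth, you can check for yourself what a given tag actually is from the CLI. A hedged sketch, assuming a current Ollama install (`ollama show` prints model metadata, including the underlying architecture):

```shell
# Inspect what the "deepseek-r1:8b" tag really is: the metadata reports
# a llama-family architecture, not DeepSeek's own, because it's a distill.
ollama show deepseek-r1:8b

# Only this tag is the actual R1 -- a download of hundreds of GB,
# far beyond most desktops, so don't pull it by accident:
# ollama pull deepseek-r1:671b
```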
I think if you find Ollama useful, use it regardless of what others say. I did give it a try, but found it lands in a weird place of "meant for developers, marketed to non-developers": llama.cpp sits at one extreme, apps like LM Studio sit at the other, and Ollama lands somewhere in the middle.
I think the main point that turned me off was their custom way of storing weights/metadata on disk, which makes it too complicated to share models between applications. I much prefer being able to use the same weights across all the applications I use, since some of these files end up being ~50GB.
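There is a partial workaround: Ollama can import an external GGUF via a Modelfile, so you can at least create a model from the same file other apps already use. A sketch with a made-up path and model name; note this doesn't fully solve the disk problem:

```shell
# Import a GGUF that llama.cpp / LM Studio already use into Ollama.
# Caveat: `ollama create` copies the weights into its own blob store
# (~/.ollama/models), so this avoids a re-download, not the extra disk copy.
cat > Modelfile <<'EOF'
FROM ./DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf
EOF
ollama create r1-distill-14b -f Modelfile
```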
I ended up using llama.cpp directly (since I am a developer) for prototyping and recommending LM Studio for people who want to run local models but aren't developers.
But again, if you find Ollama useful, I don't think there is any reason to drop it immediately.
Yeah, I made the same argument but they seem convinced it's better to just provide their own naming instead of separating the two. Maybe marketing gets a bit easier when people believe them to be the same?
ollama has their own way of releasing models.
when you download r1, you get the 7b.
that's because not everyone is able to run the 671b.
if it's misleading, it's more likely because the user didn't read.
I'm not super convinced by their argument to blame users for not reading, but after all it is their project so.
> It is very interesting how salty many in the LLM community are over Deep Seek
You think Ollama is purposefully using misleading naming because they're mad about DeepSeek? What benefit would there be for Ollama to be misleading in this way?
The quote would imply some crankiness. But yeah, it could just be general nerd crankiness too, of course. Maybe I shouldn't speculate too much about the reason in this specific case.
It's also not helping the confusion that the distills themselves were made and released by DeepSeek.
If you want the actual "lighter version" of the model the usual way, i.e. third-party quants, there's a bunch of "dynamic quants" of the bona fide (non-distilled) R1 here: https://unsloth.ai/blog/deepseekr1-dynamic. The smallest of them can just barely run on a beefy desktop, at less than 1 token per second.
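A rough sketch of what running the smallest dynamic quant with llama.cpp looks like; the shard filename here follows llama.cpp's split-GGUF convention but is illustrative, so check the actual repo for the exact names:

```shell
# ~130GB of weights even at the smallest dynamic quant: offload whatever
# layers fit onto the GPU, and let mmap page the rest in from disk.
./llama-cli \
  -m DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
  --n-gpu-layers 20 \
  --prompt "<|User|>Write a Verilog adder module.<|Assistant|>"
```

Even with a beefy GPU this is disk-bound, which is where the "less than 1 token per second" figure comes from.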
The press and news are talking about R1 while what you've been testing is the "distilled" version.
Sadly, Ollama's messaging about this is a bit confusing, and it isn't obvious that you're not actually testing the model that "comes close to GPT-4o" or whatever the tagline is, but rather basically completely different models. I think that explains the mismatch between expectation and reality here.
It's like an 8-year-old factoring large numbers: what's amazing isn't how well it's done, it's that it's done at all. Sure, amazing, but not at all useful, and nothing that would warrant the kind of fuss we've seen.
Seems the explanation is the deepseek-r1 models I was using are not, in fact, deepseek-r1. Thanks all for the heads up.
My take: the distills under 32B aren’t worth running. Quantization seems to hurt quality much more than it does for other models. 32B and 70B unquantized are very good. 671B is SOTA.
Ollama is "taking flak" for the confusion because the confusion is entirely of their own making. If they renamed/split what they provide into deepseek-r1 and deepseek-distilled-r1, far fewer people would be confused about this.
Do other models do well for the same use cases? I thought LLMs were only good for low-value adtech code and things already accessible on the public Internet, like tons of getters/setters and onEvent triggers, without much in the way of CS fundamentals, timing, or multi-domain implications.
They're also hypersensitive to what I'd describe as geometric congruency between input and output: your input has to be decompressible into its final form with basically zero IQ spent on it, as if the input were a zipped version of the not-yet-existing output that the LLM simply macro-expands.
R1 is just an improved LLM, nothing groundbreaking in those specific areas. Common limitations of LLMs still apply.
IMO, the layman's model of LLM should be more of predictive text than AI. It's a super fast keyboard that types faster than, not better than, your fingers.
> I thought LLMs are only good for low-value adtech codes and resources accessible on public Internet, like tons of getters/setters and onEvent triggers without much CS elements or time or multi domain implications.
I'm not sure where you get this generalization from. It seems like most models you can run locally today on consumer hardware are roughly at that level, at least in my experience. But then you have things like o1 "pro mode", which pretty much let me program things I couldn't before, and that no LLM until o1 could actually help me do.
They aren't DeepSeek-R1 at all, but distill models.
In LLM terms, distillation means a better model trains (fine-tunes) smaller models with its knowledge (its responses), so the smaller models get better too.
The DeepSeek 14b and 32b distills should be good enough; they are based on Alibaba's Qwen models, which are usually the best open-source models at 40b or under.
The DeepSeek 8b distill is based on Meta's Llama 3.
I would say 70b isn't worth the cost compared to 32b, except for coding... but if you want a model for coding, you should try one trained specifically for that, like DeepSeek-Coder or Mistral's Codestral.
DeepSeek-R1 is considered a general-use AI: good enough at many topics, but it won't excel at everything.
I tried that distill, plus the "original" at chat.deepseek.com and the Azure-hosted replica on a simple coding problem (https://taoofmac.com/space/blog/2025/01/29/0900), and all three were bad, but not that bad. I suspect the distill will freak out with very little context.
Everyone keeps mentioning that you’re using the distilled version, which is true. But the real question is, do you see acceptable results with any model, open or private?
Verilog is relatively niche as far as programming languages go, so I’m not surprised that you’d have trouble getting good output generally. You can only train the model on so much stuff, and there is probably limited high quality training data for verilog. It’s possible the model planners just decided not to prioritize this data in the training set. 8b sized models will especially struggle to have enough knowledge about niche topics to reason over it. Anything that small is really just a language tool for NLP tasks unless it’s trained specifically to do something.
All that said, your comment does illustrate a misunderstanding with the “thinking” models. They always output a long monologue on what to say, for anything, even “hello”. It’s a different skill to prompt and steer them in the right direction. Again, small models will be worse at everything, even being directed in the right direction.
TLDR: I think you need to find a new model, or at least try the “full” version through the web app or API first.
The mental model behind it is very different from that of "normal" programming languages, so there's less reuse of knowledge learned elsewhere ("knowledge" for lack of a better word).