
I've been trying both deepseek-r1:8b and deepseek-r1:32b using ollama on a local desktop machine.

Trying to get it to generate some pretty simple verilog code with extensive prompting.

It seems really bad?

Like, specify what the module interface should be in the prompt and it ignores it and makes something up instead. The code beyond that is utterly rubbish. Specify a calculation to be performed and it calculates something very different.

What am I missing? Why is everyone so excited? It seems significantly worse to me than Llama. Both o1-mini and Claude Haiku are imperfect, sure, but way ahead: both follow the same prompt and get the interface and calculation as specified. Am I doing it all wrong somehow (more than likely)?

After fixing up my open-webui install I tried "testing 1 2 3, testing. Respond with ok if you see this." Deepseek-r1:8b started trying to prove a number theory result.

Is there a chance this thing is heavily optimised for benchmarking not actual use?



Just to confirm, Ollama's naming is very confusing on this. Only the `deepseek-r1:671b` model on Ollama is actually deepseek-r1. The smaller sizes are distilled models based on Qwen and Llama.

https://ollama.com/library/deepseek-r1


Which, according to the Ollama team, seems to be on purpose, to avoid people accidentally downloading the proper version. Verbatim quote from Ollama:

> Probably better for them to misunderstand and run 7b than run 671b. [...] if you don't like how things are done on Ollama, you can run your own object registry, like HF does.


It’s definitely on purpose - but if the purpose was to help users make good choices, they could actually give information and explain what is what, instead of hiding it.


I read that they don't merge PRs for Intel or AMD hardware support, so it seems to be generally a bit of a shady project.


Just use llama.cpp directly


Could you expand on this? Is there any disadvantage to continuing with Ollama?

I use Ollama for prototyping and then move what I can to a vLLM setup


I think if you find Ollama useful, use it regardless of what others say. I did give it a try, but found it lands in a weird place of "meant for developers, marketed to non-developers": llama.cpp sits on one extreme, apps like LM Studio sit on the other, and Ollama lands somewhere in the middle.

I think the main thing that turned me off was their custom way of storing weights/metadata on disk, which makes it too complicated to share models between applications. I much prefer being able to use the same weights across all the applications I use, since some of them end up being around 50GB.

I ended up using llama.cpp directly (since I am a developer) for prototyping and recommending LM Studio for people who want to run local models but aren't developers.

But again, if you find Ollama useful, I don't think there is any reason to drop it immediately.


Yeah, I made the same argument but they seem convinced it's better to just provide their own naming instead of separating the two. Maybe marketing gets a bit easier when people believe them to be the same?

    ollama has their own way of releasing their models. 
    when you download r1 you get 7b. 
    this is due to not everyone is able to run 671b. 
    if its missleading then more likely due to user not reading.  
I'm not super convinced by their argument to blame users for not reading, but after all it is their project so.


If nothing is specified, the rule of least surprise says you should get the full vanilla version, I would say.


The conspiracy theorist in me thinks that it's deliberate sabotage of a Chinese model.


No, those checkpoints have also been provided by DeepSeek.


It is very interesting how salty many in the LLM community are over DeepSeek.

DeepSeek was more or less ignored for a very long time before this.


> It is very interesting how salty many in the LLM community are over DeepSeek

You think Ollama is purposefully using misleading naming because they're mad about DeepSeek? What benefit would there be for Ollama to be misleading in this way?


The quote would imply some crankiness. But yeah, it could just be general nerd crankiness too, of course. Maybe I should not imply or speculate too much about the reason in this specific case.

There is no benefit I think.


It's also not helping the confusion that the distills themselves were made and released by DeepSeek.

If you want the actual "lighter version" of the model the usual way, i.e. third-party quants, there's a bunch of "dynamic quants" of the bona fide (non-distilled) R1 here: https://unsloth.ai/blog/deepseekr1-dynamic. The smallest of them is just barely able to run on a beefy desktop, at less than 1 token per second.
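Back-of-envelope, you can see why even the smallest of those dynamic quants needs a beefy desktop: the footprint is roughly parameter count times average bits per weight. A minimal sketch (assuming one uniform bit width, which real dynamic quants don't actually use since they mix precisions per layer, so treat these as ballpark numbers only):

```python
# Rough memory-footprint estimate for a quantized model.
# Assumption: a single uniform average bits-per-weight; real "dynamic"
# quants mix bit widths per layer, so this is only a ballpark figure.

def quantized_size_gb(n_params: float, avg_bits_per_weight: float) -> float:
    """Approximate on-disk/in-memory weight size in gigabytes."""
    total_bits = n_params * avg_bits_per_weight
    return total_bits / 8 / 1e9  # bits -> bytes -> GB

# DeepSeek-R1 has ~671B parameters.
print(round(quantized_size_gb(671e9, 1.58)))  # ~1.58-bit average quant -> 133
print(round(quantized_size_gb(671e9, 16)))    # full fp16/bf16 weights  -> 1342
```

The ~130GB figure for the low-bit quant is roughly what the dynamic-quant blog post reports for its smallest variant, and it still has to fit in RAM plus VRAM, hence the sub-1-token-per-second speeds on a desktop.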


Also, Ollama is traditionally very sloppy with the chat templates they use, which does impact model performance.


> Ollama is traditionally very sloppy with the chat templates they use

Not that I don't believe you (I do, and I think I've seen them correct this before too), but do you happen to have specific examples of when this happened?


https://github.com/ollama/ollama/issues/1977

More recently, DeepSeek 2's template had a space after the assistant turn, causing issues with output quality and language: https://www.reddit.com/r/LocalLLaMA/comments/1dko6rp/if_your...
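To make that failure mode concrete: the model was trained on exactly-formatted turn boundaries, so the rendered prompt has to match byte-for-byte, and a single stray space after the assistant marker changes what the model is conditioned on. A toy sketch (the delimiter strings here are simplified stand-ins, not DeepSeek's real template):

```python
# Toy illustration of how a chat-template bug changes the prompt bytes.
# The <|...|> delimiters are simplified stand-ins, NOT the real DeepSeek
# template; the point is only the byte-level difference.

def render(messages, assistant_prefix):
    """Render a chat transcript into a single prompt string."""
    out = []
    for role, text in messages:
        out.append(f"<|{role}|>{text}<|end|>")
    out.append(assistant_prefix)  # the model continues generating from here
    return "".join(out)

msgs = [("user", "testing 1 2 3")]

correct = render(msgs, "<|assistant|>")
buggy = render(msgs, "<|assistant|> ")  # trailing space, as in the bug report

print(correct == buggy)  # False: a single byte differs
print(repr(buggy[-2:]))  # shows the stray trailing space
```

Since the model almost never saw a space in that position during training, its next-token distribution starts off-manifold, which is why a one-character template bug can degrade output quality or even flip the output language.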


I feel this particularly when I use GGUF support.

How do you get accurate information on the template structure?


The press and news are talking about R1 while what you've been testing is the "distilled" version.

Sadly, Ollama has a bit of confusing messaging about this, and it isn't super obvious that you're not actually testing the model that "comes close to GPT-4o" or whatever the tagline is, but instead testing basically completely different models. I think this explains the mismatch between expectation and reality here.


You are hard to impress. Running a 1/20th-sized version locally would have been sci-fi level 10 years ago.

For such small models I would recommend specialized models only, like DeepSeek Coder. But I think that one is lagging behind the state of the art now.


Like an 8-year-old factoring large numbers: it's not how well it is done that amazes, it's that it is done at all. Sure, amazing, but not at all useful, and not something that would justify the kind of fuss we've seen.

Seems the explanation is the deepseek-r1 models I was using are not, in fact, deepseek-r1. Thanks all for the heads up.


Yeah, in your defence, it seems like the Ollama project made it confusing on purpose to mess with DeepSeek.


The distilled versions are terrible and the full version seems very slow, but the latter is certainly in o1’s league.



> What am I missing?

Like, everything. You do not even mention your machine spec, so I assume you just picked the sizes that fit, which are probably the quant versions.

Quant versions of the "small" models do not perform that well. Not the way you'd expect them to.


You’re using slightly improved Qwen and Llama, not R1. R1 only comes as the 671b model.


My take: the distills under 32B aren’t worth running. Quants seem to impact quality much more than they do for other models. 32B and 70B unquantized are very good. 671B is SOTA.


In my own tests even the 70B distill has an unacceptably high rate of hallucinations that makes it hard to trust the results.


Those aren't actually DeepSeek. They are just Qwen or Llama distilled by DeepSeek. It confused me too. Ollama is taking flak for the confusion.


Ollama is "taking flak" for the confusion because it's entirely created by them. If they renamed/split what they provide into deepseek-r1 and deepseek-distilled-r1, far fewer people would be confused about this.


Indeed.


Do other models do well for the same use cases? I thought LLMs were only good for low-value adtech code and resources accessible on the public Internet, like tons of getters/setters and onEvent triggers without much CS, timing, or multi-domain implications.

They're also hypersensitive to what I'd describe as geometric congruency between input and output: your input has to be decompressible into final form with basically zero IQ spent on it, as if the input were a zipped version of the not-yet-existing output that the LLM simply macro-expands.

R1 is just an improved LLM, nothing groundbreaking in those specific areas. Common limitations of LLMs still apply.

IMO, the layman's model of LLM should be more of predictive text than AI. It's a super fast keyboard that types faster than, not better than, your fingers.


> I thought LLMs were only good for low-value adtech code and resources accessible on the public Internet, like tons of getters/setters and onEvent triggers without much CS, timing, or multi-domain implications.

I'm not sure where you get this generalization from. Most models you can run locally today on consumer hardware are kind of at that level, at least in my experience. But then you have things like o1 "pro mode", which pretty much allowed me to program things I couldn't before and that no LLM until o1 could actually help me do.


They aren't DeepSeek at all but distill models. In LLM terms, distillation means the better model trains (fine-tunes) the smaller models with its knowledge (responses), so the smaller models also get better.
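As a sketch of the idea: the R1 distills were made by fine-tuning smaller models on R1's generated responses, but the classic way to formalize "training on the teacher's knowledge" is driving the student's output distribution toward the teacher's, e.g. by minimizing a KL divergence. A minimal illustration with made-up toy probabilities (not real model outputs):

```python
import math

# Minimal sketch of knowledge distillation: the student is pushed to
# match the teacher's next-token distribution. The probabilities below
# are made-up toy numbers, not real model outputs.

def kl_divergence(teacher, student):
    """KL(teacher || student): the gap distillation drives toward 0."""
    return sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0)

teacher = [0.70, 0.20, 0.10]        # teacher's next-token probabilities
student_before = [0.40, 0.35, 0.25]
student_after = [0.65, 0.22, 0.13]  # after fine-tuning on teacher outputs

# Distillation succeeded if the student's distribution moved closer
# to the teacher's:
print(kl_divergence(teacher, student_before) > kl_divergence(teacher, student_after))  # True
```

The practical upshot mentioned in the thread: the distill inherits the teacher's behavior only approximately, which is why the small "R1" models feel so different from the real 671b R1.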


DeepSeek 14b and 32b should be good enough; they are based on Alibaba's Qwen model, usually among the best open-source models at 40b or less.

DeepSeek 8b is based on Meta's Llama 3.

I would say 70b isn't worth the cost compared to 32b, except for coding... but if you want a model for coding, you should try a model specifically trained for that, like DeepSeek-Coder or Mistral's Codestral.

DeepSeek-R1 is considered a general-use AI. It does well enough on many topics but won't excel at everything.


I tried that distill, plus the "original" at chat.deepseek.com and the Azure-hosted replica on a simple coding problem (https://taoofmac.com/space/blog/2025/01/29/0900), and all three were bad, but not that bad. I suspect the distill will freak out with very little context.


There's a chance the 8b and 32b models are not meant to work like ChatGPT. DeepSeek on the web chat is 671b, as far as I know.


Why don't you give the versions on the website a go? Because they're very capable.


Everyone keeps mentioning that you’re using the distilled version, which is true. But the real question is, do you see acceptable results with any model, open or private?

Verilog is relatively niche as far as programming languages go, so I’m not surprised that you’d have trouble getting good output generally. You can only train the model on so much stuff, and there is probably limited high quality training data for verilog. It’s possible the model planners just decided not to prioritize this data in the training set. 8b sized models will especially struggle to have enough knowledge about niche topics to reason over it. Anything that small is really just a language tool for NLP tasks unless it’s trained specifically to do something.

All that said, your comment does illustrate a misunderstanding with the “thinking” models. They always output a long monologue on what to say, for anything, even “hello”. It’s a different skill to prompt and steer them in the right direction. Again, small models will be worse at everything, even being directed in the right direction.

TLDR: I think you need to find a new model, or at least try the “full” version through the web app or API first.


Not only is Verilog comparatively "niche", the mental model behind it is very different from that of "normal" programming languages, so there's less reuse of knowledge learned elsewhere ("knowledge" for lack of a better word).



