I've worked on projects in the airline and health industry which are highly regulated too. The regulations can be incredibly difficult to process and implement, and make sure you adhere to everything correctly. I've been involved in multiple scenarios where people have made false assertions about compliance or lack of. I'd still place a bet that the SOA models make _far_ less mistakes than humans.
They might make fewer mistakes, but they aren't evenly distributed. They don't use logic when making mistakes, it is gaps in the training data and now large of a span they have to bridge in the latent space. Just as they aren't smart like humans, they aren't stupid like humans. Don't mistake rate for quality.
For some reason, tons of people seem to be in camps at both extremes. It's either "AI sucks don't trust it!" or "AI is so much better than humans!"
But the most reasonable take, which I'm happy to see reflected in so many comments in this thread, is… use both.
Do an AI pass, and have humans verify, and vice versa. Let the humans drive the AI. Then the unique shortcomings of each party can be covered by the other's strengths.
AI review is never going to beat a fully resourced human review.
It might beat an underresourced human review, on time, efficiency, cost metrics. But on the metric of accuracy, throwing unlimited humans at a problem will still beat throwing unlimited AI at it
That's an irrelevant comparison because cost is always a constraint, so there are not going to be unlimited AI or humans. The question is how to optimally combine them for a given cost.
> Do an AI pass, and have humans verify, and vice versa. Let the humans drive the AI.
You can do that, sure. But doing so negates any improvements in speed the LLM brought. And at that point, you may as well just do it yourself to begin with.
When Google showed up on the scene I found I no longer needed to memorize basic syntax and other such things. If I couldn't remember on the fly, i'd just do a quick google search and move on. This freed space in my mind to instead focus on bigger & better things.
I use GenAI tools when coding a lot, but I do not vibe code. I go through everything it generated, and we iterate. And yes, it doesn't save me a lot of time. But what it does do is free up mental capacity in a similar manner. But instead of syntax, it's more complicated patterns. Maybe I don't remember how to stitch something together, but i know it can be done. Instead of spending the time to look it up and then code it, I just tell it to do it for me.
Yeah, humans reviewing the AI review can only detect the false positives, where the LLM claims something is non-compliant and flags it for review/correction by a human or another agent. Human review can’t find the false negatives (true deficiencies not flagged) unless you do a full audit yourself to find whatever deficiencies the AI missed.
>I'd still place a bet that the SOA models make _far_ less mistakes than humans.
Genuine question: your top coder seems to be producing the most error-free code from your perspective, has the deepest knowledge of the architecture and codebase, and is faster on the trigger than the others.
But your top coder has proven and verifiable dementia, where they will confidently assume the existence of apis and code that do not exist, mix up the purpose of others and forget other things, and you can't predict when and how they will introduce errors into the system or the severity of such errors.
Are you really comfortable letting this person with dementia generate most of your codebase in the airline and health industry?
I also hope you have an iron-clad agreement that prevents the model provider from doing silent updates because all your evidence of correctness you collected thus far goes out the window in that case.
Another genuine question:
You have witnessed a human coder and the AI you're using make the same important mistake. Assuming you do not have the time and resources to retrain, fine tume, and test your frontier model:
Who would you trust not to make the same mistake multiple times in the future after you have warned them that their job depends on it, the AI or the human?
Your top coder has guard rails in place to prevent him autonomously going free - right? This is how you should approach agentic development with LLMs. Like it or not, we are the final bastion, the gatekeepers. The hallucination thing I think is mostly overblown and from speaking to colleagues it seems to vary wildly depending on which model and harness you are using - always go for SOA. In the last 3 months I can count on one hand where it's done something wrong and that's primarily as I'm operating it with guard rails and giving it context.
>Your top coder has guard rails in place to prevent him autonomously going free - right?
The parent is implying they would prefer an AI when working in the airline and health industry because it makes less errors. Read the comment again.
They have not said, "Hey, I work in the airline and health industry and I'd love to use AI for a couple of the bullshit IT UIs we have as long as we can put guardrails on the AI to stay in its lane."
I asked a yes or no question. The guardrails you can put to mitigate errors are the same guardrails pre-AI for the humans (tests, regressions, reviews). If you were wary of employing a top lead engineer with verifiable dementia prior to AI for a mission critical system, logic implies you should think twice giving that much responsibility to an AI as well.
> The hallucination thing I think is mostly overblown
Can you predict when and how the SOTA model will hallucinate? Yes or no. Can you predict the severity impact of that error beforehand? Yes or no.
>from speaking to colleagues it seems to vary wildly depending on which model and harness you are using
You have partially answered my question it would seem.
> Can you predict when and how the SOTA model will hallucinate? Yes or no. Can you predict the severity impact of that error beforehand? Yes or no.
No, but the same can be said for your colleagues. You might call what the LLM does hallucinations, I'd call them mistakes. I think we have totally forgotten that humans make them all the time and are confidently wrong too.
Your original question, doesn't really get to the bottom of the point I'm trying to make, and I don't really feel it fairly represents the issue we are talking about here. They are not the same things.
This stupid argument again. The number of mistakes _does not matter_. Get. This. In. Your. Head. The predictability of the _type_ of error is what matters. For LLMs and machine learning in general the error distribution is not what you would expect and it is not possible to predict either.
It's not just about it taking the technical competence away from our job, it's taken away the joy [1] which I wrote about.
I feel like many of my peers are beating around the bush on this topic and in denial. Even if you accept it can do a large portion of the technical part of our work, we are just supervisors at this point making sure it doesn't do any stupid shit. What is the point? Where is the fun in this? Where is the challenge? At least I have enjoyed building my career over the last 20+ years and building software, but find little joy in the work I'm doing now.
I think we're going to see a massive exodus of folks leaving the profession and a huge mental health crisis, long before the folks working in other sectors realise what's hit them.
I don't see the problem here. It's a great product and if they want to make money then I don't mind. If it's too expensive, and they hike the price to something ridiculous then I'll vote with my wallet.
I’m fine with paying a bit more. I honestly don’t think I even use any of the premium features. I started paying because their founder answered some question I sent years ago and I figured that kinds of support deserved my support. I could still be on the free tier if cost were a concern.
With that said, I do find the direction here concerning. Quietly rewriting values, removing promise of free tier, hiking prices with almost no notice. I’m concerned that this feels sudden and sneaky. Sneaky behavior erodes trust.
Management and leadership values, character, and integrity matter because it's unwise to assume there is some homogenous allegiance to customers behind the propaganda of putting the customer first. PE will and must squeeze for their margins as is their wont. They have learned it's unwise to draw attention to this.
I'm in the same boat, became a premium member to support Bitwarden and use the built-in authenticator. The subscription price is now a negative proposition, alongside the silent rollout and the other red flags raised in the post. I'll probably move to self-hosted, since I have spare compute on my VPS.
I am fine with the price increase, for me its how sneaky they're being about everything. If they sent a few emails about the recent changes I wouldn't care, but it feels like they do not want customers to know which is the last thing I want from a password manager.
Indeed. As I'm sure the new PE-focused CEO knows, the sale of a company includes not just the typical balance sheet items but also intangible assets such as goodwill. Being sneaky about is an attempt to minimize the loss of such intangibles ahead of a sale.
The problem is the rug-pull. You can't go and proudly state "free forever", and then silently back down on that commitment. That is a textbook example for the enshittification cycle... lure users in with grand promises, sell out once you got enough of a following.
(Well, technically, you can, but then don't complain about getting called out)
You must be getting a different version of that page than me. The free tier is there but there’s no “always free” verbiage. There is “start free” verbiage.
Edit: “always free” was hidden under a collapsed section
LOL.. you are correct. Funny thing though... the 'Always Free' text is linked to a "/start-free/" action\page. One could argue that they are hedging their bets.
Some other commenter says there are Archive.org cached versions with "Start free" instead of "Always free", so they must have backpedaled on this. Maybe they realized they turned the knob a bit too much towards "hot", increasing the temperature of the proverbial water too noticeably.
I’m not willing to check all the pages on archive.org but for sure a month ago they had a big “Basic Free” tile in the plan comparison. Now it’s just Premium and Family. They are definitely downplaying the ability to use it for free.
Seems like they want to downplay the mentally that you would never benefit from an upsell to the paid plans, even if the free plan itself stays always free
Let them cook. Anything that they can do to get rid of the absolute hell that is dependencies in the JS ecosystem is worthwhile. I really don't care what they add as long as it's maintained
Usually on the discover weekly playlists. It started with hip hop jazz remakes about a year ago, presumably as I like hip hop, have engaged with genuine hip hop jazz covers before and these were going viral at the time.
I hate to think what else might have surfaced on these generated playlists (which for me are the #1 selling point and reason I have stayed with Spotify), that I haven't noticed yet is AI.
I'm always wondering how long it will take for popular sentiment to finally shift. So many years of things like Blinky the fish in the Simpsons really did a number on our shared consciousness.
I think the series of actual nuclear disasters from the 1950s to 2000s - plus the fear of a hot nuclear war in the ‘70s - had more impact on the collective consciousness than The Simpsons.
This has been exactly my experience too. I've tried multiple harnesses (pi, claude code, codex) with multiple variants of qwen3.6 and gemma4 driven by both o mlx and ollama - and every single time I try to do anything meaningful I end up in a loop. On a 64GB Macbook Pro M3 Max.
I really don't know what the hell people are doing locally, and suspect a lot of the hype around running these models locally is bullshit. Sure, you can make it do something but certainly nothing useful or substantial.
I have been testing and using Qwen3.6 27B (running from my 3090) since it dropped and I genuinely think this is the first consumer hardware-grade model that can actually replace frontiers for a lot of workloads.
I ran 8 tests on a variety of open-weights models, and opus 4.7 (1mil ctx version) and the little dense model was right behind it: https://github.com/sleepyeldrazi/llm_programming_tests/tree/...
Of note is that opus was the only model to push back against the spec on the hardest challenge, saying 'thats not possible', when there are links in the spec to examples of it being done.
There may be problems with the mlx versions, as i haven't had any looping in all the testing i've done, which is all my agentic and coding work the last couple of days (since it dropped). I have had tool_call misses 4 or 5 times so far, which isn't ideal but no looping. First I used it in pi-mono and later when i realized it's a serious model switched to opencode.
My setup is llama.cpp running on a 3090 in WSL, unsloth IQ4_NL with those flags:
--ctx-size 128000 \
--jinja \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.0 \
--repeat-penalty 1.0 \
--presence-penalty 0.0 \
--threads 12 \
--gpu-layers 99 \
--no-warmup \
--no-mmap \
-fa on
Maybe someone knows any tips to optimise prompt processing as that's the slowest part? It takes a few minutes before OpenCode with ~20k initial context first responds, but subsequent responses are pretty fast due to caching.
Many of us tested 27B and 35B side by side, and the dense model is significantly smarter. It indeed is slower, but 35B makes a lot of mistakes 27B doesn't.
I haven't honestly dug around to figure out if there's a hardware reason for it, but prompt processing has always been a lot slower for me on macs in general. I mostly use MLX on my 24GB M4 Pro though, so I will pull llama.cpp on it as well to see what the prefill is like.
I've gotten around 16 t/s gen with 4bit and mxfp4 on that model for generation. The 3090 I mentioned has a little over 900 gb/s, while those macs i think are around 270 GB/s. If my understanding is correct, macs do utilize the bandwidth better in this case, but it still doesn't make up the difference (on the 3090 it's around 30-35 t/s depending on size of ctx).
Also, do run a quick experiment removing the cache quants if you want to tinker with it a bit more, iirc KV quant does add a small overhead during prefill.
I would be very interested to know your prefill and generation numbers.
Combine that with full ZIMs of Wikipedia and Stack Overflow, plus documentation of your languages of choice, and you should be golden. I have 4TB SSDs in almost all my laptops (except Macs due to Tim Apple's price-gouging, but I am transitioning away from macOS), and I sync my entire eBook library as well so I am fully covered on the reference manual front.
I specifically tested on tasks I designed because I know every modern model, not only local ones, are bechmaxxed. The common benchmarks most labs use are (very likely) in their datasets to a degree (I'm assuming unintentionally, but is still highly probable) and there was a recent report on how easy it is to actually cheat them, as shown by people at UC Berkeley https://rdi.berkeley.edu/blog/trustworthy-benchmarks-cont/
That is precisely why my testing has been daily driving the model for everything + 8 tasks in a domain I care about. Could there be something very similar in their datasets? Of course, at least for most of the tasks, but if that lead to the good performance experience and results I'm getting, I am personally ok with that. I don't care how high the numbers are on the common benchmarks, only if it works well enough for me.
And if this model doesn't work for you, that's perfectly ok. Everyone has different needs from models. I was just impressed that it did for me, as it was a first from a local model.
If the "loop" you mean is the infinite reasoning cycle ("Wait, actually... On second thought..."), you might want to try setting a reasoning budget. For llama.cpp, use `--reasoning-budget 1024 --reasoning-budget-message "Proceed to final answer."` to force the model to reach a conclusion.
I admit I sometimes get caught up in the tooling for its own sake, but I find local models useful for specific tasks like migrating configuration schemas, writing homelab scripts, or exploring financial data.
It might sound a bit paranoid, but privacy is another major driver for me. Keeping credentials and private information off cloud services is worth the extra friction.
> a lot of the hype around running these models locally is bullshit. Sure, you can make it do something but certainly nothing useful or substantial.
There is certainly a lot of hype around local models. Some of it is overhype, some of it is just "people finding out" and discovering what cool stuff you can do. I suspect the post is a reply to the other one a few days ago where someone from hf posted a pic with them in the plane, using a local model, and saying it's really really close to opus. That was BS.
That being said, I've been working with local LMs since before chatgpt launched. The progress we've made from the likes of gpt-j (6B) and gpt-neoX (22B) (some of the first models you could run on regular consumer hardware) is absolutely amazing. It has gone way above my expectations. We're past "we have chatgpt at home" (as it was when launched), and now it is actually usable in a lot of tasks. Nowhere near SotA, but "good enough".
I will push back a bit on the "substantial" part, and I will push a lot on "nothing useful". You can, absolutely get useful stuff out of these models. Not in a claude-code leave it to cook for 6 hours and get a working product, but with a bit of hand holding and scope reduction you can get useful stuff. When devstral came out (24B) I ran it for about a week as a "daily driver" just to see where it's at. It was ok-ish. Lots of hand holding, figured out I can't use it for planning much (looked fine at a glance, but either didn't make sense, or used outdated stuff). But with a better plan, it could handle implementation fine. I coded 2 small services that have been running in prod for ~6mo without any issues. That is useful, imo. And the current models are waaay better than devstral1.
As to substantial, eh... Your substantial can be someone else's taj mahal, and their substantial could be your toy project. It all depends. I draw the line at useful. If you can string together a couple of useful tasks, it starts to become substantial.
Can you share more on how your setup has changed over time for running these? Do you prompt them for code samples like some people do with ChatGPT or did you integrate them into your IDE or some kind of custom harness?
Sure. I usually work with devcontainers from vscode. They provide great integration ootb (port forwarding & stuff) and are ok for containing the agents for most cases. If you want to work on docker projects I also tried vagrant for vm with docker inside, and you instantiate the agent from vagrant.
For local models I used mainly cline and then roo code extensions. Roo was a bit better because it offered more customisation (prompts, tool choice, etc). I found that local models need shorter prompts and less tools to be effective. Unfortunately roo seems to be discontinued, no idea what I'll use after it stops working. Cline works fine for most of the cases ootb, especially if you run inference on a platform that supports good kv caching - I use vLLM.
For subscriptions I use their own harness, as you get the best bang for the buck. For 3rd party subscriptions that don't have their own harness I use opencode (I got a very cheap sub for GLM that I use for exploration and oss projects).
Same here. Every time a new local model comes out, I give it a spin with a pretty vanilla coding task ("refactor this method to take two parameters instead of one", or "fix this class of compiler warning across the ~20 file codebase") and more often than not, they get in endless loops, or fail in very unusual ways. They don't yet even approach the usefulness of SOTA models. It's obviously not a fair comparison, though. My 20GB GPU is never going to beat whatever enormous backend Google or Anthropic have.
You can do this with really small models but you have to do a more legwork. I wouldn't expect most trivially small models to handle anything more than 1 file reliably. The new qwen 3.6 is different though, I have heard cases where it is behaving close to sonnet.
That said I don't see why people are so scared to touch code even if it saves them 500 euro a month. Using my IDEs find across my repo and auto replacing 2 patterns is trivial to do and way faster to do by hand. I mostly use small models, it prevents a lot of the issues I've seen with large models and vibe/agentic coding medium to long term. I also write a lot of code.
You need to set sampling parameters for the llm. Had the same issue with Qwen3.5 when i first started. You can grab them off the model card page usually.
From Qwen3.6 page:
Thinking mode for general tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0
min_p author here. min_p is strictly better than top_p and top_k. The big labs don't know shit about sampling, and give absolutely nuts recommendations like this.
set min_p to like 0.3 and ignore top_p and top_k and you'll be fine.
There's better samplers now like top N sigma, top-h, P-less decoding, etc, but they're often not available in your LLM inference engine (i.e. vLLM)
I’m wondering though, what does extra creativity in code generation actually look like? How is the creativity expressed in code? Does the LLM reach for Bubble Sort instead of Quicksort? Maybe it decides that sorting only the first 10 elements of an array is enough? Funny variable names? Cursing in comments?
In this case, we are not arguing that min_p is better for "creative code" (you really don't want high temperature anywhere near your code generation, despite the "turning up the heat" framing of our paper) - at least in my post claiming min_p is strictly better than top_p above.
We are instead arguing that min_p handles truncating tokens that are more likely to lead to degeneration/looping because it is partially distribution aware. Fully distribution aware samplers like the ones I mentioned above (i.e. P-less decoding) are strictly superior due to using the whole distribution to decide the truncation at every time step.
Code hallucinations, like many LLM hallucinations, can be seen as accumulation of small amounts of "sampling errors".
Yes, have tried all of these (as per the docs). Have you actually tried these? Because I have tried all 3 configurations with agentic coding that you mentioned and have the same result - loops.
I've used only Qwen3.5 so far for work and was, after initial struggles, successful with GPU setup, no mlx. Ngl the fact that they are using `presence_penalty: 0` and no `max_tokens` is weird after that exact setup caused me "initial struggles", but i've set up a simple docker-compose with vllm and qwen3.6 right now to test it out and it worked perfectly fine for me.
I've been using qwen3.5 (122b) with claude code for months, and it's definitely dumber than sonnet/opus, but it works through things reasonably well (i.e. writes half-decent code and tool calls usually work), and I pretty much never run into loops now.
and make sure you're following Unsloth's recommendations for temperature/etc.
Task-driven repo, clear your context (restart the harness), check the results.
Don't try for a rambling session where you let the thing grind for hours on a huge system. It will predictably choke or end in those loops. But do a few small chunks of work, exit the harness, then pick up the next few small chunks... It doesn't feel as magical, but it seems to be more effective, even when your model is Claude.
In the article the author describes what they made. It's definitely not bullshit, but it's also not as reliable or as handsfree as the 1t models.
For people who aren't completely vibe or agent coding these models are better than say copilot or the free models appearing after a Google search. Probably better than chatgpts flagships in some ways.
I mostly use 4b to 9b models for basic inquiries and code examples from libraries I haven't used before. Many of them can solve pretty hard math problems, and these are several steps away from say qwen3.6.
I would not discount running models locally. It's the best case scenario of a future with LLMs from a human rights and ecological perspective.
> Sure, you can make it do something but certainly nothing useful or substantial.
It works great for me. But I like to review the code and understand what it's doing, which doesn't appear to be how people do "useful or substantial" programming these days.
Everytime I am on here I am baffled by how many people just spin the wheel these days. The most important part of the sdlc for me is having humans involved in the code base. Can't plan improvements, features, refactors, etc if you don't know what the code looks like. But here we are I guess.
Hosted models are big, and there is a lot going on behind the scenes that we users have no visibility into. OpenAI, Anthropic, Google, etc do much more than just feed raw prompt tokens straight into a big 1-2TB static model and pipe the output tokens back to the web browser. The result of this is that they can do more, and end-users can get away with a lot more in terms of vague prompts and missing background.
The biggest lesson I've learned working with local models so far is: with the smaller models, you have to understand their limitations, be willing to run experiments, and fine-tune the heck out of everything. There are endless choices to be made: which model to use, which quant, thinking or not, sampling parameters, llama.cpp vs vLLM, etc. They much more fiddly for serious work than just downloading Claude Code and having it one-shot your application. But some of us enjoy fiddling so it all works out in the end.
I've done zero fine tuning in the local models I use. I also didn't do a lot of experiments except asking the 4 or 5 I downloaded what version of x package was the newest. For my work flows small models are king.
* New models running in llama.cpp (what's under the hood of ollama et al) frequently require bug fixes.
* The GGUF models that run in llama.cpp frequently require bug fixes (Unsloth is notorious for this -- they release GGUF models about 10 minutes after official .safetensors releases).
* You're probably running a <Q8 quantization of the model, and a good chance <BF16 quantization for KV cache. This makes for compounding issues as context grows and tool calls multiply.
Local models really are great but I think a major problem are the people in groups like r/localllama who run models at absurd quantization levels in order to cram them on their underpowered hardware and convince themselves that they're running SOTA at home.
The best way to run these models is, frankly, a lot of VRAM and vLLM (which is what the people developing these models are almost certainly targeting).
I’m frequently surprised how little I can find online about exactly this - different harnesses for local models and how to set them up. The documentation for opencode with local models is (IMO) pretty bad - and even Claude Opus (!) struggled to get it running. And so far I’ve not found a decent alternative to Claude Desktop.
(I’ve recently discovered that you can pipe local models into Claude’s Code and Desktop, so this is on my list to try).
Qwen3.6 is brand new. But also, search engines are so plastered with AI slop that is written by tools and companies that have no interest in you using local models. Ollama makes it 1 command to run local small models, but with the newest ones there can be kinks to work out first.
/R/localllama is okay for some information but beyond that there is so much noise and very little signal. I think it's intentional.
Thanks. I’ve been experimenting with local models for over a year now, on and off, so this isn’t just limited to the latest Qwen. Anyway, I have no problem running them, but there’s a huge difference between running something via a chat interface and running it a la Claude Code so that it can interact with the local environment and create/edit files. This is the aspect that’s difficult, in my experience.
It’s all about tooling, if the ai can fetch data it can do something rad with it. Use something like an ai harness to have an mcp server and other tooling to improve the harness and the tools I made this for my own learning: GitHub.com/ralabarge/beigebox
Have to call out that comment about grok code being sub par. I used it exclusively when it was free in Cursor and have nothing bad to say about it. And that was months ago. I imagine it’s a lot better now.
reply