Hacker News | tarruda's comments

One thing that annoys me is my mobile carrier's ability to just throw up ad popups.

Is that something that GrapheneOS fixes?


Wtf‽ I didn't know that was possible.


Your carrier does what now?


I have a Pixel 8a with a TIM SIM card, and every once in a while I see an ad popup on my phone.


Go to [Settings] » [Apps] » [Special app access] » [Display over other apps] and check if any preinstalled carrier apps or anything suspicious has this permission granted.
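If ADB is handy, roughly the same check can be done from a shell. This is a sketch assuming USB debugging is enabled and a recent Android build (where `appops query-op` is available):

```shell
# List packages that have been granted the "display over other apps" permission.
adb shell appops query-op SYSTEM_ALERT_WINDOW allow
```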


Just checked, and only "Phone" and "Google" have this permission.

There are no preinstalled apps; I bought this phone clean in Germany and then added a Brazilian SIM card when I got back.

Could it be that the SIM card has some control over the Phone app?


Apparently this is handled by the privileged STK[1] service. It can launch the browser, which I think is what's happening.

GrapheneOS presently doesn't do anything different in this case; they pull it from AOSP without modifications. However, you can disable it using the frontend app (SIM Toolkit), as someone pointed out, but as far as I can tell this requires the applet on the SIM card to cooperate (i.e. offer the opt-out).

Otherwise you can disable the STK altogether with ADB, but that will also lock you out of other interactive SIM card functions, which might not be a big deal.
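For reference, the ADB route is a sketch along these lines (`com.android.stk` is the AOSP package name; it may differ per device):

```shell
# Disable the SIM Toolkit system app for the current user (reversible):
adb shell pm disable-user --user 0 com.android.stk

# Undo:
adb shell pm enable com.android.stk
```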

Edit: "We plan to add the ability to restrict the capabilities of SIM Toolkit as an attack surface reduction measure. (2022)"[2] and open issue[3].

[1] https://wladimir-tm4pda.github.io/porting/stk.html

[2] https://discuss.grapheneos.org/d/1492-blocking-sim-toolkit-m...

[3] https://github.com/GrapheneOS/os-issue-tracker/issues/875


Thanks for the info!


Like a popup how? What kind of dialog is it? It's more likely to be an app bundled by your carrier than your carrier MitM'ing ads into your traffic, which is kinda what it sounded like.


Just a message popup: a window with a dark background and a text ad on it.

I did not buy this phone from a carrier, just added the SIM card later.

Really surprised to learn this doesn't happen to others. Always assumed that the SIM card had some special privilege given by Android.


Sounds like your carrier is abusing STK to display ads.

See https://www.browserstack.com/guide/stop-popup-messages-in-an...

Caveat: if they're doing that, then they're almost certainly mining your data streams (e.g. DNS lookups).

I wouldn't feel secure on such a carrier unless I also VPN'd traffic to a reputable provider (Nord, Express, or Proton) and forced DNS over TLS to known servers.
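For the DNS over TLS part, Android's Private DNS can also be forced from a shell. A sketch assuming ADB access, with a placeholder resolver hostname:

```shell
# Force DNS over TLS (Private DNS) to a specific resolver
# (replace dns.example.org with your provider's hostname).
adb shell settings put global private_dns_mode hostname
adb shell settings put global private_dns_specifier dns.example.org

# Revert to the default opportunistic mode:
adb shell settings put global private_dns_mode opportunistic
```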


SIM cards can come with apps preloaded. There was a carrier in Mexico that would load a SIM app for Dominos Pizza and you could order a pizza from your phone if you were on that carrier. I learned this because of some carrier certification feedback I had to disposition at one job.


Can't you just change your carrier?


I would rather have a phone that doesn't let my carrier show random messages whenever they feel like it.


> This is the first model that has really broken into the anglosphere.

Before Step 3.5 Flash, I'd been hearing a lot about ACEStep as the only open-weights competitor to Suno.


They seem to be the same company that released the ACEStep music generation model: https://acestep.io/

Though the only mention I found was in ComfyUI docs: https://docs.comfy.org/tutorials/audio/ace-step/ace-step-v1


This is probably one of the most underrated LLM releases of the past few months. In my local testing with a 4-bit quant (https://huggingface.co/ubergarm/Step-3.5-Flash-GGUF/tree/mai...), it surpasses every other LLM I was able to run locally, including Minimax 2.5 and GLM-4.7 (though I could only run GLM with a 2-bit quant). Some highlights:

- Very context efficient: SWA by default; on a 128G Mac I can run the full 256k context or two 128k context streams.

- Good speeds on Macs. On my M1 Ultra I get 36 t/s tg and 300 t/s pp, and these speeds degrade very slowly as context increases: at 100k prefill it still does 20 t/s tg and 129 t/s pp.

- Trained for agentic coding. I think it is trained to be compatible with Claude Code, but it works fine with other CLI harnesses except Codex (whose patch edit tool can confuse it).
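For reference, running it with llama.cpp might look something like this (the model filename and port are made up; `-c` is the total context size in tokens and `-np` splits it across parallel slots):

```shell
# Full 256k context in a single stream (hypothetical file name):
llama-server -m Step-3.5-Flash-Q4.gguf -c 262144 --jinja --port 8080

# Or two parallel 128k streams (-c is divided across the -np slots):
llama-server -m Step-3.5-Flash-Q4.gguf -c 262144 -np 2 --jinja --port 8080
```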

This is the first local LLM in the 200B parameter range that I find to be usable with a CLI harness. Been using it a lot with pi.dev and it has been the best experience I had with a local LLM doing agentic coding.

There are a few drawbacks though:

- It can generate some very long reasoning chains.

- Current release has a bug where sometimes it goes into an infinite reasoning loop: https://github.com/ggml-org/llama.cpp/pull/19283#issuecommen...

Hopefully StepFun will do a new release which addresses these issues.

BTW, StepFun seems to be the same company that released ACEStep (a very good music generation model); at least StepFun is mentioned in the ComfyUI docs: https://docs.comfy.org/tutorials/audio/ace-step/ace-step-v1


Have you tried Qwen3 Coder Next? I've been testing it with OpenCode and it seems to work fairly well with the harness. It occasionally calls tools improperly but with Qwen's suggested temperature=1 it doesn't seem to get stuck. It also spends a reasonable amount of time trying to do work.

I had tried Nemotron 3 Nano with OpenCode, and while it kinda worked, its tool use was seriously lacking because it just leans on the shell tool for most things. For example, instead of using a tool to edit a file it would just use the shell tool and run sed on it.

That's the primary issue I've noticed with the agentic open weight models in my limited testing. They just seem hesitant to call tools that they don't recognize unless explicitly instructed to do so.


I did play with Qwen3 Coder Next a bit, but didn't try it in a coding harness. Will give it a shot later.


Curious how (or if) changes to the inference engine can fix the issue with infinitely long reasoning loops.

It’s my layman understanding that this would have to be fixed in the model weights themselves?


There's an AMA happening on reddit and they said it will be fixed in the next release: https://www.reddit.com/r/LocalLLaMA/comments/1r8snay/ama_wit...


I think there are multiple ways these infinite loops can occur. It can be an inference engine bug because the engine doesn't recognize the specific format of tags/tokens the model generates to delineate the different types of tokens (thinking, tool calling, regular text). So the model might generate a "I'm done thinking" indicator but the engine ignores it and just keeps generating more "thinking" tokens.

It can also be a bug in the model weights because the model is just failing to generate the appropriate "I'm done thinking" indicator.

You can see this described in this PR https://github.com/ggml-org/llama.cpp/pull/19635

Apparently Step 3.5 Flash uses an odd format for its tags so llama.cpp just doesn't handle it correctly.


> so llama.cpp just doesn't handle it correctly.

It is a bug in the model weights and reproducible in their official chat UI. More details here: https://github.com/ggml-org/llama.cpp/pull/19283#issuecommen...


I see. It seems the looping is a bug in the model weights but there are bugs in detecting various outputs as identified in the PR I linked.


Is getting something like M3 Ultra with 512GB ram and doing oss models going to be cheaper for the next year or two compared to paying for claude / codex?

Did anyone do this kind of math?


No, it is not cheaper. An M3 ultra with 512GB costs $10k which would give you 50 months of Claude or Codex pro plans.

However, if you check the prices on Chinese models (which are the only ones you would be able to run on a Mac), they are much cheaper than the US plans; it would take you forever to get to the $10k.

And of course this is not even considering the energy cost of running inference on your own hardware (though Macs should be quite efficient there).
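The break-even arithmetic, using the figures quoted in this subthread (prices not verified):

```shell
# A $10,000 M3 Ultra vs. a $200/month plan, and vs. a $200/year plan:
echo "$((10000 / 200)) months at \$200/month"
echo "$((10000 / 200)) years at \$200/year"
```

Whether that $200 is a monthly or yearly figure obviously changes the conclusion by a factor of 12.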


> No, it is not cheaper. An M3 ultra with 512GB costs $10k which would give you 50 months of Claude or Codex pro plans.

Claude Pro costs $200/year, so you'd get 50 years of subscription not 50 months


Did you try an MLX version of this model? In theory it should run a bit faster. I'm hesitant to download multiple versions though.


Haven't tried. I'm too used to llama.cpp at this point to switch to something else. I like being able to just run a model and automatically get:

- OpenAI completions endpoint

- Anthropic messages endpoint

- OpenAI responses endpoint

- A slick looking web UI

Without having to install anything else.
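For example, a quick sanity check against a locally running server (assuming the default port 8080; the path is the OpenAI-compatible chat completions endpoint):

```shell
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}]}'
```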


Is there a reliable way to run MLX models? On my M1 Max, LM Studio seems to output garbage through the API server sometimes even when the LM Studio chat with the same model is perfectly fine. llama.cpp variants generally always just work.


gpt-oss 120b and even 20b works OK with codex.


Both gpt-oss models are great for coding in a single turn, but I feel that they forget context too easily.

For example, when I tried gpt-oss 120b with codex, it very easily forgets something present in the system prompt: "use `rg` command to search and list files".

I feel like gpt-oss has a lot of potential for agentic coding, but it needs to be constantly reminded of what is happening. Maybe a custom harness developed specifically for gpt-oss could make both models viable for long agentic coding sessions.


> it very easily forgets something present in the system prompt: "use `rg` command to search and list files".

Maybe because that doesn't make sense? ripgrep is for finding text inside files, not a replacement for find or ls.


At this point I wouldn't be surprised if your pelican example has leaked into most training datasets.

I suggest starting to use a new SVG challenge, hopefully one that makes even Gemini 3 Deep Think fail ;D


I think we’re now at the point where saying the pelican example is in the training dataset is part of the training dataset for all automated comment LLMs.


It's quite amusing to ask LLMs what the pelican example is and watch them hallucinate a plausible sounding answer.

---

Qwen 3.5: "A user asks an LLM a question about a fictional or obscure fact involving a pelican, often phrased confidently to test if the model will invent an answer rather than admitting ignorance." <- How meta

Opus 4.6: "Will a pelican fit inside a Honda Civic?"

GPT 5.2: "Write a limerick (or haiku) about a pelican."

Gemini 3 Pro: "A man and a pelican are flying in a plane. The plane crashes. Who survives?"

Minimax M2.5: "A pelican is 11 inches tall and has a wingspan of 6 feet. What is the area of the pelican in square inches?"

GLM 5: "A pelican has four legs. How many legs does a pelican have?"

Kimi K2.5: "A photograph of a pelican standing on the..."

---

I agree with Qwen, this seems like a very cool benchmark for hallucinations.


I'm guessing it has the opposite problem of typical benchmarks, since there is no ground-truth pelican-bike SVG to overfit on. Instead the model just has a corpus of shitty pelicans on bikes made by other LLMs that it's mimicking.

So we might have an outer alignment failure.


Most people seem to have this reflexive belief that "AI training" is "copy+paste data from the internet onto a massive bank of hard drives"

So if there is a single good "pelican on a bike" image on the internet or even just created by the lab and thrown on The Model Hard Drive, the model will make a perfect pelican bike svg.

The reality of course, is that the high water mark has risen as the models improve, and that has naturally lifted the boat of "SVG Generation" along with it.


How would that work? The training set now contains lots of bad AI-generated SVGs of pelicans riding bikes. If anything, the data is being poisoned.


Would love to see a Qwen 3.5 release in the range of 80-110B which would be perfect for 128GB devices. While Qwen3-Next is 80b, it unfortunately doesn't have a vision encoder.


Have you thought about getting a second 128GB device? Open weights models are rapidly increasing in size, unfortunately.


Considered getting a 512G Mac Studio, but I don't like Apple devices due to the closed software stack. I would never have gotten this Mac Studio if Strix Halo had existed in mid-2024.

For now I will just wait for AMD or Intel to release a x86 platform with 256G of unified memory, which would allow me to run larger models and stick to Linux as the inference platform.


I aspire to casually ponder whether I need a $9,500 computer to run the latest Qwen model


You'll need more since RAM prices are up thanks to AI.


Given the shortage of wafers, the wait might be long. I am, however, working on a bridging solution: some have already shown Strix Halo clustering, and I am working on something similar but with a pp (prompt processing) boost.

Unfortunately, AMD dumped a great device with an unfinished software stack, and the community is rolling with it, compared to the DGX Spark, which I think is more cluster-friendly.


Why 128GB?

At 80B, you could do 2 A6000s.

What device is 128GB?


AMD Strix Halo / Ryzen AI Max+ (in the Asus Flow Z13 13 inch "gaming" tablet as well as the Framework Desktop) has 128 GB of shared APU memory.


Not quite. They have 128GB of RAM, of which up to 96GB can be allocated to the GPU in the BIOS.


You don't have to statically allocate the VRAM in the BIOS. It can be dynamically allocated. Jeff Geerling found you can reliably use up to 108 GB [1].

[1]: https://www.jeffgeerling.com/blog/2025/increasing-vram-alloc...


Allocation is irrelevant. As an owner of one of these, you can absolutely use the full 128GB (minus OS overhead) for inference workloads.


Care to go into a bit more on machine specs? I am interested in picking up a rig to do some LLM stuff and not sure where to get started. I also just need a new machine, mine is 8y-o (with some gaming gpu upgrades) at this point and It's That Time Again. No biggie tho, just curious what a good modern machine might look like.


Those Ryzen AI Max+ 395 systems are all more or less the same. For inference you want the one with 128GB of soldered RAM. There are ones from Framework, Gmktec, Minisforum, etc. Gmktec used to be the cheapest, but with the rising RAM prices it's Framework now, I think. You can't really upgrade/configure them. For benchmarks look into r/localllama; there are plenty.


Minisforum and Gmktec also have Ryzen AI HX 370 mini PCs with 128GB (2x64GB) max LPDDR5. It's dirt cheap: you can get one barebone for ~€750 on Amazon (the 395 similarly retails for ~€1k). It should be fully supported in Ubuntu 25.04 or 25.10 with ROCm for iGPU inference (the NPU isn't available ATM, AFAIK), which is what I'd use it for. But I just don't know how the HX 370 compares to e.g. the 395, iGPU-wise. I was thinking of getting one to run Lemonade and Qwen3-coder-next FP8, BTW, but I don't know how much RAM I should equip it with; shouldn't 96GB be enough? Suggestions welcome!


I benchmarked unsloth/Qwen3-Coder-Next-GGUF using the MXFP4_MOE (43.7 GB) quantization on my Ryzen AI Max+ 395 and I got ~30 tps. According to [1] and [2], the AI Max+ 395 is 2.4x faster than the AI 9 HX 370 (laptop edition). Taking all that into account, the AI 9 HX 370 should get ~13 tps on this model. Make of that what you will.

[1]: https://community.frame.work/t/ai-9-hx-370-vs-ai-max-395/736...

[2]: https://community.frame.work/t/tracking-will-the-ai-max-395-...
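That estimate is just the measured number scaled down by the quoted ratio:

```shell
# 30 tps on the AI Max+ 395 divided by the ~2.4x advantage from the links above:
awk 'BEGIN { printf "%.1f tps\n", 30 / 2.4 }'   # prints "12.5 tps", i.e. roughly 13
```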


Thanks! I'm... unimpressed.


The Ryzen 370 lacks the quad channel RAM. Stay away.


Ryzen AI HX 370 is not what you want, you need strix halo APU with unified memory


maxed out Framework Desktop


Keep in mind most of the Strix Halo machines are limited to 10GbE networking at best.


You can use a separate network adapter with RoCEv2/RDMA support, like an Intel E810.


Most Ryzen 395 machines don't have a PCIe slot for that, so you're looking at an extension from an M.2 slot or Thunderbolt (not sure how well that will work; possibly OK at 10Gb). Minisforum has a couple of newly announced products, and I think the Framework Desktop's motherboard can do it if you put it in a different case, but that's about it. Hopefully the next generation has Gen5 PCIe and a few more lanes.


DGX Spark and any A10 devices, Strix Halo with the max memory config, several Mac mini/Mac Studio configs, the HP ZBook Ultra G1a, most servers.

If you're targeting end user devices then a more reasonable target is 20GB VRAM since there are quite a lot of gpu/ram/APU combinations in that range. (orders of magnitude more than 128GB).


By A6000, do you mean the older Ampere generation model? 48 GB GDDR6, released 2020 [1]. Can you even buy those new still?

[1] https://www.techpowerup.com/gpu-specs/rtx-a6000.c3686


That's the maximum you can get for $3k-$4k with the Ryzen AI Max+ 395 and the Apple Mac Studio M-series machines. They're far cheaper than dedicated GPUs.


Mac Studios or Strix Halo. GPT-OSS 120b, Qwen3-Next, Step 3.5-Flash all work great on a M1 Ultra.


All the GB10-based devices -- DGX Spark, Dell Pro Max, etc.


Guess it's the Mac M series.


Maybe a DeepSeek V4 distill. Give it a few days.


Love the idea of keeping the agent filesystem in a single file!


These days I don't feel the need to use anything other than llama.cpp server as it has a pretty good web UI and router mode for switching models.


MLX support on Macs was the main reason for me.


I mostly use LM Studio for browsing and downloading models, testing them out quickly, but then actually integrating them is always with either llama.cpp or vLLM. Curious to try out their new cli though and see if it adds any extra benefits on top of llama.cpp.


Concurrency is an important use case when running multiple agents. vLLM can squeeze performance out of your GB10 or GPU that you wouldn't get otherwise.


Also, they've just spent more time optimizing vLLM than the llama.cpp people have, even when you run just one inference call at a time. The best features are obviously the concurrency and shared cache, though. On the other hand, new architectures are usually available in llama.cpp sooner than in vLLM.

Both have their places and are complementary, rather than competitors :)


I'm only interested in the local, single user use case. Plus I use a Mac studio for inference, so vLLM is not an option for me.


You can get concurrency gains [0] as local/single user (multi-agent) use case with vLLM with your Mac Studio.

[0] https://youtu.be/Ze5XLooTt6g?t=658


AFAIK MPS cannot be used on Asahi, so it has to be done using Vulkan, which will definitely be much slower.


A VM is displayed as a window on the host OS and Emacs is the window manager within that VM window. What's the difference from running emacs directly as an application on the host?

