More

behohippy · 2025-08-11T15:42:43 1754926963

Sure, all the slop code projects I produce get MIT licensed on public repos. It wasn't mine to begin with, so I wouldn't prevent anyone from using it.

behohippy · 2025-08-08T01:08:44 1754615324

Used 3090s have been getting expensive in some markets. Another option is dual 5060ti 16 gig. Mine are lower powered, single 8 pin power, so they max out around 180W. With that I'm getting 80t/s on the new qwen 3 30b a3b models, and around 21t/s on Gemma 27b with vision. Cheap and cheerful setup if you can find the cards at MSRP.

KronisLV · 2025-08-08T16:48:35 1754671715

For comparison, at work we got a pair of Nvidia L4 GPUs: https://www.techpowerup.com/gpu-specs/l4.c4091

That gives us a total TDP of around 150W, 48 GB of VRAM and we can run Qwen 3 Coder 30B A3B at 4bit quantization with up to 32k context at around 60-70 t/s with Ollama. I also tried out vLLM, but the performance surprisingly wasn't much better (maybe under bigger concurrent load). Felt like sharing the data point, because of similarity.

Honestly it's a really good model, even good enough for some basic agentic use (e.g. with Aider, RooCode and so on), MoE seems the way to go for somewhat limited hardware setups.

Ofc obviously not recommending L4 cards cause they have a pretty steep price tag. Most consumer cards feel a bit power hungry and you'll probably need more than one to fit decent models in there, though also being able to game with the same hardware sounds pretty nice. But speaking of getting more VRAM, the Intel Arc Pro B60 can't come soon enough (if they don't insanely overprice it), especially the 48 GB variety: https://www.maxsun.com/products/intel-arc-pro-b60-dual-48g-t...

behohippy · 2025-08-11T15:44:27 1754927067

Yeah 48g, sub 200W seems like a sweet spot for a single card setup. Then you can stack as deep as you want to get the size of model you want for whatever you want to pay for the power bill.

codazoda · 2025-08-11T16:13:24 1754928804

I've hatched a plan to build a light-weight AI model on a $149 mini-pc and host it from my bedroom.

I wonder if I could follow that up by buying a 3090 (jumping the price by $1000 plus whatever I plug it into) and contrasting the difference. Could be an eye opening experiment for me.

Here's the write up of my plan for the cheap machine if anyone is interested.

https://joeldare.com/my_plan_to_build_an_ai_chat_bot_in_my_b...

behohippy · 2025-05-28T18:31:13 1748457073

About 768 gigs of ddr5 RAM in a dual socket server board with 12 channel memory and an extra 16 gig or better GPU for prompt processing. It's a few grand just to run this thing at 8-10 tokens/s

wongarsu · 2025-05-28T22:58:18 1748473098

About $8000 plus the GPU. Let's throw in a 4080 for about $1k, and you have the full setup for the price of 3 RTX5090. Or cheaper than a single A100. That's not a bad deal.

For the hobby version you would presumably buy a used server and a used GPU. DDR4 ECC Ram can be had for a little over $1/GB, so you could probably build the whole thing for around $2k

JKCalhoun · 2025-05-29T01:37:22 1748482642

Been putting together a "mining rig" [1] (or rather I was before the tariffs, ha ha.) Going to try to add a 2nd GPU soon. (And I should try these quantized versions.)

Mobo was some kind of mining rig from AliExpress for less than $100. GPU is an inexpensive NVIDIA TESLA card that I 3D printed a shroud for (added fans). Power supply a cheap 2000 Watt Dell server PS off eBay....

[1] https://bsky.app/profile/engineersneedart.com/post/3lmg4kiz4...

phonon · 2025-05-28T23:33:50 1748475230

This is the state of the art for such a setup. Really good performance!

https://github.com/kvcache-ai/ktransformers

behohippy · 2025-03-27T11:37:49 1743075469

These articles are gold, thank you. I used your gemma one from a few weeks back to get gemma 3 performing properly. I know you guys are all GPU but do you do any testing on CPU/GPU mixes? I'd like to see the pp and t/s on pure 12 channel epyc and the same with using a 24 gig gpu to accelerate the pp.

danielhanchen · 2025-03-27T22:10:55 1743113455

Oh fantastic! Oh for MoEs like DeepSeek, technically GPUs aren't that necessary! I actually tested on 1x H100 I think it was 30 layers offloaded, and the other 30 are on CPU - it wasn't that bad at all!

behohippy · 2025-02-11T12:38:57 1739277537

You probably won't be running fp16 anything locally. We typically run Q5 or Q6 quants to maximize the size of the model and context length we can run with the VRAM we have available. The quality loss is negligable at Q6.

Eisenstein · 2025-02-11T14:20:25 1739283625

But the inference doesn't necessarily run at the quant precision.

wkat4242 · 2025-02-11T22:32:11 1739313131

As far as I understand it does if you quantify the K/V store as well (the context). And that's pretty standard now because it can increase maximum context size a lot.

Eisenstein · 2025-02-11T23:46:00 1739317560

It is available in most inference engines, but I wouldn't call it in standard use, as it can degrade quality tremendously.

wkat4242 · 2025-02-12T01:36:03 1739324163

Even at q8_0? I thought it wasn't bad just like the models itself. But very interested to hear.

And q8_0 already halves the memory usage compared to fp16.

One of the ollama Devs called the quality impact negligible at q8_0: https://smcleod.net/2024/12/bringing-k/v-context-quantisatio...

But perhaps quantifying the KV cache does not scale as gracefully as the model itself?

Eisenstein · 2025-02-12T02:20:50 1739326850

It highly depends on the model and the context use. A model like command-r for instance is practically unaffected by it, but Qwen will go nuts. As well, tasks highly dependent on context like translation or evaluation will be more impacted than say, code generation or creative output.

behohippy · 2025-02-12T10:49:22 1739357362

Qwen is a little fussy about the sampler settings, but it does run well quantized. If you were getting infinite repetition loops, try dropping the top_p a bit. I think qwen likes lower temps too

Eisenstein · 2025-02-12T14:05:28 1739369128

We are talking about dynamically quantizing KV cache, not the model weights.

behohippy · 2025-02-13T18:37:16 1739471836

I run the KV cache at Q8 even on that model. Is it not working well for you?

wkat4242 · 2025-02-12T03:21:03 1739330463

Interesting. I didn't know that. I thought it was basically 'free' space saving. Would you know how llama3.1 fares by any chance?

behohippy · on Jan 21, 2025

I have a mini PC with an n100 CPU connected to a small 7" monitor sitting on my desk, under the regular PC. I have llama 3b (q4) generating endless stories in different genres and styles. It's fun to glance over at it and read whatever it's in the middle of making. I gave llama.cpp one CPU core and it generates slow enough to just read at a normal pace, and the CPU fans don't go nuts. Totally not productive or really useful but I like it.

ipython · on Jan 21, 2025

That's neat. I just tried something similar:

    FORTUNE=$(fortune) && echo $FORTUNE && echo "Convert the following output of the Unix `fortune` command into a small screenplay in the style of Shakespeare: \n\n $FORTUNE" | ollama run phi4

watermelon0 · on Jan 22, 2025

Doesn't `fortune` inside double quotes execute the command in bash? You should use single quotes instead of backticks.

Uehreka · on Jan 21, 2025

Do you find that it actually generates varied and diverse stories? Or does it just fall into the same 3 grooves?

Last week I tried to get an LLM (one of the recent Llama models running through Groq, it was 70B I believe) to produce randomly generated prompts in a variety of styles and it kept producing cyberpunk scifi stuff. When I told it to stop doing cyberpunk scifi stuff it went completely to wild west.

o11c · on Jan 21, 2025

You should not ever expect an LLM to actually do what you want without handholding, and randomness in particular is one of the places it fails badly. This is probably fundamental.

That said, this is also not helped by the fact that all of the default interfaces lack many essential features, so you have to build the interface yourself. Neither "clear the context on every attempt" nor "reuse the context repeatedly" will give good results, but having one context producing just one-line summaries, then fresh contexts expanding each one will do slightly less badly.

(If you actually want the LLM to do something useful, there are many more things that need to be added beyond this)

dotancohen · on Jan 21, 2025

Sounds to me like you might want to reduce the Top P - that will prevent the really unlikely next tokens from ever being selected, while still providing nice randomness in the remaining next tokens so you continue to get diverse stories.

coder543 · on Jan 22, 2025

Someone mentioned generating millions of (very short) stories with an LLM a few weeks ago: https://news.ycombinator.com/item?id=42577644

They linked to an interactive explorer that nicely shows the diversity of the dataset, and the HF repo links to the GitHub repo that has the code that generated the stories: https://github.com/lennart-finke/simple_stories_generate

So, it seems there are ways to get varied stories.

fi-le · on Jan 27, 2025

I was wondering where the traffic came from, thanks for mentioning it!

TMWNN · on Jan 22, 2025

> Do you find that it actually generates varied and diverse stories? Or does it just fall into the same 3 grooves?

> Last week I tried to get an LLM (one of the recent Llama models running through Groq, it was 70B I believe) to produce randomly generated prompts in a variety of styles and it kept producing cyberpunk scifi stuff.

100% relevant: "Someday" <https://en.wikipedia.org/wiki/Someday_(short_story)> by Isaac Asimov, 1956

janalsncm · on Jan 21, 2025

Generate a list of 5000 possible topics you’d like it to talk about. Randomly pick one and inject that into your prompt.

jaggs · on Jan 22, 2025

https://old.reddit.com/r/LocalLLaMA/comments/1i615u1/the_fir...

behohippy · on Jan 22, 2025

It's a 3b model so the creativity is pretty limited. What helped for me was prompting for specific stories in specific styles. I have a python script that randomizes the prompt and the writing style, including asking for specific author styles.

greenavocado · on Jan 22, 2025

Set temperature to 1.0

keeganpoppen · on Jan 21, 2025

oh wow that is actually such a brilliant little use case-- really cuts to the core of the real "magic" of ai: that it can just keep running continuously. it never gets tired, and never gets tired of thinking.

Dansvidania · on Jan 21, 2025

this sounds pretty cool, do you have any video/media of it?

behohippy · on Jan 22, 2025

I don't have a video but here's a pic of the output: https://imgur.com/ip8GWIh

sky2224 · on Jan 23, 2025

The next step is to format it so it looks like an endless starwars intro.

bithavoc · on Jan 21, 2025

this is so cool, any chance you post a video?

behohippy · on Jan 22, 2025

Just this pic: https://imgur.com/ip8GWIh

droideqa · on Jan 22, 2025

That's awesome!

behohippy · on April 23, 2024

I had this same issue with incomplete answers on longer summarization tasks. If you ask it to "go on" it will produce a better completion, but I haven't seen this behaviour in any other model.

Grimblewald · on April 24, 2024

Neither, still, the answers it does provide - despite a few hiccups - are truly outstanding. I am really impressed with this model, even with its issues. Though, I am sure the issues such that they are, are a month or two away from being fixed. For what its worth, I haven't played as much with the bigger model, but it seems to not struggle with the same, though take that with a grain of salt, it runs too slow on my hardware for me to rapidly test things.

behohippy · on Jan 6, 2024

It's probably an evolution of the phi-1/1.5 "Textbooks are all you Need" training method: https://arxiv.org/abs/2309.05463

alecco · on Jan 6, 2024

Yes. And the cost of these synthetic datasets is very high. Nobody is sharing. I suspect people are underestimating the amount of hardware OpenAI/Microsoft are using to build massive amounts of synthetic data. I doubt they are just training models over and over with the common crawls and such.

visarga · on Jan 6, 2024

> the cost of these synthetic datasets is very high. Nobody is sharing

There are plenty of synthetic datasets generated from GPT-4 and other models^[1]. But MS created a large one, 150B tokens. Still 2 orders of magnitude smaller than the 13T used to train GPT-4.

But in the future this will be the main way to improve models - put them to work, and filter their good stuff. Then retrain. Very expensive, but that is the cost of evolution. It took humans a very long time to create the culture and technology that underlies LLMs, it will take a similar effort to push them forward.

Human generated text was the low hanging fruit, but now that it's picked, synthetic data is the only way forward. Models generating their own experience and feedback, doing exploration, combinatorial search, learning from their interactions with humans, from games, experiments and simulations.

But if we're talking about synthetic data - then the elephant in the room is the chat logs of OpenAI. They got 180M users, assume 10K tokens/user/month, that would be 1.8B tokens per month, mostly AI written but interspersed with human replies and tool generated output. This means they can collect in less than a year about as much synthetic data as the original training set.

What if they train GPT-5 solely on synthetic data? That would simplify the copyright issues a lot, and give a 5x boost in efficiency.

[1] https://github.com/Zjh-819/LLMDataHub

pk-protect-ai · on Jan 6, 2024

Nobody underestimates it. It is clear that this stuff is not cheap. However, all publications without datasets are garbage because you can't replicate them. Why publish at all? It's just noise.

orand · on Jan 7, 2024

All world-class scientists who don't cite every book they've ever read or teacher they've ever had are garbage because you can't replicate them. Why be born at all? They're just noise.

pk-protect-ai · on Jan 12, 2024

It is not the same. If you can't replicate, you can't verify. There is a difference between what you can infer from the provided information and what you can prove. Replication is a cornerstone of scientific experimentation. Thus, the argument you are using here is bullshit.

behohippy · on Nov 30, 2023

Top_p and top_k are pretty important concepts for LLMs same as temperature so P,K,C and F are underutilized

TMWNN · on Dec 1, 2023

> top_k

topkek

behohippy · on Dec 1, 2023

No joke, that would be an awesome LLM project name!

behohippy · on June 18, 2023

Hey emad, thanks for SD and this! What's the plan if Meta does Apache 2.0 for LLaMA? Just keep going and making the 30b and 65b or build different models?

emadm · on June 18, 2023

Had a nice chat with Yann last week, we will release complementary stuffs.

I don't think 30b and 65b are useful given what we do, the key is optimising models for consumer & swarming them.

As for SD.. Maybe try the bot on the discord server testing the new version: https://discord.com/invite/stablediffusion

cypress66 · on June 19, 2023

30b 4bit is definitely useful because it runs on 3090s/4090s. You see a lot of people running 30b models.

The jump in quality is very significant as well.

65b is definitely a lot less common.