This blog post describes the basic work of a research engineer and nothing more. The amount of surprise the author has seems to suggest they haven't really worked in ML for very long.
Honestly? This is the best it's ever been. Getting stuff to run before Hugging Face and uv and Docker containers with CUDA was way worse. Even with full open source, go try to run a 3+ year old model and codebase. The field just moves very fast.
The paper you're talking about is "Deal or No Deal? End-to-End Learning for Negotiation Dialogues" and it was just AIs drifting away from English. The crazy news article was from Forbes with the title "AI invents its own language so Facebook had to shut it down!" before they changed it after backlash.
Friendly reminder that articles like this are not written by Forbes staff but are published directly by the author with little to no oversight by Forbes. It's basically a blog running on the forbes.com domain. I'm sure there are many great contributors to Forbes; I'm just saying that without editorial oversight, the domain it was published on is by definition meaningless. I see people all the time saying something like, "It was on Forbes, it must be true!" They wouldn't be saying that if it had been published on Substack or Wordpress.com.
Expert difficulty is also recognizing that articles from "serious" publications like The New York Times can also be misleading or outright incorrect, sometimes obviously so, as with some Bloomberg content over the last few years.
Another GitHub PM here. Thanks for the feedback! We're currently working on adding a way to restrict PR creation to collaborators only. We've also heard some feedback around evaluating PRs against contributing guidelines, which would allow maintainers to clearly define criteria that PRs must meet, so we're exploring that option as well.
> the generation of 281,128 augmented examples, from which 1,000 were
held out as a benchmark test set.
This model is trained on a custom dataset of 280k examples, then tested on 1k very similar examples from the same dataset. Of course it is specialized to outperform general models on this specific task, in this specific domain, with this specific JSON output format.
This is a reasonable hobby project and interesting approach to synthetic data generation but not impressive research.
At minimum you should test your model on other benchmarks with similar tasks, e.g. DocBench.
It's not novel research, but I think it drives home the point that many narrow applications of AI do not require the largest, latest (and most expensive) models. And in many of those cases, a small fine-tuned model is the most performant and cost-effective.
It is probably obvious to most who follow the space closely, but you'd be surprised how many engineers don't recognize this.
Well, one day it might be at the level of shell scripting. I don't think about "the tradeoffs of building a specialized shell script", I just do it because it's cheap and easy and solves a problem right then and there.
I don't know how you would even begin to make the same kind of observation for ML models, but it seems possible. The 2010s weren't exactly about building "trivial" models, but compared to the architectures and optimizations out now, yeah, those models are toys by comparison.
Yes! Check out https://distillabs.ai/, which follows a similar approach, except the evaluation set is held out before the synthetic data generation, which I would argue makes it more robust (I'm affiliated).
> Of course, it is specialized to outperform general models on this specific task in this specific domain with this specific json format for output.
My understanding is that this is generally not considered an obvious result, in that high-parameter generalist models largely outperform lower-parameter specialists.
The real issue is that they tested on data in their training set.
Yes, but because it is derived from the same underlying source dataset, it is effectively evaluating on the training dataset, not an independent validation/test dataset.
The difference is subtle but important. If we expect the model to truly outperform a general model, it should generalize to a completely independent set.
They synthetically generated ~281k examples and kept 1k of them for testing.
It's worth pointing out that that's technically not testing on the training set, but looking at how similar examples are in the dataset, it's clear that severe overfitting would be unavoidable. That also makes the headline very misleading.
The weights may not have been published because using the model for document extraction on even the same format, but with slightly different content or lengths, would show how abysmally this finetune performs outside of the synthetic data.
> All examples are already correlated because they are generated in the same way.
All examples of “document information extraction” would be correlated no matter where they come from because they all would be “document information extraction” examples…
The real question is whether or not the examples are representative of the broad “document information extraction” use-case.
The problem is the methodology they use to hold them out. For a truly independent validation set, they need to hold out the material before augmentation, not after.
If you hold out after augmentation, then biases from the generation process leak into the test set, and you artificially boost your model's measured performance. This is not sufficient to demonstrate that your model generalizes properly.
In analogy: instead of taking leaves off of different trees, they are taking leaves from different branches from the same tree.
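In code, the difference between the two protocols is just where the split happens. A minimal sketch, where `augment` stands in for whatever synthetic-generation step is used (the function names and the 3-variants-per-document assumption are mine, for illustration):

```python
import random

def split_then_augment(source_docs, augment, n_test=100, seed=0):
    """Hold out *source* documents before augmentation, so no augmented
    variant of a test document ever appears in training."""
    docs = list(source_docs)
    random.Random(seed).shuffle(docs)
    test_docs, train_docs = docs[:n_test], docs[n_test:]
    train_set = [ex for doc in train_docs for ex in augment(doc)]
    test_set = [ex for doc in test_docs for ex in augment(doc)]
    return train_set, test_set

def augment_then_split(source_docs, augment, n_test=1000, seed=0):
    """The flawed alternative: augment everything, then split. Augmented
    copies of the same source document land on both sides of the split."""
    examples = [ex for doc in source_docs for ex in augment(doc)]
    random.Random(seed).shuffle(examples)
    return examples[n_test:], examples[:n_test]
```

With `split_then_augment`, the sets of source documents behind the train and test examples are disjoint by construction; with `augment_then_split`, siblings of almost every test example sit in the training set.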
That would definitely make the evaluation more robust. My fear is that with LLMs at hand people became allergic to preparing good human-labelled evaluation sets and would always to some degree use an LLM as a crutch.
Haha that's crazy I'm so used to reading RL papers that when the blog linked to a textbook about RL I just filled in Sutton & Barto without clicking on the link or thinking any further about the matter.
I think the other criticism I have is that the historical importance of RLHF to ChatGPT is sort of sidelined: the author pinpoints something like the rise of agents in 2023-2024 as the beginning of RL's influence on language modelling. In fact, the first LLM to attain widespread success was ChatGPT, and its secret sauce was RLHF; there's no need to start the story so late.
I think it's pretty obvious it's 1. Given the recent huge, clearly politically-motivated cuts from the current administration, it feels pretty likely that FOIA could be disrupted under the guise of "cost-saving".
And I think you're supposed to be generous to the commenter, not the current administration ;)
I love that JabRef supports working with multiple libraries (having several open at the same time, moving entries between them). The best Zotero could do was restart with different preference files (has that changed? I haven't used it in some time).
And I really like that JabRef syncing requires just syncing the library folder. Zotero syncing really nudges you toward the paid plan; setting up WebDAV just isn't as simple, and the list of supported providers isn't that long.
It really helped me that the backend is a plain BibTeX file. I could resolve issues with it myself, and I can also version libraries with git.
> IR temperature sensor for checking your body temperature or stuff you baking in the oven
> tiny thermal camera sensor for inspecting leaks in house for the winter
So, just a thermometer gun? Those cost like $20-30 on Amazon, and I've never needed one outside my home/kitchen. Why in the world would you want this in a phone, haha.
If they are producing and selling it on Amazon, that means someone is buying it, even if you don't need it. A body temperature check would definitely be handy. Those sensors definitely don't cost $20-30: I had a CC1350 SensorTag, and it already had one at a retail price of around ~$35 (but with 10 different sensors inside altogether, and that was bought 10 years ago).
They also sell smart outlets, back massagers, and garden sprinklers on Amazon. That doesn't imply people would find them handy in their phone.
I think it'd be an easier pitch in the watch though as that's where they are already shoving most of the health sensors (and have wrist temperature monitoring already).
You can also read it in the voice of Tim Cook's 2015 3D Touch announcement, or Zu announcing the ZTE device with a 3D screen in 2017, or whoever at LG announced the wide-angle lens, got meh-to-bad reviews for it, and then it took off afterward anyway.
My point here is not that it can't ever be something anyone would want; rather, that something selling as another device on Amazon has no weight one way or the other on whether it'd be a good thing to add to a phone.
Good summary of some of the main "theoretical" criticisms of LLMs, but I feel that it's a bit dated and ignores the recent trend of iterative post-training, especially with human feedback. Major chatbots are no doubt being iteratively refined on feedback from users, i.e. interaction feedback, RLHF, RLAIF. So ChatGPT could fall within the sort of "enactive" perspective on language, and it definitely goes beyond the issues of static datasets and data completeness.
Sidenote: the authors make a mistake when citing Wittgenstein to find similarity between humans and LLMs. Language modelling on a static dataset is mostly not a language game (see Bender and Koller's section on distributional semantics and caveats on learning meaning from "control codes")
It does; that's what the "direct preference" part of DPO means. You just avoid training an explicit reward model on it, as in RLHF, and instead directly optimize the log probability of preferred vs. dispreferred responses.
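The core of the DPO objective is small enough to sketch. This is a toy single-pair version with made-up log-probabilities, not any particular library's API:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    logp_* are the policy's sequence log-probs; ref_logp_* come from a
    frozen reference model. No explicit reward model is trained: the
    implicit reward is beta times the log-prob ratio vs. the reference.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)): small when the policy prefers the chosen
    # response more strongly than the reference model does.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The loss shrinks as the policy's preference margin over the reference grows, which is the "directly optimize for preferred vs. dispreferred" part.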
What is it called when humans interact with a model through lengthy exchanges (mostly humans correcting the model's responses to a posed question, mostly through chat, labeling each statement by the model as correct or not), and then all of that text (possibly with some editing) is fed to another model to train it?
I don’t think that process has a specific name. It’s just how training these models works.
Conversations you have with, say, ChatGPT are likely stored, then sorted through somehow, then added to an ever-growing dataset of conversations that is used to train entirely new models.
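A rough sketch of how such logs might be distilled into training pairs. The turn schema (`role`, `text`, `correct`) is entirely made up for illustration; real pipelines surely differ:

```python
import json

def conversations_to_pairs(conversations):
    """Turn labeled chat logs into (prompt, response) training pairs.

    `conversations` is a list of turn lists; each turn is a dict like
    {"role": "user"|"assistant", "text": ..., "correct": bool}.
    Keep only assistant turns a human marked correct, paired with the
    user message immediately before them.
    """
    pairs = []
    for convo in conversations:
        for prev, turn in zip(convo, convo[1:]):
            if (turn["role"] == "assistant" and turn.get("correct")
                    and prev["role"] == "user"):
                pairs.append({"prompt": prev["text"],
                              "response": turn["text"]})
    return pairs

def to_jsonl(pairs):
    """Serialize pairs one-per-line, a common fine-tuning file format."""
    return "\n".join(json.dumps(p) for p in pairs)
```

So the human corrections act as free labels: only the turns the human accepted survive into the dataset the next model is trained on.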