This blog post describes the basic work of a research engineer and nothing more. The amount of surprise the author has seems to suggest they haven't really worked in ML for very long.
Honestly? This is the best it's ever been. Getting stuff to run before Hugging Face and uv and Docker containers with CUDA was way worse. Even with full open source, go try to run a 3+ year old model and codebase. The field just moves very fast.
The paper you're talking about is "Deal or No Deal? End-to-End Learning for Negotiation Dialogues" and it was just AIs drifting away from English. The crazy news article was from Forbes with the title "AI invents its own language so Facebook had to shut it down!" before they changed it after backlash.
Friendly reminder that articles like this are not written by Forbes staff but are published directly by the author with little to no oversight by Forbes. It's basically a blog running on the forbes.com domain. I'm sure there are many great contributors to Forbes; I'm just saying that without editorial oversight, the domain it was published on is by definition meaningless. I see people all the time saying something like, "It was on Forbes, it must be true!" They wouldn't be saying that if it had been published on Substack or Wordpress.com.
Expert difficulty is also recognizing that articles from "serious" publications like The New York Times can also be misleading or outright incorrect, sometimes obviously so, as with some Bloomberg content over the last few years.
Another GitHub PM here. Thanks for the feedback! We're currently working on adding a way to restrict PR creation to collaborators only. We've also heard some feedback around evaluating PRs against contributing guidelines, which would allow maintainers to clearly define criteria that PRs must meet, so we're exploring that option as well.
> the generation of 281,128 augmented examples, from which 1,000 were
held out as a benchmark test set.
This model is trained on a custom dataset of 280k examples, then tested on 1k very similar examples from the same dataset. Of course it is specialized to outperform general models on this specific task, in this specific domain, with this specific JSON output format.
This is a reasonable hobby project and interesting approach to synthetic data generation but not impressive research.
At minimum you should test your model on other benchmarks with similar tasks, e.g. DocBench.
It's not novel research, but I think it drives home the point that many narrow applications of AI do not require the largest, latest (and most expensive) models. And in many of those cases, a small fine-tuned model is the most performant and cost-effective.
It is probably obvious to most who follow the space closely, but you'd be surprised how many engineers don't recognize this.
Well, one day it might be at the level of shell scripting. I don't think about "the tradeoffs of building a specialized shell script", I just do it because it's cheap and easy and solves a problem right then and there.
I don't know how you would even begin to make the same kind of observation for ML models, but it seems possible. The 2010s weren't exactly about building "trivial" models, but compared to the architectures and optimizations out now, yeah, those models are toys by comparison.
Yes! Check out https://distillabs.ai/, which follows a similar approach, except the evaluation set is held out before the synthetic data generation, which I would argue makes it more robust (I'm affiliated).
> Of course, it is specialized to outperform general models on this specific task in this specific domain with this specific json format for output.
My understanding is that this is generally not considered an obvious result, in that high-parameter generalist models largely outperform lower-parameter specialists.
The real issue is that they tested on data in their training set.
Yes, but because it is derived from the same underlying source dataset, it is effectively evaluating on the training dataset, not an independent validation/test dataset.
The difference is subtle but important. If we expect the model to truly outperform a general model, it should generalize to a completely independent set.
They synthetically generated ~281k examples and kept 1k of them for testing.
It's worth pointing out that that's technically not testing on the training set, but looking at how similar examples are in the dataset, it's clear that severe overfitting would be unavoidable. That also makes the headline very misleading.
The weights may not have been published because using the model for document extraction on even the same format, but with slightly different content or lengths, would show how abysmally this finetune performs outside of the synthetic data.
> All examples are already correlated because they are generated in the same way.
All examples of “document information extraction” would be correlated no matter where they come from because they all would be “document information extraction” examples…
The real question is whether or not the examples are representative of the broad “document information extraction” use-case.
The problem is the methodology they use to hold them out. For a truly independent validation set, they need to hold out the material before augmentation, not after.
If you hold out after augmentation, then biases from the generation process leak into the test set, and you artificially boost your model's measured performance. This is not sufficient to demonstrate that your model generalizes properly.
In analogy: instead of taking leaves off of different trees, they are taking leaves from different branches from the same tree.
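In code, the difference between the two protocols is just where the split happens. A minimal sketch, where `augment` stands in for whatever synthetic-generation step is used (the function names and the 3-variants-per-document assumption are mine, for illustration):

```python
import random

def split_then_augment(source_docs, augment, n_test=100, seed=0):
    """Hold out *source* documents before augmentation, so no augmented
    variant of a test document ever appears in training."""
    docs = list(source_docs)
    random.Random(seed).shuffle(docs)
    test_docs, train_docs = docs[:n_test], docs[n_test:]
    train_set = [ex for doc in train_docs for ex in augment(doc)]
    test_set = [ex for doc in test_docs for ex in augment(doc)]
    return train_set, test_set

def augment_then_split(source_docs, augment, n_test=1000, seed=0):
    """The flawed alternative: augment everything, then split. Augmented
    copies of the same source document land on both sides of the split."""
    examples = [ex for doc in source_docs for ex in augment(doc)]
    random.Random(seed).shuffle(examples)
    return examples[n_test:], examples[:n_test]
```

With `split_then_augment`, the sets of source documents behind the train and test examples are disjoint by construction; with `augment_then_split`, siblings of almost every test example sit in the training set.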
That would definitely make the evaluation more robust. My fear is that with LLMs at hand people became allergic to preparing good human-labelled evaluation sets and would always to some degree use an LLM as a crutch.
Haha that's crazy I'm so used to reading RL papers that when the blog linked to a textbook about RL I just filled in Sutton & Barto without clicking on the link or thinking any further about the matter.
I think the other criticism I have is that the historical importance of RLHF to ChatGPT is sort of sidelined: the author pinpoints something like the rise of agents in 2023-2024 as the beginning of RL's influence on language modelling. In fact, the first LLM to attain widespread success was ChatGPT, and its secret sauce was RLHF; there's no need to start the story so late.
I think it's pretty obvious it's 1. Given the recent huge, clearly politically-motivated cuts from the current administration, it feels pretty likely that FOIA could be disrupted under the guise of "cost-saving".
And I think you're supposed to be generous to the commenter, not the current administration ;)
I love that JabRef supports working with multiple libraries (having several open at the same time, moving entries between them). The best Zotero could do was restart with different preference files (has that changed? I haven't used it in some time).
And I really like that JabRef syncing requires just syncing the library folder. Zotero syncing really nudges you toward the paid plan; setting up WebDAV just isn't as simple, and the list of supported providers isn't that long.
It really helped me that the backend is a plain BibTeX file. I could resolve issues with it myself, and I can also version libraries with git.
> IR temperature sensor for checking your body temperature or stuff you baking in the oven
> tiny thermal camera sensor for inspecting leaks in house for the winter
So, just a thermometer gun? Those cost like $20-30 on Amazon, and I've never needed one outside my home/kitchen. Why in the world would you want this in a phone, haha.
If they are producing and selling it on Amazon, that means someone is buying it, even if you don't need it. A body temperature check would definitely be handy. Those sensors definitely don't cost $20-30: I had a CC1350 SensorTag, and it already had one at a retail price of around ~$35 (but with 10 different sensors inside altogether, and that was bought 10 years ago).
They also sell smart outlets, back massagers, and garden sprinklers on Amazon. That doesn't imply people would find them handy in their phone.
I think it'd be an easier pitch in the watch though as that's where they are already shoving most of the health sensors (and have wrist temperature monitoring already).
You can also read it in the voice of Tim Cook's 2015 3D Touch announcement, or Zu announcing the ZTE device with a 3D screen in 2017, or whoever at LG announced the wide-angle lens, got meh-to-bad reviews for it, and then it took off afterward anyway.
My point here is not that it can't ever be something anyone would want; rather, that something selling as another device on Amazon has no weight one way or the other on whether it'd be a good thing to add to a phone.
Good summary of some of the main "theoretical" criticisms of LLMs, but I feel that it's a bit dated and ignores the recent trend of iterative post-training, especially with human feedback. Major chatbots are no doubt being iteratively refined on feedback from users, i.e. interaction feedback, RLHF, RLAIF. So ChatGPT could fall within the sort of "enactive" perspective on language, and it definitely goes beyond the issues of static datasets and data completeness.
Sidenote: the authors make a mistake when citing Wittgenstein to find similarity between humans and LLMs. Language modelling on a static dataset is mostly not a language game (see Bender and Koller's section on distributional semantics and caveats on learning meaning from "control codes")
It does; that's what the "direct preference" part of DPO means. You just avoid training an explicit reward model on it, as in RLHF, and instead directly optimize the log probability of preferred vs. dispreferred responses.
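The core of the DPO objective is small enough to sketch. This is a toy single-pair version with made-up log-probabilities, not any particular library's API:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    logp_* are the policy's sequence log-probs; ref_logp_* come from a
    frozen reference model. No explicit reward model is trained: the
    implicit reward is beta times the log-prob ratio vs. the reference.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)): small when the policy prefers the chosen
    # response more strongly than the reference model does.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The loss shrinks as the policy's preference margin over the reference grows, which is the "directly optimize for preferred vs. dispreferred" part.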
What is it called when humans interact with a model through lengthy exchanges (mostly humans correcting the model's responses to a posed question, mostly through chat, labeling each statement by the model as correct or not), and then all of that text (possibly with some editing) is fed to another model to train it?
I don’t think that process has a specific name. It’s just how training these models works.
Conversations you have with, say, ChatGPT are likely stored, then sorted through somehow, then added to an ever-growing dataset of conversations that is used to train entirely new models.
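A rough sketch of how such logs might be distilled into training pairs. The turn schema (`role`, `text`, `correct`) is entirely made up for illustration; real pipelines surely differ:

```python
import json

def conversations_to_pairs(conversations):
    """Turn labeled chat logs into (prompt, response) training pairs.

    `conversations` is a list of turn lists; each turn is a dict like
    {"role": "user"|"assistant", "text": ..., "correct": bool}.
    Keep only assistant turns a human marked correct, paired with the
    user message immediately before them.
    """
    pairs = []
    for convo in conversations:
        for prev, turn in zip(convo, convo[1:]):
            if (turn["role"] == "assistant" and turn.get("correct")
                    and prev["role"] == "user"):
                pairs.append({"prompt": prev["text"],
                              "response": turn["text"]})
    return pairs

def to_jsonl(pairs):
    """Serialize pairs one-per-line, a common fine-tuning file format."""
    return "\n".join(json.dumps(p) for p in pairs)
```

So the human corrections act as free labels: only the turns the human accepted survive into the dataset the next model is trained on.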