Whenever I see a claim about GPT I get temporarily interested until I learn it’s GPT-3.5 and not GPT-4.
4 isn’t just marginally better at most tasks I use it for, it’s operating at an entirely different level to the point where I have little (no?) day-to-day use of 3.5 at this point.
I'm assuming you use GPT-4 via ChatGPT Plus. Does the message cap bother you? I heard it's something like 25 messages per 3 hours. That sounds so low I don't even bother subscribing.
I guess this doesn't apply if you use it via the api.
I was initially deterred. But in practice, when using it for professional purposes, I never encounter it.
Your coding speed is unlikely to be fast enough to need 25 code segments in 3 hours. GPT-4 outputs something, then you need time to double-check, test, do additional googling, etc. It's still a massive speed boost.
Using it recreationally (especially chatting) will result in a lot more requests.
The only time I hit it, I usually realize I need a mental break anyway, so it's usually plenty. I suppose it depends on how you're using it, but for me I ask it for code for things I could write myself but would rather not: I'd rather focus my energy on the bigger problem than on a single function to merge two objects while keeping a joined list sorted... that kind of thing it's great for.
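To make that concrete, the sort of throwaway utility I mean is something like this (my own sketch; the exact shape of the "objects" is just my illustration, heapq.merge is one way to do it):

    from heapq import merge

    def merge_sorted(a, b, key=None):
        # Combine two already-sorted lists into one sorted list,
        # lazily, without re-sorting the whole thing.
        return list(merge(a, b, key=key))

    merge_sorted([1, 4, 9], [2, 3, 10])  # [1, 2, 3, 4, 9, 10]

Trivial to write, but it's exactly the kind of thing I'd rather not context-switch for.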
I have no idea what happened. My original 1-month Plus subscription to test GPT-4 was, as usual, limited to 25 messages per 3 hours and tied to my personal Gmail address. But recently, in the last week, I resubscribed and accidentally did so under my work email, which has a more specialized and restricted TLD, and under that account I don't have any quota limits for the GPT-4 version of ChatGPT.
It might seem a small thing, having to space prompts out to 25 every 3 hours when you might not have used more than 100-200 in a day anyway, but the net result is liberating. I experiment, explore the limits, and get whimsical with it to a much greater extent than when I had to consciously treat each prompt as a rationed resource.
It does bother me; I've been hit by it 3 times now. (I use it as a daily driver, but for code you spend enough time between prompts working that it's rare to go through that volume.)
When I hit the limit, I work on the problem myself and wait until 4 resets instead of relying on 3.5. 4 is so much better that I don’t trust 3.5 with my work anymore.
For me, tbh, I kinda like the limit. I use GPT-4 a lot lately. Hitting the limit reminds me that I got too lazy writing code myself or got way too deep into it. Then I just close the tab and remember that I still love writing code the (not quite yet) old way.
I've never hit the limit because GPT-4 is slow (like a dialup modem) and I don't like waiting for it. Usually I do something else while it's writing a response.
It sounds very low, and yet it very rarely bothers me. It sure is annoying when it does, but in practice the limit feels a lot higher than the number suggests.
GPT-4 barely performs above 3.5 now. They’ve resource-constrained or otherwise hobbled it to support all the corners of Microsoft products they’re stuffing it in. The amount of errors and logic degradation after the May update is incredibly obvious for all but the most trivial use cases.
It’s going to be very funny if being turned into a next generation Clippy is what makes them lose out to their competitors
> With these evaluation instructions, we compare RLHF model responses to Davinci003 responses and measure the fraction of times the RLHF model is preferred; we call this statistic the win-rate.
> Of the methods we studied, PPO proves the most effective, improving the win-rate against Davinci003 from 44% to 55% according to human evaluation, which even outperforms ChatGPT.
…for the metric we invented, which measures… the difference between a simulated and a human-evaluated result.
Or something.
Does anyone have a good idea of what this metric actually means and if it is actually relevant to anything useful?
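For what it's worth, my best reading of the paper is that it's nothing more exotic than the fraction of head-to-head comparisons where the annotator (human or simulated) prefers the model's response over Davinci003's on the same instruction. Roughly (my own sketch, not the authors' code):

    def win_rate(preferences):
        # preferences: one boolean per instruction, True if the
        # annotator preferred the candidate model's response over
        # Davinci003's response to the same instruction.
        return sum(preferences) / len(preferences)

    win_rate([True, True, False, True])  # 0.75

So 50% means indistinguishable from Davinci003 under that annotator, and the quoted 44% -> 55% is the PPO model going from slightly-worse to clearly-preferred. Whether the annotator is human or simulated is exactly the thing the rest of the paper is about.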
> We find that PPO_sim trained in AlpacaFarm only achieves a win-rate of 43%, while PPO_GPT-4-sim trained on GPT-4 data achieves a win-rate of 50%. To contextualize these results, the initial SFT model has a win-rate of 44%, PPO_human has a win-rate of 55%, and the best non-PPO human method has a win-rate of 51% (Best-of-16). Thus, training in simulation can provide good models directly for deployment, though this approach suffers a 5% performance gap relative to collecting real human annotations.
...
> However, we also observe that no single LLM-based annotator captures the heterogeneity of human annotation, and substantial amounts of noise had to be injected in the simulated preference for rankings of methods trained in AlpacaFarm to match those trained with real human feedback.
...and, in summary:
> We showed that AlpacaFarm substantially lowers the cost and iteration time of research on and development of methods for learning with pairwise feedback. AlpacaFarm provides a blueprint for constructing other useful simulators for AI research that requires human supervision, and we view it as an exciting opportunity to expand this simulation approach to support data from other domains as well as methods that learn from alternative forms of human feedback.
Ok.
...but that's not what the blog post said. The blog post said:
> Of the methods we studied, PPO proves the most effective, improving the win-rate against Davinci003 from 44% to 55% according to human evaluation, which even outperforms ChatGPT.
The closest the paper got to saying that was:
> The other mismatch is ChatGPT against PPO, where human annotators preferred PPO (55.1% vs 52.9%) unlike the simulator (46.8% vs 61.4%).
That's interesting.
> In both cases, these are not major mistakes, as we do not expect SFT_52k to be much worse than SFT_10k or for a 7B LLaMA model to substantially outperform ChatGPT.
?? Mistakes?
So... I mean, yes. I'm judging. When you write a blog saying "outperforms ChatGPT" and then the paper doesn't say that... well.
Yeah, and I think they are using an older version (3.0? 3.5?) of ChatGPT, not GPT-4, which is way better. Can anyone verify? They confusingly list GPT-4 as a separate LLM, even though ChatGPT supports GPT-4.
No one using ChatGPT is confused. You have to make an explicit choice in the model switcher, and if you are using the API you have to pass the model name in as a parameter.
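With the (pre-1.0) openai Python package, for example, the call looks like this; there's no way to get GPT-4 by accident:

    import openai

    # The model is an explicit, required parameter: "gpt-4" and
    # "gpt-3.5-turbo" are separate names, not versions of one default.
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "Hello"}],
    )
    print(resp.choices[0].message.content)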
Absolutely off topic, but I just got back from a week in Peru, where the alpaca is a prominent member of the local fauna.
For folks in the US at least, it's a relatively inexpensive trip and an absolutely gobsmackingly gorgeous country with friendly people and amazing food. Highly recommended!!!
Interesting, did other local fauna converse with you in standard 5-paragraph essay formats as taught to humans >= 12 years old, or was it only the alpaca that did so?
I wonder how much longer this "Using LLMs to evaluate the quality of other LLMs" can last. Certainly it has proven valuable and useful up until now, especially since ChatGPT is a pretty high bar to evaluate against.
But it also seems like a strange, incestuous, closed system approach.
Like, unless you are introducing something new into the system, you just have the system churning against itself, probably until it reaches an equilibrium (or else becomes incoherent).
Not really. Pretty much the "killer app" feature of ChatGPT is RLHF. Whether or not the current RLHF-ed Alpaca really beats ChatGPT, it is pretty obvious that local LLMs can be RLHF-ed, and it is only a matter of time before people realize running an RLHF-ed LLM locally is a better option than running ChatGPT with all the security concerns of running something "in the cloud" (which, as the famous saying goes, is just "somebody else's computer").
Winning by generating longer answers is not a win for me. Maybe raters prefer long answers, but in reality long answers are only good if they provide extra important information.
They should try to compare answers of similar length.
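Even a crude control would help here, e.g. only counting comparisons where the two answers are within some length tolerance of each other (a sketch; the 20% tolerance is an arbitrary choice of mine):

    def length_matched_win_rate(pairs, tol=0.2):
        # pairs: (model_len, baseline_len, model_preferred) tuples.
        # Keep only comparisons where the answer lengths are comparable,
        # so "won by being longer" cases are excluded.
        kept = [won for m, b, won in pairs
                if abs(m - b) <= tol * max(m, b)]
        return sum(kept) / len(kept) if kept else None

If the win-rate drops a lot on the length-matched subset, that tells you the preference was mostly about verbosity.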
RLHF is supervised learning on top of unsupervised learning. Is supervised learning at some point in the process a requirement for all reasonable ML models?