Whenever I see a claim about GPT I get temporarily interested until I learn it’s GPT-3.5 and not GPT-4.
4 isn’t just marginally better at most tasks I use it for, it’s operating at an entirely different level to the point where I have little (no?) day-to-day use of 3.5 at this point.
I'm assuming you use GPT-4 via ChatGPT Plus. Does the message cap bother you? I heard it's something like 25 messages per 3 hours. That sounds so low I don't even bother subscribing.
I guess this doesn't apply if you use it via the api.
I was initially deterred. But in practice, when using it for professional purposes, I never encounter it.
Your coding speed is unlikely to be fast enough to need 25 code segments in 3 hours. GPT-4 outputs something, then you need time to double-check, test, do additional googling, etc. It's still a massive speed boost.
Using it recreationally (especially chatting) will result in a lot more requests.
The only time I hit it, I usually realize I need a mental break anyway, so it's usually plenty. I suppose it depends on how you're using it, but for me I ask it for code for things I could write myself but would rather not: I'd rather focus my energy on the bigger problem than on a single function to merge two objects while keeping a joined list sorted... that kind of thing it's great for.
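To make that concrete, the sort of throwaway utility I mean is something like this (my own sketch; the exact shape of the "objects" is just my illustration, heapq.merge is one way to do it):

    from heapq import merge

    def merge_sorted(a, b, key=None):
        # Combine two already-sorted lists into one sorted list,
        # lazily, without re-sorting the whole thing.
        return list(merge(a, b, key=key))

    merge_sorted([1, 4, 9], [2, 3, 10])  # [1, 2, 3, 4, 9, 10]

Trivial to write, but it's exactly the kind of thing I'd rather not context-switch for.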
I have no idea what happened. My original 1-month Plus subscription to test GPT-4 was, as usual, limited to 25 messages per 3 hours and tied to my personal Gmail address. But recently, in the last week, I resubscribed and accidentally did so under my work email, which has a more specialized and restricted TLD, and under that account I don't have any quota limits for the GPT-4 version of ChatGPT.
It might seem a small thing, having to space prompts out to 25 every 3 hours when you might not have used more than 100-200 in a day anyway, but the net result is liberating. I experiment, explore the limits, and get whimsical with it to a much greater extent than when I had to consciously treat each prompt as a rationed resource.
It does bother me; I've been hit by it 3 times now. (I use it as a daily driver, but for code you spend enough time between prompts working that it's rare to go through that volume.)
When I hit the limit, I work on the problem myself and wait until 4 resets instead of relying on 3.5. 4 is so much better that I don’t trust 3.5 with my work anymore.
For me, tbh, I kinda like the limit. I use GPT-4 a lot lately. Hitting the limit reminds me that I got too lazy writing code myself or got way too deep into it. Then I just close the tab and remember that I still love writing code the (not quite yet) old way.
I've never hit the limit because GPT-4 is slow (like a dialup modem) and I don't like waiting for it. Usually I do something else while it's writing a response.
It sounds very low, and yet it very rarely bothers me. It sure is annoying when it does, but in practice the limit feels a lot higher than the number suggests.
GPT-4 barely performs above 3.5 now. They’ve resource-constrained or otherwise hobbled it to support all the corners of Microsoft products they’re stuffing it in. The amount of errors and logic degradation after the May update is incredibly obvious for all but the most trivial use cases.
It’s going to be very funny if being turned into a next generation Clippy is what makes them lose out to their competitors
> With these evaluation instructions, we compare RLHF model responses to Davinci003 responses and measure the fraction of times the RLHF model is preferred; we call this statistic the win-rate.
> Of the methods we studied, PPO proves the most effective, improving the win-rate against Davinci003 from 44% to 55% according to human evaluation, which even outperforms ChatGPT.
…for the metric we invented, which measures… the difference between a simulated and a human-evaluated result.
Or something.
Does anyone have a good idea of what this metric actually means and if it is actually relevant to anything useful?
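For what it's worth, my best reading of the paper is that it's nothing more exotic than the fraction of head-to-head comparisons where the annotator (human or simulated) prefers the model's response over Davinci003's on the same instruction. Roughly (my own sketch, not the authors' code):

    def win_rate(preferences):
        # preferences: one boolean per instruction, True if the
        # annotator preferred the candidate model's response over
        # Davinci003's response to the same instruction.
        return sum(preferences) / len(preferences)

    win_rate([True, True, False, True])  # 0.75

So 50% means indistinguishable from Davinci003 under that annotator, and the quoted 44% -> 55% is the PPO model going from slightly-worse to clearly-preferred. Whether the annotator is human or simulated is exactly the thing the rest of the paper is about.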
> We find that PPO_sim trained in AlpacaFarm only achieves a win-rate of 43%, while PPO_GPT-4-sim trained on GPT-4 data achieves a win-rate of 50%. To contextualize these results, the initial SFT model has a win-rate of 44%, PPO_human has a win-rate of 55%, and the best non-PPO human method has a win-rate of 51% (Best-of-16). Thus, training in simulation can provide good models directly for deployment, though this approach suffers a 5% performance gap relative to collecting real human annotations.
...
> However, we also observe that no single LLM-based annotator captures the heterogeneity of human annotation, and substantial amounts of noise had to be injected in the simulated preference for rankings of methods trained in AlpacaFarm to match those trained with real human feedback.
...and, in summary:
> We showed that AlpacaFarm substantially lowers the cost and iteration time of research on and development of methods for learning with pairwise feedback. AlpacaFarm provides a blueprint for constructing other useful simulators for AI research that requires human supervision, and we view it as an exciting opportunity to expand this simulation approach to support data from other domains as well as methods that learn from alternative forms of human feedback.
Ok.
...but that's not what the blog post said. The blog post said:
> Of the methods we studied, PPO proves the most effective, improving the win-rate against Davinci003 from 44% to 55% according to human evaluation, which even outperforms ChatGPT.
The closest the paper got to saying that was:
> The other mismatch is ChatGPT against PPO, where human annotators preferred PPO (55.1% vs 52.9%) unlike the simulator (46.8% vs 61.4%).
That's interesting.
> In both cases, these are not major mistakes, as we do not expect SFT_52k to be much worse than SFT_10k or for a 7B LLaMA model to substantially outperform ChatGPT.
?? Mistakes?
So... I mean, yes. I'm judging. When you write a blog saying "outperforms ChatGPT" and then the paper doesn't say that... well.
Yeah, and I think they are using an older version (3.0? 3.5?) of ChatGPT, not GPT-4, which is way better. Can anyone verify? They confusingly list GPT-4 as a separate LLM, even though ChatGPT supports GPT-4.
No one using ChatGPT is confused. You have to make an explicit choice in the model switcher, and if you are using the API you have to pass the model name in as a parameter.
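With the (pre-1.0) openai Python package, for example, the call looks like this; there's no way to get GPT-4 by accident:

    import openai

    # The model is an explicit, required parameter: "gpt-4" and
    # "gpt-3.5-turbo" are separate names, not versions of one default.
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "Hello"}],
    )
    print(resp.choices[0].message.content)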
Absolutely off topic, but I just got back from a week in Peru, where the alpaca is a prominent member of the local fauna.
For folks in the US at least, it's a relatively inexpensive trip and an absolutely gobsmackingly gorgeous country with friendly people and amazing food. Highly recommended!!!
Interesting, did other local fauna converse with you in standard 5-paragraph essay formats as taught to humans >= 12 years old, or was it only the alpaca that did so?
I wonder how much longer this "Using LLMs to evaluate the quality of other LLMs" can last. Certainly it has proven valuable and useful up until now, especially since ChatGPT is a pretty high bar to evaluate against.
But it also seems like a strange, incestuous, closed system approach.
Like, unless you are introducing something new into the system, you just have the system churning against itself, probably until it reaches an equilibrium (or else becomes incoherent).
Not really. Pretty much the "killer app" feature of ChatGPT is RLHF. Whether or not the current RLHF-ed Alpaca really beats ChatGPT, it is pretty obvious that local LLMs can be RLHF-ed, and it is only a matter of time before people realize running an RLHF-ed LLM locally is a better option than running ChatGPT with all the security concerns of running something "in the cloud" (which, as the famous saying goes, is just "somebody else's computer").
Winning by generating longer answers is not a win for me. Maybe raters prefer long answers, but in reality long answers are only good if they provide extra important information.
They should try to compare answers of similar length.
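Even a crude control would help here, e.g. only counting comparisons where the two answers are within some length tolerance of each other (a sketch; the 20% tolerance is an arbitrary choice of mine):

    def length_matched_win_rate(pairs, tol=0.2):
        # pairs: (model_len, baseline_len, model_preferred) tuples.
        # Keep only comparisons where the answer lengths are comparable,
        # so "won by being longer" cases are excluded.
        kept = [won for m, b, won in pairs
                if abs(m - b) <= tol * max(m, b)]
        return sum(kept) / len(kept) if kept else None

If the win-rate drops a lot on the length-matched subset, that tells you the preference was mostly about verbosity.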
RLHF is supervised learning on top of unsupervised learning. Is supervised learning at some point in the process a requirement for all reasonable ML models?