We may have already - data is more important than anything else which is why nob...

lambdaba · on March 17, 2024

Claude 3 has *decisively* beat GPT-4, I wonder how all their attributes compare.

orbital-decay · on March 17, 2024

Has it, though? LMSys Arena Leaderboard (blind ranking by humans) [0] positions Opus just below GPT-4 with a negligible ELO gap.

[0] https://chat.lmsys.org/

espadrine · on March 17, 2024

A number of AI companies have a naming/reproducibility issue.

GPT4 Turbo, released last November, is a separate version that is much better than GPT-4 (winning 70% of human preferences in blind tests), released in March 2023.

Claude 3 Opus beats release-day GPT-4 (winning 60% of human preferences), but not GPT-4 Turbo.

In the LMSys leaderboard, release-day GPT-4 is labeled gpt-4-0314, and GPT4 Turbo is labeled gpt-4-1106-preview.

staticman2 · on March 18, 2024

That "blind ranking" is limited to about 2,000 tokens of context. So it's certainly not evaluating how good the models are at complex assignments.

BoorishBears · on March 18, 2024

Chatbot Arena is not a blind ranking.

Many, if not most, users intentionally ask the models questions to tease out their canned disclaimers: so they know exactly which model is answering.

On one hand it's fair to say disclaimers affect the usefulness of the model, but on the other I don't think most people are solely asking these LLMs to produce meth or say "fuck", and that has an outsized effect on the usefulness of Chatbot Arena as a general benchmark.

I personally recommend people use it at most as a way to directly test specific LLMs and ignore it as a benchmark.

swalsh · on March 17, 2024

I don't know if Claude is "smarter" in any significant way. But its harder working. I can ask it for some code, and I never get a placeholder. It dutifully gives me the code I need.

lambdaba · on March 17, 2024

It understands instructions better, it's rarer to have it misunderstand, and I have to be less careful with prompting.

stainablesteel · on March 17, 2024

i like some of claudes answers better, but it doesnt seem to be a better coder imo

simonw · on March 17, 2024

I've found it to be significantly better for code than GPT-4 - I've had multiple examples where the GPT-4 solution contained bugs but the Claude 3 Opus solution was exactly what I wanted. One recent example: https://fedi.simonwillison.net/@simon/112057299607427949

How well models work varies wildly according to your personal prompting style though - it's possible I just have a prompting style which happens to work better with Claude 3.

asciii · on March 17, 2024

> according to your personal prompting style though

I like the notion of someone’s personal prompting style (seems like a proxy for those that can prepare a question with context about the other’s knowledge) - that’s interesting for these systems in future job interviews

bugglebeetle · on March 17, 2024

What is your code prompting style for Claude? I’ve tried to repurpose some of my GPT-4 ones for Claude and have noticed some degradation. I use the “Act as a software developer/write a spec/implement step-by-step” CoT style.

simonw · on March 17, 2024

Almost impossible to describe prompting style, but here are some examples of how I've used Claude 3:

https://gist.github.com/simonw/4cecde4a729f4da0b5059b50c8e01... - writing a Python function

https://gist.github.com/simonw/408fcf28e9fc6bb2233aae694f8cd... - most sophisticated example, building a JavaScript command palette

https://gist.github.com/simonw/2002e2b56a97053bd9302a34e0b83... - asking it to refactor some existing code

I don't use the "Act as a X" format any more, I'm not at all convinced it has a noticeable impact on quality. I think it's yet another example of LLM superstition.

lgas · on March 17, 2024

> I don't use the "Act as a X" format any more, I'm not at all convinced it has a noticeable impact on quality. I think it's yet another example of LLM superstition.

It's very contextually dependent. You really have to things like this for your specific task, with your specific model, etc. Sometimes it helps, sometimes it hurts, and sometimes it does nothing at all.

bugglebeetle · on March 17, 2024

Super helpful! Thanks!

furyofantares · on March 17, 2024

I didn't know people were still doing this "act as etc etc" instructional prompting.

I just tell it my coding problem. Or when making something from scratch, ask for small things and incrementally add.

furyofantares · on March 17, 2024

I've found it significantly better than GPT4 for code and it's become my go-to for coding.

That's actually saying something, because there's also serious drawbacks.

- Feels a little slower. Might just be UI

- I have a lot of experience prompting GPT4

- I don't like using it for non-code because it gives me to much "safety" pushback

- No custom instructions. ChatGPT knows I use macos and zsh and a few other preferences that I'd rather not have to type into my queries frequently

I find all of the above kind of annoying and I don't like having two different LLMs I go to daily. But I mention it because it's a fairly significant hurdle it had to overcome to become the main thing I use for coding! There were a number of things where I gave up on GPT then went to Claude and it did great; never had the reverse experience so far and overall just feels like I've had noticeably better responses.

htrp · on March 17, 2024

citation needed (other than 'vibes')

YetAnotherNick · on March 17, 2024

There is no reason to believe GPT-4 had more(or higher quality) data than Google etc. has now. GPT-4 was entirely trained before the Microsoft deal. If OpenAI could pay to acquire data in 2023, >10 companies could acquire similar quality by now, and no one has similar quality model in a year.

austhrow743 · on March 17, 2024

The more disregard a company has for intellectual property rights, the more data they can use.

Google had far more to lose from a "copyright? lol" approach than OpenAI did.

brookst · on March 17, 2024

I was under the impression training was at best an undefined area of IP law. Is there any aspect of copyright that prohibits training models?

simonw · on March 17, 2024

This is being tested by a number of lawsuits right now, most notably the NY Times one: https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec20...

The key questions are around "fair use". Part of the US doctrine of fair use is "the effect of the use upon the potential market for or value of the copyrighted work" - so one big question here is whether a model has a negative impact on the market for the copyrighted work it was trained on.

sroussey · on March 17, 2024

I don’t think the New York Times thing is that much about training, than it is about the fact that ChatGPT can use Bing and Bing has access to New York Times articles for search purposes.

simonw · on March 17, 2024

If you read the lawsuit it's absolutely about training. The Bing RAG piece is one of the complaints in there but it's by no means the most important.

Take a look at https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec20... - bullet points 2 and 4 on pages 2/3 are about training data. Bullet point 5 is the Bing RAG thing.

sroussey · on March 18, 2024

Ah, thanks!

YetAnotherNick · on March 18, 2024

Having used both Google's and OpenAI's models, the kind of issue they have are different. Google's models are superior or at least on par in knowledge. It's the instruction following and understanding where OpenAI is significantly better. I don't think pretraining data is the reason of this.

supafastcoder · on March 18, 2024

> Google had far more to lose from a "copyright? lol" approach than OpenAI did.

The company that scrapes trillions of web pages has an issue with copyright?

sib · on March 18, 2024

Well... Googlebot does pay attention to robots.txt - I don't think (original) OpenAI-bot did.

squigz · on March 17, 2024

I think Groq is something else?

LorenDB · on March 17, 2024

Indeed, Groq is a company building inference accelerators. Grok is completely unaffiliated.

andy99 · on March 17, 2024

Edited, I did mean the Grok in the article not the inference chip.

ldjkfkdsjnv · on March 17, 2024

Claude > GPT4. Anyone using these models on a daily basis knows this

int_19h · on March 18, 2024

I use these models regularly, and Claude is dumb as a rock compared to GPT-4.

jstummbillig · on March 17, 2024

It is known