One subtle thing: Musk said "open-source", we got "open-weights" instead (still ...

tylerekahn · on March 17, 2024

This is the weights and the model under Apache 2.0 license. What do you mean by open-source?

https://github.com/xai-org/grok/blob/main/model.py

https://github.com/xai-org/grok/blob/main/run.py#L25

pclmulqdq · on March 17, 2024

Still better than most of the "open weights" models that have massively restrictive terms.

solarkraft · on March 17, 2024

He also called permissively licensing Tesla's patents "open sourcing" them. He's at the forefront of misusing the term.

drexlspivey · on March 17, 2024

The “source” in “open source” refers to source code which they released. A dataset is not source code, if anyone is misusing the term it’s you.

HarHarVeryFunny · on March 18, 2024

If you can't rebuild it, then how can you be considered to have the "source code"?

The training data isn't a dataset used at runtime - it's basically the source code to the weights.

Not sure it really matters here though (who has the GPUs and desire to retrain Grok?), but just as a matter of definition "open weights" fits better than "open source".

frabcus · on March 17, 2024

I consider the weights a binary program and the source code is the training data. The training algorithm is the compiler.

I agree this isn't standard terminology, but it makes the most sense to me in terms of power dynamics and information flow.

We know from interpretability research that the weights implement algorithms, e.g. sine approximation. So they feel like binary programs to me.

solarkraft · on March 18, 2024

https://youtu.be/WyTzRnGSlcI?t=88

paulgb · on March 17, 2024

Dumb question: what should open-source mean in the context of something like this? Open access to the training data and training pipeline as well?

CharlesW · on March 17, 2024

It's not a dumb question, and the answer is "yes".

simonw · on March 17, 2024

A big catch here is that you can't slap an open source license on a bunch of copyrighted training data, and to date no-one has created a truly convincing LLM exclusively trained on public domain data. It might happen soon though - there are some convincing efforts in progress.

CharlesW · on March 17, 2024

Absolutely, because it’s trained mostly on unlicensed, copyrighted content, they basically can’t release source.

gfodor · on March 17, 2024

Many people think these companies are training on unlicensed data but I think OpenAI licenses their data, they just “license” it the way one would need to in order to read it.

zer00eyz · on March 17, 2024

You all keep using the word "Data"

Data, as in facts, as in the frequency of one word in relation to another.

"Copyright does not protect facts, ideas, systems, or methods of operation, although it may protect the way these things are expressed..." FROM: https://www.copyright.gov/help/faq/faq-protect.html

It's not a question of if, but rather when, the cat gets out of the bag and the legal battle starts. The problem is that copyright applies to the expression, not the factual information it expresses (in this case word relations). Now "how math works" and "the language of the law" are going to make for an interesting court case. I suspect that math wins here, but it depends on which judge gets it and how high it goes.

gfodor · on March 18, 2024

No, the term data can be used to describe anything that can be recorded in bytes. It’s “data storage capacity” when you buy a hard drive.

CharlesW · on March 17, 2024

> …I think OpenAI licenses their data…

They've just started to (in response to lawsuits, it must be noted) and in the meantime, they're simultaneously (1) claiming that what they're doing is fair use (a.k.a. fair dealing) and (2) preparing for the day when courts confirm that it isn't.

logicchains · on March 18, 2024

https://substack.recursal.ai/p/eaglex-17t-soaring-past-llama... this one claims to have been trained only on permissively licensed data.

nabakin · on March 17, 2024

Agreed. It's ridiculous that people have to resort to calling their question dumb to avoid being attacked by toxic commenters.

dudus · on March 17, 2024

If you release that instead of the binary weights you can be both more open and less useful for users. Fun

zeroCalories · on March 17, 2024

Come on, that's not reasonable to expect from a company, or useful for indie hackers. Having weights that can be used however you like is enough for most people, even large companies.

schoen · on March 17, 2024

Maybe it should be called something else? "Openly-licensed"?

Just because the model weights are not really "source" (either as a matter of intuition or for example following the OSI "preferred form in which a programmer would modify the program" definition).

zeroCalories · on March 18, 2024

Sure, but I don't want to train anyone's model from scratch. Realistically, I can't download all the training data, or run the pipeline, or train the model. Making all of that available to me would be a massive burden on the company too, so they simply won't do it. If I'm able to fine-tune it, that's enough for me, and imo, that fits with the spirit of open/free software. We have to understand that this is fundamentally a different thing than something like the Linux kernel, and closer to something like an industrial project. The output is just a bunch of numbers instead of something physical.

SequoiaHope · on March 17, 2024

The Open Source Initiative is actively working on this over the course of this year, and your input will help define that meaning! Please see here for more:

https://opensource.org/blog/open-source-ai-definition-weekly...

Q6T46nT668w6i3m · on March 17, 2024

Yes, training and evaluation code, i.e., the code used to generate the weights.

SequoiaHope · on March 17, 2024

Yeah, Musk said “all design and engineering for the original roadster is now open source” and what we actually got was a few PCB files and zero mechanical design files, so I don’t ever trust what he says.