Hacker News

One subtle thing: Musk said "open-source", we got "open-weights" instead (still better than nothing though, so it's greatly appreciated).


These are the weights and the model code, released under the Apache 2.0 license. What do you mean by open-source?

https://github.com/xai-org/grok/blob/main/model.py

https://github.com/xai-org/grok/blob/main/run.py#L25


Still better than most of the "open weights" models that have massively restrictive terms.


He also called permissively licensing Tesla's patents "open sourcing" them. He's at the forefront of misusing the term.


The “source” in “open source” refers to source code, which they released. A dataset is not source code; if anyone is misusing the term, it’s you.


If you can't rebuild it, then how can you be considered to have the "source code"?

The training data isn't a dataset used at runtime - it's basically the source code to the weights.

Not sure it really matters here though (who has the GPUs and desire to retrain Grok?), but just as a matter of definition "open weights" fits better than "open source".


I consider the weights a binary program and the source code is the training data. The training algorithm is the compiler.

I agree this isn't standard terminology, but it makes the most sense to me in terms of power dynamics and information flow.

We know from interpretability research that the weights implement algorithms (e.g., sine approximation), so they feel like binary programs to me.
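To make the analogy concrete, here's a toy sketch (nothing to do with Grok's actual code; every name here is made up for illustration): the sine samples play the role of the "source", a plain gradient-descent loop is the "compiler", and the trained weights are the "binary" that computes an approximation of sin.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Source": samples of the function the network should learn.
x = rng.uniform(-np.pi, np.pi, size=(256, 1))
y = np.sin(x)

# Untrained "binary": a tiny 1-16-1 MLP with a tanh hidden layer.
W1 = rng.normal(0, 0.5, size=(1, 16)); b1 = np.zeros(16)
W2 = rng.normal(0, 0.5, size=(16, 1)); b2 = np.zeros(1)

# "Compiler": full-batch gradient descent on squared error.
lr = 0.2
for _ in range(8000):
    h = np.tanh(x @ W1 + b1)            # hidden activations
    err = (h @ W2 + b2) - y             # prediction error
    gW2 = h.T @ err / len(x); gb2 = err.mean(axis=0)
    dh = (err @ W2.T) * (1 - h ** 2)    # backprop through tanh
    gW1 = x.T @ dh / len(x); gb1 = dh.mean(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

# The trained weights now "run" an approximation of sin.
mae = float(np.abs((np.tanh(x @ W1 + b1) @ W2 + b2) - y).mean())
```

Shipping only W1/b1/W2/b2 is "open weights": you can run and fine-tune the function, but you can't rebuild it without the samples and the training loop.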



Dumb question: what should open-source mean in the context of something like this? Open access to the training data and training pipeline as well?


It's not a dumb question, and the answer is "yes".


A big catch here is that you can't slap an open-source license on a bunch of copyrighted training data, and to date no one has created a truly convincing LLM trained exclusively on public-domain data. It might happen soon, though: there are some convincing efforts in progress.


Absolutely. Because it's trained mostly on unlicensed, copyrighted content, they basically can't release the source.


Many people think these companies are training on unlicensed data, but I think OpenAI licenses their data; they just “license” it the way one would need to in order to read it.


You all keep using the word "Data"

Data, as in facts, as in the frequency of one word in relation to another.

"Copyright does not protect facts, ideas, systems, or methods of operation, although it may protect the way these things are expressed..." FROM: https://www.copyright.gov/help/faq/faq-protect.html

It's not a question of if, but when, the cat gets out of the bag and the legal battle starts. The problem is that copyright applies to the expression, not to the factual information it expresses (in this case, word relations). Now "how math works" and "the language of the law" are going to make for an interesting court case. I suspect that math wins here, but it depends on which judge gets it and how high it goes.


No, the term data can be used to describe anything that can be recorded in bytes. It’s “data storage capacity” when you buy a hard drive.


> …I think OpenAI licenses their data…

They've just started to (in response to lawsuits, it must be noted) and in the meantime, they're simultaneously claiming that (1) what they're doing is fair use (a.k.a. fair dealing) and (2) preparing for the day when courts confirm that it isn't.


https://substack.recursal.ai/p/eaglex-17t-soaring-past-llama... this one claims to have been trained only on permissively licensed data.


Agreed. It's ridiculous that people have to resort to calling their own question dumb to avoid being attacked by toxic commenters.


If you release that instead of the binary weights, you can be both more open and less useful to users. Fun.


Come on, that's not reasonable to expect from a company, or useful for indie hackers. Having weights that can be used however you like is enough for most people, even large companies.


Maybe it should be called something else? "Openly-licensed"?

Just because the model weights are not really "source" (either as a matter of intuition or, for example, under the OSI's "preferred form in which a programmer would modify the program" definition).


Sure, but I don't want to train anyone's model from scratch. Realistically, I can't download all the training data, or run the pipeline, or train the model. Making all of that available to me would be a massive burden on the company too, so they simply won't do it. If I'm able to fine-tune it, that's enough for me, and imo, that fits with the spirit of open/free software. We have to understand that this is fundamentally a different thing than something like the Linux kernel, and closer to something like an industrial project. The output is just a bunch of numbers instead of something physical.
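A toy sketch of why weights alone are enough for this use case (purely illustrative, not from any real model or library): freeze the released "pretrained" weights and train only a small LoRA-style low-rank adapter on new data, so the original training data is never needed.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for released open weights: a frozen linear map. (A toy
# assumption; a real model is vastly bigger, but the point is the same.)
W_pre = rng.normal(size=(8, 8))

# New-task data: the true map is the pretrained one plus a rank-1 shift.
u = rng.normal(scale=0.5, size=(8, 1))
v = rng.normal(scale=0.5, size=(1, 8))
X = rng.normal(size=(256, 8))
Y = X @ (W_pre + u @ v)

# LoRA-style adapter: W_pre stays frozen, and only the small factors A
# (initialized to zero) and B are trained.
A = np.zeros((8, 2))
B = rng.normal(scale=0.3, size=(2, 8))

lr = 0.05
for _ in range(8000):
    err = X @ (W_pre + A @ B) - Y        # residual on the new task
    G = X.T @ err / len(X)               # gradient w.r.t. the update A @ B
    A, B = A - lr * (G @ B.T), B - lr * (A.T @ G)

mse = float(np.mean((X @ (W_pre + A @ B) - Y) ** 2))
```

The adapter is tiny compared to the frozen weights, which is why fine-tuning is feasible for individuals even when full retraining is not.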


The Open Source Initiative is actively working on this over the course of this year, and your input will help define that meaning! Please see here for more:

https://opensource.org/blog/open-source-ai-definition-weekly...


Yes, training and evaluation code, i.e., the code used to generate the weights.


Yeah, Musk said “all design and engineering for the original Roadster is now open source”, and what we actually got was a few PCB files and zero mechanical design files, so I don't ever trust what he says.



