If you can't rebuild it, then how can you be considered to have the "source code"?
The training data isn't a dataset used at runtime - it's basically the source code to the weights.
Not sure it really matters here though (who has the GPUs and desire to retrain Grok?), but just as a matter of definition "open weights" fits better than "open source".
A big catch here is that you can't slap an open source license on a bunch of copyrighted training data, and to date no one has created a truly convincing LLM trained exclusively on public domain data. It might happen soon though - there are some convincing efforts in progress.
Many people think these companies are training on unlicensed data, but I think OpenAI licenses their data; they just “license” it the way one would need to in order to read it.
It's not a question of if, but rather when, the cat gets out of the bag and the legal battle starts. The problem is that copyright applies to the expression, not the factual information it expresses (in this case word relations). Now "how math works" and "the language of the law" are going to make for an interesting court case. I suspect that math wins here, but it depends on which judge gets it and how high it goes.
They've just started to (in response to lawsuits, it must be noted), and in the meantime they're simultaneously (1) claiming that what they're doing is fair use (a.k.a. fair dealing) and (2) preparing for the day when courts confirm that it isn't.
Come on, that's not reasonable to expect from a company, or useful for indie hackers. Having weights that can be used however you like is enough for most people, even large companies.
Maybe it should be called something else? "Openly-licensed"?
Just because the model weights are not really "source" (either as a matter of intuition or, for example, following the OSI "preferred form in which a programmer would modify the program" definition).
Sure, but I don't want to train anyone's model from scratch. Realistically, I can't download all the training data, or run the pipeline, or train the model. Making all of that available to me would be a massive burden on the company too, so they simply won't do it. If I'm able to fine-tune it, that's enough for me, and imo, that fits with the spirit of open/free software. We have to understand that this is fundamentally a different thing than something like the Linux kernel, and closer to something like an industrial project. The output is just a bunch of numbers instead of something physical.
The Open Source Initiative is actively working on this over the course of this year, and your input will help define that meaning! Please see here for more:
Yeah, Musk said “all design and engineering for the original roadster is now open source” and what we actually got was a few PCB files and zero mechanical design files, so I don’t ever trust what he says.