These are not really open source models. They are open weight, which is like sharing an executable instead of the actual source code. You don’t have the things you need to reproduce the model weights. What’s actually needed for these to be open source is the training source code, the training data sets, any applicable pre- or post-processing code, evaluation code, etc.
Although training the actual model is not feasible for people without the funds and hardware, having those things would at least allow the model to be auditable. Otherwise, we really have no idea what these open weight models are doing. They could be biasing themselves in various ways that are invisible but damaging.
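To make the "executable vs. source" analogy concrete, here is a minimal sketch of what the full "source" of a model release would contain; the file names, architecture, and hyperparameters are all made up, but the shape is the point. Publish everything below alongside the weights, and the weights become a reproducible, auditable build artifact:

    # Hypothetical sketch: the "source" of a model is everything
    # upstream of the weights. None of these names are real releases.
    import torch
    import torch.nn as nn

    torch.manual_seed(0)                     # seeds must be published too

    # 1. Training data: the release pins the exact corpus.
    text = open("data/corpus.txt", "rb").read()
    data = torch.tensor(list(text), dtype=torch.long)

    # 2. Preprocessing: here, a trivial byte-level scheme.
    def get_batch(block=32, batch=16):
        ix = torch.randint(len(data) - block - 1, (batch,)).tolist()
        x = torch.stack([data[i:i + block] for i in ix])
        y = torch.stack([data[i + 1:i + block + 1] for i in ix])
        return x, y

    # 3. Training code: architecture plus optimization loop.
    model = nn.Sequential(nn.Embedding(256, 64), nn.Flatten(),
                          nn.Linear(64 * 32, 256))
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
    for step in range(1000):
        x, y = get_batch()
        loss = nn.functional.cross_entropy(model(x), y[:, -1])
        opt.zero_grad()
        loss.backward()
        opt.step()

    # 4. The weights are the *output* of everything above, the way a
    # binary is the output of compiling source code.
    torch.save(model.state_dict(), "weights.pt")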
Another thing to watch out for is licensing. Only a few models use an actual OSI-approved open source license (like Apache 2.0). Many use proprietary licenses that reference external terms that can change over time, or that limit how you can use the model. These restrictions are definitely not in the spirit of open source.
In other words, much of this is “openwashing”, a trend akin to “greenwashing”.
Anyways, I think it’s great that these models are able to challenge the really big proprietary models like ones from OpenAI or Anthropic. This shows the models are not that interesting or unique, and that the differentiation will come from who has access to training data. That is why Microsoft is scrambling to violate users’ privacy with Copilot agents autolaunched at startup (https://www.pcmag.com/news/microsoft-tests-having-copilot-la...) and why Apple does not let you disable Siri training easily but only one by one for individual apps (https://www.imore.com/how-stop-siri-learning-how-you-use-app...). And that’s also why OpenAI and others seem to be pushing for regulations that restrict AI using excuses like “safety” or “ethics”, when it is really about regulatory capture.
It's quite funny to me that "open source" has turned out to be a significantly more confusing term than "Free Software", because people don't seem to understand why the word "source" is in there.
The word "source" is literally meant to mean "where it comes from", both in regards to executable software, and large language models. If the training data for a language model is not "open", then the language model is not "open source", full stop. Training data is the source of a language model in the same way code is the source of an executable program.
> people don't seem to understand why the word "source" is in there.
Disagreeing with you is not the same as not understanding. There clearly isn't a consensus, but no shortage of people just asserting that others are wrong.
If you have access to any language models, I encourage you to ask this question:
> What is the significance of the word "source" in the expression "source code"?
I landed on this question by first asking it about "open source" rather than "source code", but the answers referred to "source code", which is a bit circular for this conversation. I'll share a truncated version of how Mistral Large replied:
> It is called the "source" because it is the origin or the primary input from which the executable form of the program is derived.
The primary input from which a language model is derived is the training data. That is the source. If the training data for a model is not open, then the model is not open source, because the source of the model is not open.
While I agree that the term "open source model" is misleading, an "open weight model" is much more than a closed-source executable. It is quite easy to modify the model, and it is possible to verify and test it. It looks like "open weight models" are just a new paradigm in the world of CS.
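For what it's worth, here is a minimal sketch of what open weights do give you, assuming a Hugging Face-style release (the checkpoint name is just one example of a weights-available model):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "mistralai/Mistral-7B-v0.1"     # any weights-available checkpoint
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)

    # Verify and test behaviorally: probe the model's outputs directly.
    ids = tok("The capital of France is", return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=8)
    print(tok.decode(out[0], skip_special_tokens=True))

    # Modify: every parameter is readable and trainable, e.g. as the
    # starting point for a fine-tune, unlike a model behind an API.
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{n_params:,} inspectable parameters")

So it is closer to shipping a binary with debug symbols than a stripped executable: you can patch and test it, you just can't rebuild it from scratch.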
It's a matter of opinion how open a model should be to be called 'open source'. It looks like some believe they have the right to define that for everybody else, like they did for software. I have to disagree. Why don't we introduce a separate term: 'open source, training infrastructure and data included for free'?
"open weight model" is confusing, because actually the architecture is open too, only data is missing.
It's a different animal. In general you cannot reproduce the model even with all the training data. There are too many random factors, and nobody keeps track of them: even the order in which training examples are drawn from the dataset is random. This leads to some interesting consequences. Given a model and a dataset, it's impossible to say whether the model was trained on exactly that data; all we can say is that some pieces of that data were used in training, and only in some cases. A model can also be 'watermarked' in ways that are hard to detect and that survive quantization and fine-tuning.
So you cannot have a model that is 'open source' in the strict, reproducible-build interpretation of the term.
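A toy illustration of that point, using nothing beyond stock PyTorch: identical code and identical data, but the unrecorded randomness (initialization, example order) yields different weights on every run:

    import torch
    import torch.nn as nn

    def train_once():
        model = nn.Linear(8, 1)              # random init, seed not pinned
        opt = torch.optim.SGD(model.parameters(), lr=0.1)
        data = torch.linspace(-1, 1, 64).reshape(8, 8)   # fixed "dataset"
        target = data.sum(dim=1, keepdim=True)
        for _ in range(100):
            idx = torch.randperm(8)          # random example order
            loss = ((model(data[idx]) - target[idx]) ** 2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
        return model.weight.detach()

    # Same source, same data, two runs, two different sets of weights.
    print(torch.allclose(train_once(), train_once()))   # False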
It seems like you can use "weights available" models to bootstrap "learning step available" models though?
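Yes, that's essentially distillation. A sketch of the idea, with illustrative names (the teacher checkpoint and prompts are placeholders): use the weights-available model to generate a synthetic dataset, publish it, and the student's entire learning step is open even though the teacher's never was:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    teacher_name = "mistralai/Mistral-7B-v0.1"   # weights-available teacher
    tok = AutoTokenizer.from_pretrained(teacher_name)
    teacher = AutoModelForCausalLM.from_pretrained(teacher_name)

    prompts = ["Explain photosynthesis.", "What is a binary tree?"]
    pairs = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        out = teacher.generate(**ids, max_new_tokens=64)
        pairs.append((p, tok.decode(out[0], skip_special_tokens=True)))

    # `pairs` is now an openly publishable dataset: anyone can audit it
    # and rerun the student's training from scratch. The teacher's
    # hidden training data leaks in only through these auditable outputs.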
I guess the problem is: what if these closed models have a 'McDonalds propaganda' folder with a dummy 'treat-as-facts' file in the training set, which I as a user would like to exclude?
Indeed, and once the 'McDonalds propaganda' folder has been laundered through the weights, there is no reliable way to scrub it, particularly without knowing what was in it - any generated data might contain a subtle echo of it.