What's the benefit for Meta? They are now the true open source AI providers (after OpenAI got closedAI), but I wonder why they keep releasing such models for free and kinda open source?
Mark Zuckerberg talks about this in their Q1 earnings call.
"I think that there's an important distinction between the products we offer and a lot of the technical infrastructure, especially the software that we -- that we write to support that. And historically, whether it's the Open Compute project that we've done or just open sourcing a lot of the infrastructure that we've built, we've historically open sourced a lot of that infrastructure, even though the products themselves are obviously were not -- we haven’t open sourced the code for our core products or anything like that.
And the reason why I think why we do this is that unlike some of the other companies in the space, we're not selling a cloud computing service where we try to keep the different software infrastructure that we're building proprietary. For us, it's way better if the industry standardizes on the basic tools that we're using and therefore we can benefit from the improvements that others make and others’ use of those tools can, in some cases like Open Compute, drive down the costs of those things which make our business more efficient too.
So I think to some degree we're just playing a different game on the infrastructure than companies like Google or Microsoft or Amazon, and that creates different incentives for us. So overall, I think that that's going to lead us to do more work in terms of open sourcing, some of the lower level models and tools.
But of course, a lot of the product work itself is going to be specific and integrated with the things that we do. So it's not that everything we do is going to be open. Obviously, a bunch of this needs to be developed in a way that creates unique value for our products, but I think in terms of the basic models, I would expect us to be pushing and helping to build out an open ecosystem here, which I think is something that's going to be important."
"we can benefit from the improvements that others make and others’ use of those tools can"
I have always had the impression that maintaining an open source project is way more work than you get back from "the community" of users. Is this not true? Are, for instance, the internal Facebook React users benefiting a huge amount from what outside contributors have built on top of React?
I think an unspoken dimension is that kneecapping the other big tech companies' entrenchments and denying them a market is always good for them, especially when, as they point out, it doesn't actually hurt any of their own business interests. Other FAANG companies are always a future threat. Hurting them is always a good business move.
Broad use helps uncover bugs and make the software more resilient and reliable. They don’t fix all the bugs, and they don’t build features the community wants for the sake of it, but having users of your tools is a benefit.
You get less input from the community than what you put in, but you also get different input than you would get from in-house devs who are all in the same bubble.
> I have always had the impression that maintaining an open source project is way more work than you get back from "the community" of users
You get a ton of valuable work back from quality contributors.
There is a vocal minority of people complaining that they feel burned out because of contributions, but attracting some high quality contributors can help a lot.
Pretraining new hires is valuable and the new hires will also train on the open source project docs.
MS put a cool $10B into OpenAI thinking they would have a massive tech moat. FB leaks LLaMA and now OpenAI only has its status as the bitcoin of LLMs (first, biggest, incumbent).
FB's plan is to F everyone else (MAAG) by making sure they can't make billions off tech that FB have sitting on the shelf, yet is extremely expensive for a true startup competitor to get in on.
The software got rewritten already and the model weights are probably not protectable.
Especially if you use the model weights to train your own model. Why would you be allowed to use copyrighted data to train your models but not other models?
Basically: "commoditise your complement" applied to Facebook means they want to commoditise the foundational tech like AI. And open source is the route to that.
"For us, it's way better if the industry standardizes on the basic tools that we're using and therefore we can benefit from the improvements that others make and others’ use of those tools can, in some cases like Open Compute, drive down the costs of those things which make our business more efficient too." -- isn't this the Web 2.0 mantra applied to software?
This is the OSS model that's been around for 30 years. Operating systems, web servers, countless other projects that helped build the internet we know today. Now, AI tools from Meta.
In the highly interesting recent memo leaked from Google, the argument is made that open source will come out the winner in the AI battle and specifically that
"Paradoxically, the one clear winner in all of this is Meta. Because the leaked model was theirs, they have effectively garnered an entire planet's worth of free labor. Since most open source innovation is happening on top of their architecture, there is nothing stopping them from directly incorporating it into their products.
The value of owning the ecosystem cannot be overstated. Google itself has successfully used this paradigm in its open source offerings, like Chrome and Android. By owning the platform where innovation happens, Google cements itself as a thought leader and direction-setter, earning the ability to shape the narrative on ideas that are larger than itself."
This makes more sense than the moat argument. Open sourcing with a noncommercial licence means they get to incorporate effort back into their project, but others can't use it in their businesses. All the academic etc effort can be captured in this way.
The software part already got reimplemented, and models are used as training data for new models by others. You could argue that if using images and other copyrighted training data is allowed, you can also use a model to train your own model.
Precisely. The old way contains innovation to a platform (fb apps, play, appstore). The new way speaks for itself. Total domination of an ecosystem from the root up.
Except neither this model nor several of their recently-lauded “open” releases are open source; they are CC-BY-NC 4.0, i.e., you are free to tinker and share, but not to use the work or derivatives for commercial purposes. Any community effort that Meta’s hobbyist-source license attracts is work that isn’t enabling commercial competition, unlike actual open source systems like Suno’s Bark (MIT) or even use-restricted-but-not-non-commercial shared source licenses like Stable Diffusion’s CreativeML Open RAIL-M.
> Any community effort that Meta’s hobbyist-source license attracts is work that isn’t enabling commercial competition
So what?
Sure, maybe the Googles of the world aren't building on top of Meta's products, but I can tell you that a lot of startups are.
Does it make these startups vulnerable, to long term future legal action? Sure, but nobody is thinking that far ahead. What people are thinking about is how to get users and show off flashy demos to investors.
Instead, people are just pushing out products, breaking Meta's licenses, and not telling people about it, while they attempt to get traction.
Strict licensing, without enforcement, is not worth the paper that the contract is written on.
So yes, it is still beneficial that the code is released, even with a bad license.
So, that's a reason Meta might wish to release a non-open-source model with this particular license, and one that provides an alternative to “Meta is doing this because they stand to benefit from open source models taking off”: specifically, “Meta is doing this because it stands to benefit from drawing energy away from open source models into ones that cannot legally be used to commercially compete”.
> Does it make these startups vulnerable, to long term future legal action? Sure, but nobody is thinking that far ahead.
Well, the startups may not be, but Meta may be, and it's acquiring a zero-cost, upside-only investment in every startup doing that. “Unjust enrichment”.
This might be an unpopular opinion on HN, but the whole “ask for forgiveness, not for permission” view some take to business feels in pretty bad taste to me.
But I am able to train my own LLM on the output of their LLM, right? Or are the big AI players going to argue that you cannot train an AI on data you don't have a license to? (See the catch 22 here?)
> But I am able to train my own LLM on the output of their LLM, right?
Sure. And, there's an argument that the license only applies to the code because model weights aren’t subject to copyright anyway. And available-under-any-license is a lot better than OpenAI’s current stance as far as enabling anyone else, since they’ve gone completely closed to the point where even their papers on their models are more PR than reproducible science. There's a continuum from secret sauce to “do what thou wilt”, and I am not a zealot arguing anything not Open Source must be rejected as not a positive step.
My guess is this isn't their competitive edge; network effects, products, data and distribution are.
In a way, it takes away their competitors' edge while racing to the bottom to compete with open source. At the same time, they establish themselves as experts and keep attracting great talent that wants to publish their work openly. And it benefits all of us, so good marketing amongst developers too.
This comment resonates with me and reminds me of T-Mobile making international roaming free; they didn’t really have a ton of business coming from that service, but knew how important it was for their competitors. They made theirs free and forced the industry down that path. (Have since added some fees back but the point is similar to your thoughts)
As Ben Thompson pointed out recently, unlike Google and OpenAI, Meta benefits from open source AI taking off because that makes everyone better content creators, which accrues further value to their social media platforms.
If Meta stands to benefit from open source AI taking off, why are its models CC-BY-NC 4.0 instead of open source?
EDIT: On reflection, you can probably extend the content creation argument to say that noncommercial tools enabling that without enabling commercial competition, to the extent that some of the models will be integrated into Meta products, is the best of all worlds for Meta, so the basic argument works even without open source in the strict sense.
Meta has one of the best if not the best open source track record. They do it likely because it does not interfere with their business model. If outsiders find ways to improve their tech it only helps them.
Facebook doesn't want the models to be the money making bit, because they aren't a licensing/subscription service. They are an ads and soon hardware-platform company. They want those bits to be what people pay for. Not the models.
All these models are licensed under a non-commercial license. So their competitors don't gain a real advantage.
Other than OpenAI (who are remarkably tight lipped), ML researchers are pretty chatty in both their papers and watercooler hangouts. So, the information is going to get out either way. Might as well get ahead of it, and look like the good guy in the process.
This model is vision-only so it can't be SOTA even if it's #1 performing in many of the original categories of benchmarks, which it is (it's a very very good model).
We've moved on from ImageNet-style tests ("Choose the most appropriate label for this image from 200 possible labels") to much more advanced "reasoning" tests[0]. PaLI[1] is potentially the SoTA here, but BEiT-3[2] may be a better example for my thesis. Notice that BEiT-3 is trained not just on images but also on natural language. It outperforms purely image-trained models even on pure-image tasks like Object Detection and Semantic Segmentation.
Take a look at the major benchmark for segmentation (ADE20K)[4]: DINOv2, 11th place; BEiT-3, 4th place. Yes, BEiT-3 has 72% more parameters, but it's also basically an entire LLM. Even GPT-4 is a multi-modal model and actually accepts images as prompt inputs; OpenAI just doesn't expose that ability.
More importantly, the new multi-modal models can understand human questioning like "What type of flowers are in the blue buckets of this image?" and respond intelligently, in English/whatever.
DINOv2 was trained with techniques borrowed from LLM training methods, but is not trained for natural language.
Purely my speculation: OpenAI is hobbling their products trying to support all kinds of integrations, specifically Microsoft’s. GPT-4 is not performant enough for end user applications, so they’ve had to gimp a lot of its reasoning to make it speedier.
This opens up an opportunity for their competitors to eat into their moat because OpenAI is treading water/downgrading their product, chasing scale. Meta is leveraging this opening to flood the field with amazing open source tools, all of which compete with OpenAI offerings, knowing that the open source community will run with them and further erode OpenAI’s moat.
It will be hard to use the model without reproducing (copying) it.
You may consider any output of the model a derivative work, and then it would apply. But that goes against the understanding that outputs of ML models are not copyrightable.
If, however, someone runs the model for a non-commercial purpose and you take the output of the model and commercialize it...
> hard to use the model without reproducing (copying) it
Can you link to a source that backs up this line of thought? It is not how I understand copyright law. AFAIK it is not about you making a copy for yourself. It is about person A copying content person B has the copyright to and then giving it to person C. That's why the license talks about "sharing".
The way I understand it, what A does with their copy of the content does not fall under the copyright law.
If at any point a person chooses to "share" the work of an artist with other people outside of the marketplace where the artist is selling their work, then those people are stealing. I really don't think it's that hard to discern fair use from theft. I also don't understand why so many people rush to defend stealing software just because "it's only numbers on a computer". Stolen software is not just a number on a computer. Quality software is very expensive to produce, and when people take that software without paying for it, that contributes to bad software being produced in the future, because the quality developers you want producing your software can no longer be afforded for your bullshit, because too many of you have no problem stealing from other people.
Now? Copyright law has been broken for decades. My 2 cents is that it's part of the whole government/technology lag: lawmakers are decades behind in terms of what they know, and their whole perspective is skewed.
To get it onto your server you need to copy it (= reproduce it), and so you have to obey the license and not use that copy for commercial purposes. IANAL, but that seems kind of obvious.
If we follow this (obvious as you say) line of thought, can you read the code while at work? Isn't that a use of the copy (your browser made) for commercial purposes?
I don't think this is what copyright is about. As far as I know, it is about sharing a copy with a third party. But I am happy to be corrected if someone has a link to a reputable source that says otherwise.
Copyright law is about copying, full-stop. It doesn't matter if you are copying it for yourself, it's still a copy. That's why copyright laws in some jurisdictions had to be updated to allow temporary copies in RAM existing as part of use (e.g. USA [0]), or to allow format-shifting like putting a CD you own on an MP3 player (e.g. NZ [1]), and some attempts didn't get it right at first (e.g. USA [2]).
Not a single time does it talk about the act of copying and/or using the software. It constantly talks about sharing it.
The consumer.org.nz page you linked to also mainly focuses on "selling" the data itself and explicitly states that you can make copies for yourself:
> Can I download mp3s from the internet and copy them to CD?
> Sure – as long as they were obtained legitimately – either purchased from an online store or from a legal free download. As we said before, if you own the music legally then you can make a copy.
This seems to underline my thinking that copyright is mostly about sharing with 3rd parties. Not about the act of duplicating bytes in a technical manner or about what you do with your copy of the bytes.
From the first section of the consumer.org.nz page (emphasis mine): "Can I make a copy of albums I own?
Yes. A recent amendment to the Copyright Act means that you can copy music – in the Act as 'sound recordings' – you own for your own personal use.". No such amendment would be needed if copyright was about sharing. Same with the temporary copies in RAM.
The license talks about "reproducing", which is copying.
You seem to be basically saying that you think Meta's lawyers and Creative Commons's lawyers don't know how to do their jobs (and can be trivially out-thought in a couple of minutes by laypeople), which seems very unlikely.
I don't know why you're so insistent on asking a non-lawyer for legal advice. The license says "NonCommercial means not primarily intended for or directed towards commercial advantage or monetary compensation", so it depends on your reasons for reading the code (and it doesn't matter whether you do it at work or elsewhere).
The part you quoted is from the definition of "NonCommercial". The way "NonCommercial" is then used is:
> the Licensor hereby grants You ... to ...
> a. reproduce and Share the Licensed Material, in whole or in part, for NonCommercial purposes
> b. produce, reproduce, and Share Adapted Material for NonCommercial purposes
As you can see, it always talks about sharing the content, which can only be done for NonCommercial reasons. If it applied not only to sharing but to any type of usage, why would sharing be mentioned in every paragraph?
Also, in my experience, a lawyer would use "or Share" and not "and Share" if they wanted to express that the act of "reproducing" alone (whatever that means) is enough to fall under the license, even if no sharing occurred.
So my feeling is still that the license deals with sharing. Not with the act of using.
I know this is a bit of a meme now in HN, and I agree that GPLv3 is certainly a much better license. However to push back a bit, I would say that at least the license allows researchers to do the important work that needs to be done.
It seems that there are a lot of grifters that are making "small businesses" off of LLMs that are pushing the 'not real open source' narrative a bit too much.
Open source for allowing research is far more important than allowing some group of script kiddies trying to make a silly LLM app to make a quick buck. Something like GPLv3 is even better in this regard, since a company can be founded on it, but completely transparently. That is arguably a bit better than having a 100-LoC Python script under a completely opaque MIT license that someone is trying to sell to a company for $$, which is imo highly unethical (not novel or intellectually challenging in the slightest, and therefore not valuable).
So yes, open source should be what we strive for, but research is the most important aspect of OSS.
> I know this is a bit of a meme now in HN, and I agree that GPLv3 is certainly a much better license.
The GPLv3 is certainly an open source license, sure.
Better in blanket terms is...not the point I’m making, and I am not arguing that, in this space, all open source licenses are categorically better than non-open source licenses.
But I do think it's important to describe licenses accurately and understand the implications of particular licenses.
> However to push back a bit, I would say that at least the license allows researchers to do the important work that needs to be done.
Yes, as far as sharing research goes, this is worlds better than OpenAI. And it's worth noting that, while its usage restrictions aren’t as competition-restricting, the most widely touted successful “open source” model (Stable Diffusion) is also not strictly open source (the license has usage restrictions), though there are some notable truly-open-source models.
If you can read the source code, it is source-available. This model is source-available but not open source because CC-BY-NC 4.0 restricts the way the software can be used (noncommercial use only). This is contrary to the Open Source Definition's "No Discrimination Against Fields of Endeavor" clause:
> 6. No Discrimination Against Fields of Endeavor
> The license must not restrict anyone from making use of the program in a specific field of endeavor. For example, it may not restrict the program from being used in a business, or from being used for genetic research.
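For what it's worth, the distinction being drawn here can be sketched as a tiny check. This is only an illustration under loud assumptions: the `License` class, its two flags, and `classify` are hypothetical names, the flags just encode how each license is described in this thread, and passing clause 6 alone does not make a license open source (the OSD has other criteria too).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class License:
    # Hypothetical, simplified model of a license for illustration only.
    name: str
    source_readable: bool         # can you read the source/weights at all?
    restricts_field_of_use: bool  # e.g. "noncommercial only" or other use limits


def passes_osd_clause_6(lic: License) -> bool:
    """OSD clause 6: the license must not restrict any field of endeavor."""
    return not lic.restricts_field_of_use


def classify(lic: License) -> str:
    if not lic.source_readable:
        return "closed"
    if not passes_osd_clause_6(lic):
        return "source-available"
    # Passing clause 6 is necessary, not sufficient, for being open source.
    return "possibly open source"


# Flags follow how these licenses are described in the thread (an assumption,
# not a legal reading of the license texts).
LICENSES = [
    License("CC-BY-NC 4.0", source_readable=True, restricts_field_of_use=True),
    License("CreativeML Open RAIL-M", source_readable=True, restricts_field_of_use=True),
    License("MIT", source_readable=True, restricts_field_of_use=False),
    License("GPLv3", source_readable=True, restricts_field_of_use=False),
]

for lic in LICENSES:
    print(f"{lic.name}: {classify(lic)}")
```

The point of the sketch is only that "can I read it?" and "can I use it for anything?" are separate questions, and the Open Source Definition cares about both.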
Luckily to prevent everybody interpreting sounds and glyphs as they wish, we -as a civilization- agreed to adhere to certain standards of how things would be interpreted. The same way we agreed that "allow" means can but not must, we agreed what open source means. Just because you disagree doesn't mean you are in the right.
> Luckily to prevent everybody interpreting sounds and glyphs as they wish, we -as a civilization- agreed to adhere to certain standards of how things would be interpreted.
Exactly. I'm for using definitions of words as they are commonly understood. The other definition relies on arbitrary conditions totally unrelated to the English meanings of the words being used.
> Exactly. I'm for using definitions of words as they are commonly understood.
No, you are not. Open source in the context of computing has a definition that is commonly understood. Why then are you using a definition of that term that is not commonly understood?
Just open to viewing? Would you consider stolen and GitHub-posted code as open source?
Licenses adhering to the OSD generally ensure open viewing, open use, open modification and open distribution. Stripping that back to just viewing removes a large part of openness that "open source" has been built upon.
Interesting to see if/how this is different (or maybe built upon) Segment Anything[0].
For folks in the know: I often see segmentation models on video frames producing patchy results (see the DINOv2 video of the running dog: the body gets black patches randomly, so the segmentation fails for certain frames). What methods are folks using to deal with this: standard fine-tuning, or is there a way to "force" the area to be cleanly segmented (i.e., add a bounding box around the class to supplement the data)?
And is it something that can be implemented in foundation models, or are we always going to have patchy zero-shot results like this on video files?
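Not an answer from the DINOv2 authors, but one generic post-processing option for this kind of frame-to-frame flicker is temporal smoothing: let each pixel take the majority vote of its label over a few neighbouring frames. A minimal sketch, assuming you already have one binary mask per frame as a NumPy array; `smooth_masks` and its `window` parameter are hypothetical names, and this swaps in plain majority voting rather than anything model-specific.

```python
import numpy as np


def smooth_masks(masks: np.ndarray, window: int = 3) -> np.ndarray:
    """Majority-vote each pixel over a sliding temporal window.

    masks:  boolean array of shape (T, H, W), one binary mask per frame.
    window: odd number of frames to vote over (illustrative default).
    """
    if window % 2 == 0:
        raise ValueError("window must be odd")
    half = window // 2
    num_frames = masks.shape[0]
    # Repeat the first and last frames so every frame has a full window.
    padded = np.pad(
        masks.astype(np.int32), ((half, half), (0, 0), (0, 0)), mode="edge"
    )
    # For each frame, add up the votes of the `window` frames centred on it.
    votes = sum(padded[i:i + num_frames] for i in range(window))
    # A pixel is kept if more than half of the window voted for it.
    return votes > half


# A pixel that drops out for a single frame gets filled back in.
masks = np.ones((5, 2, 2), dtype=bool)
masks[2, 0, 0] = False
smoothed = smooth_masks(masks, window=3)
```

This only hides short dropouts (shorter than half the window) and will lag on fast motion, so it is a band-aid on top of the per-frame predictions rather than a fix in the model itself.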
It's probably a random person. But, as long as it's not another story about the Surgeon General saying how social media is harmful for kids, hobnobbing at the next COP conference next to oil companies, or getting gigadollar fines related to GDPR? Am I missing anything? Has there been any actual good news lately?
I am not an expert, but Meta might be missing out on a business opportunity by being so narrowly focused on social media. Over time, every company (Google, Apple, Amazon) diversified itself. With their AI firepower, they could have explored a number of new avenues. Not every business needs to start with a billion users.
Well, Meta invested 36 billion dollars between 2019 and Oct 2022 trying to diversify beyond social media apps into the metaverse and VR with Meta Quest and Meta Reality Labs. They've also invested in NFTs.
I feel Meta, Google and OpenAI are the real AI leaders of this industry. Definitely not Microsoft who is riding on integrations with OpenAI APIs. Just saying.
Microsoft made a massive investment in OpenAI and gets 75% of OpenAI's profit, at least until that earns them back the $10B invested. Microsoft, for now, from a financial perspective, is OpenAI rather than any old third-party integrator. They are the paymasters.