
> Open source AI models allow for greater transparency, collaboration, and innovation by making the underlying code publicly accessible and modifiable.

I feel like open models do virtually nothing for transparency, collaboration, or innovation, and are only modifiable in the sense that they can be fine-tuned (a rough sketch of what that means is below). It's "open source" training processes and data that will lead to "transparency, collaboration, and innovation", and I'm unaware of any large company that opens those up.

Am I wrong?
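
For what "can be fine-tuned" looks like in practice: a minimal sketch using the Hugging Face transformers and peft libraries (both assumed installed; the checkpoint name is just an example of publicly released weights, not the only option).

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    # Load an openly released checkpoint (example name; any open model works).
    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

    # Attach LoRA adapters: only the small low-rank matrices get trained,
    # while the released weights themselves stay frozen.
    config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
    model = get_peft_model(model, config)
    model.print_trainable_parameters()  # only a tiny fraction is trainable

Note that nothing in this process tells you how the weights were produced, which is the point.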



Correct. Worse, there are models being touted as "open source" that disallow a whole range of uses and ship with their own custom licensing (look at what Falcon originally had, or Meta's models with their specific commercial carveouts). We need an rms for the new age to call out these fake OSS approaches; they feel more like they're done for the OSS marketing shine than out of any commitment to actually being free and open.

Your "source" is not open nor is it transparent if training code, original dataset, model architecture details, and training methodology are not all there.


Closest to this would be https://www.eleuther.ai, whose training data is largely public and whose training processes are openly discussed, planned, and evaluated on their Discord server. Much of their training dataset is available at https://the-eye.eu (though their onion link is considered "primary" due to copyright concerns).


Is our dataset still available? I thought it was taken offline.

Where do you go under that link to get it?

E.g. https://the-eye.eu/public/AI/pile/readme.txt says it’s gone (and "old news"? I disagree).


There are still plenty of reliable sources for magnet links to The Pile, e.g. [1]. The DMCA takedowns are just a minor inconvenience.

1: https://web.archive.org/web/20230820001113/https://academict...


Thank you. How’d you dig this one up?


[1] is the first result if I google "the pile torrent". It doesn't link to the torrent because of a DMCA notice, so I just used the Wayback Machine to retrieve a version from before the date of that notice (sketch below). Don't tell the publisher.

1: https://academictorrents.com/details/0d366035664fdf51cfbe9f7...
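
For anyone who wants to repeat the trick: a rough sketch using the Wayback Machine's public availability API (Python with requests assumed; the torrent-page URL is a placeholder, not the real ID from the link above).

    import requests

    # Placeholder for the academictorrents details URL referenced above.
    page = "https://academictorrents.com/details/<torrent-id>"

    # Ask for the snapshot closest to a date before the DMCA notice
    # (timestamp format is YYYYMMDDhhmmss).
    resp = requests.get(
        "https://archive.org/wayback/available",
        params={"url": page, "timestamp": "20230801"},
        timeout=30,
    )
    snapshot = resp.json().get("archived_snapshots", {}).get("closest")
    if snapshot and snapshot.get("available"):
        print("Archived copy:", snapshot["url"])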


Frustratingly, they scan my comments, but hopefully they won't bother filing a DMCA for that.

(Seeing "sillysaurusx" appear in print on official court documents was pretty amusing out of context, though.)


Shawn, there is a mildly redacted version available at https://huggingface.co/datasets/monology/pile-uncopyrighted
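
If it helps, a minimal sketch of streaming it with the Hugging Face datasets library (assumed installed; streaming avoids downloading the whole thing up front).

    from datasets import load_dataset

    # Stream the redacted Pile rather than fetching it in full.
    ds = load_dataset("monology/pile-uncopyrighted", split="train", streaming=True)

    # Peek at a few documents; each record should carry a "text" field,
    # assuming the original Pile layout is preserved.
    for i, example in enumerate(ds):
        print(example["text"][:200])
        if i >= 2:
            break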


Thank you.


You're correct if you focus exclusively on the work of building foundation models in the first place. But if you take a broader view, having open models that we can legally fine-tune and hack with locally has created a large and ever-growing community of builders and innovators that could not exist without these open models. Just look at projects like InvokeAI [0] in the image space, or especially llama.cpp [1] in the text-generation space. These projects are large, have lots of contributors, move very fast, and drive a lot of innovation and collaboration in applying AI to various domains, in a way that simply wouldn't be possible without the open models. (A quick sketch of what that local hacking looks like follows the links below.)

[0] https://github.com/invoke-ai/InvokeAI

[1] https://github.com/ggerganov/llama.cpp
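
To make the "hack with locally" part concrete: a minimal sketch via llama.cpp's Python bindings (pip install llama-cpp-python; the GGUF path is a placeholder for whatever openly released quantized checkpoint you have on disk).

    from llama_cpp import Llama

    # Load a quantized open checkpoint from disk; the path is a placeholder.
    llm = Llama(model_path="./models/llama-2-7b.Q4_K_M.gguf", n_ctx=2048)

    # Completion runs entirely on local hardware, no API or cloud involved.
    out = llm("Q: What are open-weight models good for? A:", max_tokens=64)
    print(out["choices"][0]["text"])

None of this would be possible without the weights being released.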


Taking a broader view of that nature feels like an attempt to change the narrative.

The entire point of having transparency is to build those foundations so they don't inherit the biases of humans, for starters. Right now we have zero introspection into this and no ability to improve upon it with the widely deployed models in use today. That has already created problematic situations, to say nothing of the problems that haven't even been discovered yet.

Transparency here is a very good thing: it helps prevent AI from inheriting negative human ideas and biases, and it broadens access to improving the training data in a way that benefits everyone.


I think they're parallel concerns, and everyone has their own priorities. Openness of the models and their training is important, but for most people it wouldn't really matter anyway, because they can't afford the computing power to do their own training.

I care about all that in the abstract but what I can download and use on my computer is more concrete and immediate.


I'm a big believer in not allowing the pursuit of perfection to cause us to lose sight of the good things that we have.

Yes, these open models could stand to be more open and I hope that we'll see that in the future. But at the same time I'm extremely grateful to the companies who have released their weights under reasonable terms. Them doing so has undeniably led to an enormous amount of innovation and collaboration that would not have been possible without the weights.

If we constantly downplay and disparage the real efforts that companies make to release IP to the world because they don't go as far as we'd like, we're setting ourselves up for a world where companies don't release anything at all.


>Yes, these open models could stand to be more open and I hope that we'll see that in the future

The operative word here is "hope," which means we may not see more get open-sourced over time, especially if there is no pressure on companies to do so.

>If we constantly downplay and disparage the real efforts that companies make to release IP to the world because they don't go as far as we'd like, we're setting ourselves up for a world where companies don't release anything at all.

I don't mean any of this as disparagement or downplaying, but companies aren't releasing this stuff because it makes everyone feel good. It's a tactic. They only open source something because they expect to get something out of it. That's fine; I'm all for it. It's a valid reason, and often it can be a two-way street.

What it isn't, though, is any company saying "we are open sourcing this today because we want to encourage more transparency and auditability as AI takes on more critical roles in society, to ensure, to the best of our ability and the ability of our community, that it does not inherit negative human biases in the domains where it's being applied."


> companies aren't releasing this stuff because it makes everyone feel good. It's a tactic.

It's a tactic, but one of the primary reasons to expect it to be effective is building goodwill in the community. If the goodwill dries up then most of the reason to open anything up is gone.


Goodwill doesn't equal transparency or auditability, which are the core concerns that keep being raised around AI models and the training of those models.


As a layman in this field, I agree with you; I'd love input from specialists in the area.



