Computer power is not stagnating, but the availability of training data is. It's...

robwwilliams · on May 13, 2024

No: soon the wide wild world itself becomes training data. And for much more than just an LLM. LLM plus reinforcement learning: this is where the capacity of our in silico children will engender much parental anxiety.

Animats · on May 13, 2024

This may create a market for surveillance camera data and phone calls.

"This conversation may be recorded and used for training purposes" now takes on a new meaning.

Can car makers sell info from everything that happens in their cars?

abenga · on May 13, 2024

Well, this is a massively horrifying possibility.

diego_sandoval · on May 13, 2024

Agree.

However, I think the most cost-effective way to train for the real world is to train in a simulated physical world first. I would assume that Boston Dynamics does exactly that, and I would expect integrated vision-action-language models to first be trained that way too.

pixl97 · on May 13, 2024

That's how everyone in robotics is doing it these days.

You take a bunch of mo-cap data and simulate it with your robot body. Then you do as much testing as you can with the robot and feed the behavior back into the model for fine-tuning.

Unitree gives an example of the simulation versus what the robot can do in their latest video:

https://www.youtube.com/watch?v=GzX1qOIO1bE

diego_sandoval · on May 13, 2024

I don't think training data is the limiting factor for current models.

emporas · on May 13, 2024

It is a limiting factor, due to diminishing returns. A model trained on double the data will be 10% better, if that!

When it comes to multi-modality, training data is not limited, because there are many different combinations of language, images, video, sound, etc. Microsoft did some research on that, teaching spatial recognition to an LLM using synthetic images, with good results. [1]

When someone states that there is not enough training data, they usually mean code, mathematics, physics, logical reasoning, etc. On the open internet right now, there is not enough code to make a model 10x better, 100x better, and so on.

Synthetic data will be produced, of course; scarcity of data is the least worrying scarcity of all.
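To put rough numbers on the diminishing-returns point, here is a back-of-the-envelope sketch. It takes the "10% better per doubling" figure at face value and assumes, purely for illustration, that the gains compound multiplicatively on some unspecified quality metric. The function names are hypothetical and nothing here is a measured scaling law.

```python
import math

def doublings_needed(target_gain: float, gain_per_doubling: float = 0.10) -> float:
    """Number of data doublings needed to reach `target_gain` (e.g. 10 for "10x better").

    Assumption (not from any benchmark): each doubling of training data
    multiplies quality by (1 + gain_per_doubling), and the gains compound.
    """
    return math.log(target_gain) / math.log(1.0 + gain_per_doubling)

def data_multiplier(target_gain: float, gain_per_doubling: float = 0.10) -> float:
    """How many times more data is needed under the same assumption."""
    return 2.0 ** doublings_needed(target_gain, gain_per_doubling)

if __name__ == "__main__":
    for gain in (2, 10, 100):
        print(f"{gain}x better: {doublings_needed(gain):.1f} doublings, "
              f"{data_multiplier(gain):.3g}x the data")
```

Under that assumption a 10x improvement takes about 24 doublings, that is, millions of times more data, which is the sense in which the open internet runs out of code long before the model gets 10x better.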

Edit: citation added.

[1] VoT by MS https://medium.com/@multiplatform.ai/microsoft-researchers-p...

diego_sandoval · on May 14, 2024

> A model trained on double the data will be 10% better, if that!

If the other attributes of the model do not improve, sure.

MVissers · on May 13, 2024

Soon these models will be cheap enough to learn in the real world. Reduced costs allow for usage at massive scale.

Releasing models with which users can record video means more data. Users conversing with AI is also additional data.

Another example is models that code, then debug the code and learn from that.

This will be everywhere, and these models will learn from anything we do/publish online/discuss. Scary.

Pretty soon– OpenAI will have access to

bigyikes · on May 13, 2024

It isn’t clear that we are running out of training data, and it is becoming increasingly clear that AI-generated training data actually works.

For the skeptical, consider that humans can be trained on material created by less intelligent humans.

rglullis · on May 13, 2024

> humans can be trained on material created by less intelligent humans.

For the skeptics, "AI models" are not intelligent at all so this analogy makes no sense.

You can teach lots of impressive tricks to dogs, but there is no amount of training that will teach them basic algebra.