
Whoa, this is extremely impressive. Quotes from the BBC article:

> "For the first time, we actually have a system which is able to build its own understanding of how the world works, and use that understanding to do this kind of sophisticated look-ahead planning that you've previously seen for games like chess.

> "[It] can start from nothing, and just through trial and error both discover the rules of the world and use those rules to achieve kind of superhuman performance."

> [...] MuZero is effectively able to squeeze out more insight from less data than had been possible before, explained Dr Silver.

https://www.bbc.com/news/technology-55403473

It seems like we're getting much closer to artificial general intelligence from two directions: reinforcement learning (such as MuZero), and sequence prediction (such as GPT-3 and iGPT). Very interesting times to be in the AI field.



I've noticed that all the top-performing reinforcement learning algorithms I hear about know next to nothing about the rules at the start. And not only do they perform as well as more supervised methods, they perform much better.

The one exception is self-driving. I recently listened to the Lex Fridman interview with the CEO of Waymo, and he made a case for the structured approach (e.g. separating detection from decision making and planning) and pushed back against the end-to-end approach that makes no preconceived assumptions about the environment. As an example he cites red lights: they're clearly human-engineered signals, so it makes sense to have a module that can explicitly detect the signal, as opposed to learning the behavior.

But the same is true of other games, and end-to-end methods still outperform there. Which makes me ask: is end-to-end learning an inevitability for self-driving as well, or is this one domain special due to its complexity or other aspects?


A machine controlling a real car gathers feedback no faster than real time and cannot afford to learn the meanings of street signs from the consequences of ignoring them. A machine can learn the rules of Atari games from scratch by playing them orders of magnitude faster than real time and treating "death" as one signal among many.

In order for a machine to learn driving the same way it learns Atari games, it seems that it would need an extremely high fidelity virtual environment to learn in. The high fidelity requirement would necessitate a lot of up-front investment in trying to get the simulations right. You might spend a whole career just trying to build a drivable Virtual Philadelphia as challenging as the real thing. The details would also make it much more expensive to run training sessions at high multiples of real time.

Given those factors, I'm not surprised that self-driving vehicle experiments just use real environments and don't try to learn the fundamental rules from scratch. But it's an interesting point that these choices may make it harder for agents to keep improving.


There's also a middle ground, which AlphaGo leveraged: bootstrapping learning from actual human gameplay by predicting human actions. That's what comma.ai does, and AlphaGo still performed significantly better than approaches built on explicit rules or higher-level abstractions. That plus a mix of simulations might yield better results.
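
For anyone curious what that bootstrapping step looks like mechanically, here's a minimal runnable sketch of behavioral cloning (supervised prediction of the human's action from the observed state) on synthetic data. All names, dimensions, and numbers are illustrative, not anything from comma.ai or DeepMind:

    import numpy as np

    # Learn from recorded human play: predict the human's action
    # from the observation, with one logistic unit and plain
    # gradient descent on synthetic stand-in data.
    rng = np.random.default_rng(0)
    states = rng.normal(size=(1000, 16))          # recorded observations
    actions = (states @ rng.normal(size=16) > 0)  # human's binary choices

    w = np.zeros(16)
    for _ in range(200):
        p = 1 / (1 + np.exp(-(states @ w)))       # predicted P(action=1)
        w -= 0.1 * states.T @ (p - actions) / len(states)

    print(((states @ w > 0) == actions).mean())   # imitation accuracy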


If you could make a virtual environment that was that good, you would have already solved the self driving problem.

I wonder if Google would be willing to pay people to add cameras to their cars to collect real-world data at a far larger scale.


I do believe that's the basis of Tesla's play in the space: a bet that enough cameras collecting real-world data can beat out a dedicated from-scratch self-driving system in deployment.

Whether this will work remains very much an open question.


That might be happening now already. I think Google sourced their dataset from dashcam videos uploaded to YouTube?


Really not true. We have all the tools available... Just model a city in Unreal Engine 4 (GTA 6, anyone?). Your sensors, like LIDAR, could do ray queries; cameras could render views; and so on. It's not 100% real, but probably real enough to learn the basics. Photorealistic graphics should be sufficient for an AI to learn real-world interaction. In the end our whole world might be just that, a simulation, so I see zero reason why we couldn't train a self-driving car AI in one.
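
To make the ray-query idea concrete, here's a toy sketch of a simulated LIDAR sweep against a trivial scene (a single wall at x = 5). A real engine would answer each ray with a full scene intersection test; everything here is illustrative:

    import numpy as np

    def lidar_scan(pos, n_rays=8, max_range=30.0):
        hits = []
        for theta in np.linspace(0, 2 * np.pi, n_rays, endpoint=False):
            d = np.array([np.cos(theta), np.sin(theta)])
            # distance along this ray to the wall at x = 5, if it hits
            t = (5.0 - pos[0]) / d[0] if abs(d[0]) > 1e-9 else np.inf
            hits.append(min(t, max_range) if t > 0 else max_range)
        return np.array(hits)

    print(lidar_scan(np.array([0.0, 0.0])))  # one 8-ray sweep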

And the coolest thing is that this could also serve as the basis for a AAA video game, and those tend to make billions these days, so it's a win-win for everyone. AI companies with funding should invest heavily in virtual reality and gaming, because they will need to perfect those to train their models.


The problem is that these simulations are far from perfect, and AIs are great at exploiting paths of least resistance. So you'd end up with an AI that could drive through your virtual city flawlessly by overfitting to cues you wouldn't even notice, but which are reliable enough because of the necessarily lower complexity of the simulation.

It wouldn't translate to the real world, unless maybe you add enough noise to the simulation to prevent the AI from relying on overly simple cues; but at that point, it's questionable how much insight the AI could still distill from the simulation.
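
The noise idea has a name, by the way: domain randomization. Randomize the simulator's nuisance parameters every episode so the policy can't latch onto sim-specific cues. A trivial sketch, with parameter names that are purely illustrative:

    import random

    def randomized_episode_config():
        # draw fresh nuisance parameters for each training episode
        return {
            "sun_angle":     random.uniform(0, 180),    # lighting
            "texture_seed":  random.randrange(10_000),  # surface detail
            "camera_jitter": random.gauss(0.0, 0.02),   # sensor mounting
            "friction":      random.uniform(0.6, 1.0),  # road surface
        }

    print(randomized_episode_config())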


In that case, I would like to see what the AI can learn in Euro Truck Simulator 2.


Minor tangent, but I think you’re referring to this video[1] with Dmitri Dolgov. He’s the CTO of Waymo, not CEO.

[1] https://youtu.be/P6prRXkI5HM


What does it mean to “not be given the rules”? If you set a child down in front of a chess board with the pieces nearby and they are not aware of the rules, I doubt they’d ever figure out how to play even a single correct game of chess. Heck, the child may decide to put the pieces in their mouth or dress them up as make-believe characters.

Without any concept of the rules you have no way of even knowing that you’ve set up the pieces for a legal starting position, never mind executing a legal move to open the game.

This is really bizarre.


This is explained in Appendix A of the paper ("Comparison to AlphaZero"): https://arxiv.org/pdf/1911.08265.pdf

Basically, AlphaZero was provided with a simulator that was able to distinguish legal and illegal moves and determine which future game states would be wins or losses. This was used to generate the search tree of possible states and actions.

MuZero doesn't have access to a simulator; it only has access to its direct environment. MuZero excludes actions that are immediately illegal, which solves the problem you mention in your penultimate paragraph, but it needs to learn the game's dynamics in order to determine which future moves and states are possible.
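
A minimal runnable sketch of that distinction, with the paper's three learned functions (representation h, dynamics g, prediction f) stood in by random linear maps. The dimensions and names are illustrative, not the paper's actual architecture:

    import numpy as np

    rng = np.random.default_rng(0)
    H = rng.normal(size=(8, 8))        # h: observation -> hidden state
    G = rng.normal(size=(9, 8))        # g: (hidden, action) -> next hidden
    F = rng.normal(size=(8, 4))        # f: hidden state -> policy logits

    def represent(obs):                # s0 = h(o)
        return obs @ H

    def dynamics(s, a):                # s_k = g(s_{k-1}, a)
        return np.append(s, a) @ G

    def predict(s):                    # p = f(s_k)
        return s @ F

    # AlphaZero would call a real simulator at each step of the search;
    # MuZero unrolls its own learned model instead, never touching the
    # environment while it plans.
    s = represent(rng.normal(size=8))
    for a in [0, 1, 2]:                # an imagined action sequence
        s = dynamics(s, a)
    print(predict(s))                  # evaluation of the imagined state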


The point is this:

> AlphaZero used the set of legal actions obtained from the simulator to mask the policy network at interior nodes. MuZero does not perform any masking within the search tree, but only masks legal actions at the root of the search tree where the set of available actions is directly observed. The policy network rapidly learns to exclude actions that are unavailable, simply because they are never selected.

MuZero still masks legal moves, but only at the root. All its parts are eventually trained on the output of its root, and so learn the legal moves.

They justify this root-level masking by pointing out that the Atari console will only let you perform legal moves, while a weak enough player may consider illegal moves while planning in their head.

The main thing that's slightly "hidden under the rug" is that for "masking" to make sense in the first place, MuZero needs to know the set of all moves that may be legal at some point in the game.
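
In code, the root-only masking amounts to something like this sketch (illustrative, not the paper's implementation):

    import numpy as np

    def root_policy(logits, legal):
        # 'legal' is observed directly from the environment, and is
        # only available at the root of the search tree
        masked = np.where(legal, logits, -np.inf)
        e = np.exp(masked - masked.max())
        return e / e.sum()

    # Interior nodes use the raw, unmasked policy; illegal actions are
    # never chosen at the root, so training never reinforces them and
    # their probability decays on its own.
    logits = np.array([1.0, 2.0, 0.5, -1.0])
    legal = np.array([True, False, True, True])
    print(root_policy(logits, legal))  # action 1 gets exactly zero mass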


Oh okay. So it's a technique that essentially allows the tree search to be less "pedantic" about the rules in future states. Very interesting.

I would love to see how this might go for more complicated games such as NES adventure games and RPGs.


> while a weak enough player may consider illegal moves while planning in your head

This isn't just weak players. E.g. strong chess players often consider moves as if blocking pawns weren't there. They might consider a bishop to be on a strong diagonal despite there being a blocking pawn because they can imagine moves that would happen if that pawn would disappear.


I suppose you are right. But MuZero won't be able to do this, since its training forces it to consider only legal moves in its planning.


No, it doesn't. MuZero does its planning entirely in its own latent space (it may not even think of the game in terms of 'moves' at all, but in whatever steps it considers relevant), and only the output is filtered for legal moves.

It's no different than a monkey operating a chess computer that makes sure the monkey only performs legal moves. Your suggestion would be akin to suggesting that the chess computer would be affecting the monkey's mind so that it can only think in terms of legal chess moves.


It seems you could equivalently treat rule-breaking as a loss; any algorithm sophisticated enough to learn how to win will also learn to avoid breaking the rules.
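
A toy sketch of that alternative, with illustrative numbers: rather than masking, let an illegal action end the episode with a large penalty, so rule-breaking is just a very bad move.

    # A tiny counter "game": only moves that keep the counter in
    # [0, 10] are legal; reaching 10 wins.
    def step(counter, delta):
        nxt = counter + delta
        if nxt < 0 or nxt > 10:                 # rule broken
            return counter, -100.0, True        # heavy loss, episode over
        return nxt, (1.0 if nxt == 10 else 0.0), nxt == 10

    print(step(0, -1))   # (0, -100.0, True): the learner feels the rule
    print(step(9, +1))   # (10, 1.0, True): a win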


Interestingly enough, this is exactly how one of the world chess champions, Jose Raul Capablanca, was said to have learned chess as a child.

It may be true, or perhaps it was a story concocted in order to emphasize his innate talent.


>> "[It] can start from nothing, and just through trial and error both discover the rules of the world

Unless they've changed a lot of things since the original paper, this is a bit exaggerated.

MuZero learns what moves are allowed in a given position/situation, but it still needs to know a finite overall set of possible actions.

E.g. for chess, it isn't told which forty or so moves are available at each point in its search tree, but it still knows to consider only 64x64 discrete options.


> only consider 64x64 discrete options

It wouldn't know that moving its King from e1 to g1 must be accompanied by moving its rook from h1 to f1.

Or that moving a white pawn from e5 to d6 must in some cases be accompanied by removing the black pawn on d5.

I guess the environment does these in response. But that doesn't suffice for promotion: moving a pawn from b7 to b8 must be accompanied by replacing said pawn with some other piece.
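
For concreteness, a bare from-square x to-square encoding looks like the sketch below. The rook hop in castling and the en-passant capture can be side effects of a single from->to action, but promotion genuinely needs extra action types; that's why AlphaZero (and MuZero for chess) actually use a richer 8x8x73 move encoding with dedicated underpromotion planes.

    # 0..63 square indices: a1 = 0, b1 = 1, ..., h8 = 63 (illustrative)
    def encode(frm, to):
        return frm * 64 + to         # 4096 discrete from->to actions

    def decode(action):
        return divmod(action, 64)

    # e1 = 4, g1 = 6: castling is one action; the rook's h1 -> f1 hop
    # is applied by the environment as a side effect
    print(decode(encode(4, 6)))      # (4, 6)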



