MuZero: DeepMind’s New AI Mastered More Than 50 Games

Published by Jan Heaney on

Dear Fellow Scholars, this is Two Minute Papers
with Károly Zsolnai-Fehér. Some papers come with an intense media campaign
and a lot of nice videos, and some other amazing papers are at the risk of slipping under the
radar because of the lack of such a media presence. This new work from DeepMind is indeed
absolutely amazing, you’ll see in a moment why, and is not really talked about. So in
this video, let’s try to reward such a work. In many episodes, you get ice cream for your
eyes, but today, you get ice cream for your mind. Buckle up.

In the last few years, we have seen DeepMind’s
AI defeat the best Go players in the world, and after OpenAI’s venture in the game of
DOTA2, DeepMind embarked on a journey to defeat pro players in Starcraft 2, a real-time strategy
game. This is a game that requires a great deal of mechanical skill, split-second decision
making and we have imperfect information as we only see what our units can see. A nightmare
situation for any AI. You see some footage of its previous games here on the screen.

And, in my opinion, people seem to pay too
much attention to how well a given algorithm performs, and too little to how general it
is. Let me explain.

DeepMind has developed a new technique that
tries to rely more on its predictions of the future, and generalizes to many many more
games than previous techniques. This includes AlphaZero, a previous technique also from
them that was able to play Go, Chess, and Japanese Chess or Shogi as well and beat any
human player at these games confidently.

This new method is so general that it does
as well as AlphaZero at these games, however, it can also play a wide variety of Atari games
as well. And that is the key here: writing an algorithm that plays chess well has been
a possibility for decades. For instance, if you wish to know more, make sure to check
out Stockfish, which is an incredible open-source project and a very potent algorithm. However,
Stockfish cannot play anything else – whenever we look at a new game, we have to derive a
new algorithm that solves it. Not so much with these learning methods, which can generalize
to a wide variety of games! This is why I would like to argue that the generalization
capability of these AIs is just as important as their performance. In other words, if there
were a narrow algorithm that is the best possible Chess algorithm that ever existed, or a somewhat
below world-champion level AI that can play any game we can possibly imagine, I would
take the latter in a heartbeat.

Now, speaking about generalization, let’s
see how well it does at these Atari games, shall we? After 30 minutes of time on each
game, it significantly outperforms humans on nearly all of these games, the percentages
show you here what kind of outperformance we are talking about. In many cases, the algorithm
outperforms us several times, and up to several hundred times. Absolutely incredible.

As you see, it has a more than formidable
score on almost all of these games, and therefore it generalizes quite well. I’ll tell you
in a moment about the games it falters at, but for now, let’s compare it to three other
competing algorithms. You see one bold number per row, which always highlights the best
performing algorithm for your convenience. The new technique beats the others on about
66% of the games, and that includes R2D2, short for Recurrent Replay Distributed DQN.
Yes, this is another one of those crazy paper names. And even when it falls short, it is
typically very close. As a reference, humans triumphed on less than 10% of the games.

We still have a big fat zero on Pitfall and
Montezuma’s Revenge games. So why is that? Well, these games require long-term planning,
which is one of the more difficult cases for reinforcement learning algorithms. In an earlier
episode, we discussed how we can infuse an AI agent with a curiosity to go out there
and explore some more with success. However, note that these algorithms are more narrow
than the one we’ve been talking about today. So there is still plenty of work to be done,
but I hope you see that this is incredibly nimble progress on AI research. Bravo DeepMind!

It seems like DeepMind and OpenAI will be the big players in the quest towards AGI.

Peter Wilkinson · January 7, 2020 at 5:22 pm

are there videos of it playing any of these games? I would love to watch it play 😀

Michael Pierce · January 7, 2020 at 5:29 pm

Have you looked at the Baritone AI for Minecraft? I don't think its authors are publishing papers, but they are distributing a lot of public threads. It uses a lot of exploration in order to learn behavior, and there is a private thread of a semi-related AI that's mastering PVP on anarchy servers.

Henry AI Labs · January 7, 2020 at 6:02 pm

Great video! I am exploring MuZero this week as well in a series going from AlphaGo -> AlphaGo Zero -> AlphaZero -> MuZero! I hope attention around MuZero and these algorithms will also inspire more people to participate in Kaggle's Connect X 1st RL Competition!

Benjamin Anderson · January 7, 2020 at 8:58 pm

I just want to point out that when you make the distinction between narrow and general algorithms and say that you'd take the less advanced general algorithm over a more advanced algorithm for a specific game, you're talking about it from a research perspective. You're making the assumption that "2 papers down the line" there will be an algorithm that will be just as general and more skilled than the current algorithm. However, if there were a game which for some reason was impossible to play or get good at using a generalized algorithm (I don't know if this is even possible but just consider it), then it would be necessary to have a specific algorithm for that game if humans wanted to create something more skilled than themselves. Yes, that algorithm would not advance AI research at all, but it would be very useful for people who only care about that game.

jnalanko · January 7, 2020 at 10:47 pm

Thanks for making the video, though I wish you went into more detail on how MuZero differs from AlphaZero and why it is able to generalize so well. How does it work? That would be the real ice cream for the soul. You're making me pull up the paper myself :).

Thanks for the video, love Károly Zsolnai-Fehér's enthusiasm for generalisation in this new year!
In case you want more detail, we covered this paper in our reading group:

Passe-Science · January 8, 2020 at 9:53 am

Performance and generalization, yes, but a very important aspect is also HOW it does things. So here are some additional details on what the paper is about:
Those AlphaGo-like agents used to have an external game simulator (to process the move they actually commit to playing) and an internal simulator (to process the moves they explore during the search to decide on the best option). A game simulator does several things:
-you cannot play an illegal move with it.
-it gives you the dynamic response of the rules to a move, for example capturing the stones your move actually captures.
-it gives you a terminal status: "this is a win, loss, or draw state".
The breakthrough of MuZero is that it is given NO internal simulator of the game; it has to learn its own representation of the game state and of the game dynamics when "playing in its head during search" (but it still has an external simulator to process the moves it commits to playing, which is therefore called only once per move of a training game).
So basically MuZero could:
-read illegal moves as a possibility.
-badly process the dynamics of its move (for example, it could forget to capture the stones actually captured by its move, as long as this happens in its head during search).
-miss that the game has ended (keep reading after a terminal state of the game).
So it's basically what every beginner in chess goes through at the beginning of their learning process. Yet it still manages to learn a nearly perfect and optimized model of the game, and uses it to master the game itself.
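
The split described in this comment, between searching inside a learned model and touching the real simulator only once per committed move, can be sketched roughly as follows. This is a toy illustration under loud assumptions, not DeepMind's implementation: `LearnedModel`, `plan`, `ToyEnv`, `play_episode` and `toy_model` are hypothetical names, the three model functions are hand-written stand-ins rather than trained networks, and a plain depth-limited lookahead stands in for the Monte Carlo tree search that AlphaZero-style agents use.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

State = Tuple[float, ...]  # hidden state; a learned vector in a real agent


@dataclass
class LearnedModel:
    """The three pieces the agent would learn (stand-ins here, not networks)."""
    represent: Callable[[int], State]                      # observation -> hidden state
    dynamics: Callable[[State, int], Tuple[State, float]]  # (state, action) -> (next state, reward)
    predict: Callable[[State], float]                      # hidden state -> value estimate


def plan(model: LearnedModel, obs: int, actions: List[int], depth: int) -> int:
    """Choose an action by looking ahead inside the model only.

    There is no legality check and no terminal check: the search may read
    illegal moves or keep reading past the end of the game.
    """
    def lookahead(state: State, d: int) -> float:
        if d == 0:
            return model.predict(state)
        return max(
            reward + lookahead(nxt, d - 1)
            for nxt, reward in (model.dynamics(state, a) for a in actions)
        )

    root = model.represent(obs)
    scores = {}
    for a in actions:
        nxt, reward = model.dynamics(root, a)
        scores[a] = reward + lookahead(nxt, depth - 1)
    return max(scores, key=scores.get)


class ToyEnv:
    """Hypothetical external simulator: start at 0, add 1 or 2 per move,
    reward 1.0 for landing exactly on `target`, done once target is reached or passed."""

    def __init__(self, target: int = 3):
        self.target = target
        self.position = 0
        self.step_calls = 0

    def step(self, action: int) -> Tuple[int, float, bool]:
        self.step_calls += 1
        self.position += action
        reward = 1.0 if self.position == self.target else 0.0
        return self.position, reward, self.position >= self.target


def play_episode(env: ToyEnv, model: LearnedModel, actions: List[int], depth: int = 2) -> List[int]:
    obs, done, moves = env.position, False, []
    while not done:
        action = plan(model, obs, actions, depth)  # search happens "in its head"
        obs, _, done = env.step(action)            # the only simulator call for this move
        moves.append(action)
    return moves


def _toy_dynamics(state: State, action: int) -> Tuple[State, float]:
    # Assumption: hand-written to mimic ToyEnv with target 3; a real agent
    # would have to learn this mapping from experience.
    nxt = state[0] + action
    return (nxt,), 1.0 if nxt == 3.0 else 0.0


toy_model = LearnedModel(
    represent=lambda obs: (float(obs),),
    dynamics=_toy_dynamics,
    predict=lambda state: 0.0,
)

if __name__ == "__main__":
    env = ToyEnv(target=3)
    moves = play_episode(env, toy_model, actions=[1, 2], depth=2)
    print(moves, env.step_calls)
```

Note that `env.step` is called exactly once per committed move, while `plan` may call `model.dynamics` many times, including on positions past the end of the game. In this sketch the quality of the plan depends entirely on how accurate the stand-in model is.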

