One of DeepMind's latest papers, Open-Ended Learning Leads to Generally Capable Agents, explains how DeepMind produced agents that can successfully play games as complex as hide-and-seek or capture-the-flag without even having trained on or seen these games before.
As far as I know, this is an entirely unprecedented level of generality for a reinforcement-learning agent.
The following is a high-level summary of the paper, meant to be accessible to non-specialists, that should nevertheless produce something resembling a gears-level model. I want to focus on explaining the optimization process that produced this agent; on what the different parts of the optimization process are; on why each different part is necessary; and on what would happen if different parts of it were missing.
After that summary, I'll add a few more comments and questions about design choices within the paper and about future research I'd like to see. I'm far less certain about this second part, however.
I was going to include a part on AI timelines -- but whether this paper influences your timelines, and in what direction, depends on a lot of priors that are out-of-scope for what I want to do here.
Before we get into the optimization process of the agent, I need to talk about the environment within which the agent trained. Core to the project of this paper are the millions of dynamically-generated tasks on which the agent can train.
Each task in the XLand Unity-powered 3D environment space is defined by (1) a unique physical world and (2) a unique set of goals / rewards. Throughout what follows I refer to (1) as the "environment" or "world", to (2) as the "game", and to both of them together as a "task." Note that both of these can be generated programmatically without human intervention.
(The show reel of the trained agents operating on hand-made, held-out test tasks is worth watching for at least a few minutes to get a feel for the complexity possible from both world and goals, and is probably much clearer than my writing about the environment space. [I mean, if you want to understand soccer, watch a game of it, don't read a description.] Although you should note that the intelligibility of the goals in the video is uncharacteristic of the goals in the training tasks, because the show-reel goals were made by humans from human-intelligible games, rather than randomly generated.)
Anyhow. What kind of variety exists in this space?
Well, each world has a static landscape with dynamic, simulated rigid-body objects placed upon it. The topographical features of the world can be randomly mutated; the size of the world and lighting of the world can vary; the rigid-body cubes, pyramids, spheres, and slabs on the map can be randomly colored, sized, and placed.
Each game, with goals and rewards, can also be randomly generated. I'm not being redundant by talking about goals and rewards; each agent both receives information specifying what would cause it to be rewarded (the goal) and receives a numeric reward of 0 or 1 in each timestep.
The fundamental atoms for the definition of goals are atomic predicates, such as being "on", "near", "far", or "holding" something. These atoms can be applied to different entities to form sub-goals, such as "the player is on the yellow floor" or "the black cube is near the black pyramid". A complete goal is then represented as a set of options (disjunctions) over some set(s) of necessary predicates (conjunctions) -- a complete goal might be "(Hold a purple sphere AND be near a yellow cube) OR (See an opponent AND be near a black cube)." Obviously such goals can be randomly generated, and obviously there are a very large number of them.
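To make this structure concrete, here is a small sketch of how such disjunction-of-conjunctions goals could be generated and checked for reward. The relation and entity vocabularies here are illustrative stand-ins, not XLand's actual predicate set:

```python
import random

# Hypothetical vocabularies; the real XLand predicate/entity set is richer.
RELATIONS = ["on", "near", "far", "hold", "see"]
ENTITIES = ["yellow cube", "black cube", "purple sphere", "black pyramid", "opponent"]

def random_predicate(rng):
    """One atomic predicate, e.g. ('near', 'yellow cube')."""
    return (rng.choice(RELATIONS), rng.choice(ENTITIES))

def random_goal(rng, max_options=3, max_conjuncts=3):
    """A goal is a disjunction (list of options) over conjunctions (lists of predicates)."""
    n_options = rng.randint(1, max_options)
    return [
        [random_predicate(rng) for _ in range(rng.randint(1, max_conjuncts))]
        for _ in range(n_options)
    ]

def goal_satisfied(goal, true_predicates):
    """Reward is 1 iff some option has all of its predicates currently true."""
    return any(all(p in true_predicates for p in option) for option in goal)

rng = random.Random(0)
goal = random_goal(rng)
```

Because every level of the structure is just a random choice from a finite set, the number of distinct goals grows combinatorially, which is where the enormous game counts come from.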
The total environment space is then the combination of possible worlds with possible games, forming tasks. This space is enormous -- the paper says there are 10^16 possible world topologies, and 10^18 possible games.
I believe that the enormous number of possible tasks is really important if you're interested in machine learning as a path towards actual human-level intelligence. Why?
Let me back up a little with some machine learning 101.
In most non-reinforcement-learning, machine-learning tasks, you divide the data from which you learn into two sets: a training set and a test set. (Actually, at least three, but I simplify for now.)
Suppose you're training an image model to distinguish between hotdog and not-a-hotdog. If you had 1000 images with which to train, and you let it train on all of them, you'd be in a bad state. Even if it was able to successfully distinguish between hotdog and not-a-hotdog for every one of those images, you would not know if it could generalize to new images. It might have simply memorized which of those particular images were hotdogs and which were not, without learning to generalize to never-before-seen images. You'd be like a math teacher who discovered he had accidentally been using the same math problems in his blackboard demonstrations and on his tests.
To avoid this, you could divide your images into an 800-image training set and a 200-image test set. Suppose you train the model on the training set exclusively, without showing it the test set, until it gives you the right answer for all 800 images in the training set. You then check what results it gives you on the test set, without training on the test set. If the model has learned to figure out the Platonic form-of-a-hotdog from the training data, it might give you the right answer for 199 of your 200-image test set, indicating that your training has been a success. On the other hand, if it has simply memorized the data in your 800-image training set, it might do no better than random on the 200-image test, indicating that you have overfit to your training data.
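The split described above can be sketched in a few lines; this is a generic illustration, not any particular library's implementation:

```python
import random

def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle and split, so the model never trains on held-out examples."""
    rng = random.Random(seed)
    shuffled = items[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# 1000 hypothetical image identifiers -> 800 train / 200 test.
images = [f"img_{i}" for i in range(1000)]
train_set, test_set = train_test_split(images)
```

The crucial invariant is that the two sets are disjoint: any example the model is evaluated on must never have been used for training.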
This kind of process is absolutely fundamental to non-reinforcement-learning machine learning. The first thing you learn as an ML engineer is to always carefully divide your data into training and test sets (well, training, test, and validation sets, and sometimes more) and never to confuse them.
The problem with reinforcement learning is that until a few years ago, for 99% of the tasks, there was no distinction between training and test set. You could train an algorithm to play Breakout from the Atari Learning Environment, and if it did well -- it did well, and you wrote your paper, and didn't try your algorithm against a test game.
In retrospect this is somewhat shocking.
But to be fair, it is a little unclear what the distinction would be between a training task and a test task in reinforcement learning. What is a game that's similar to a game you train on, such that it's the same task in one sense (like identifying a hotdog) but nevertheless not the same task in another sense (like identifying a different picture of a hotdog)? When the answer rolled around, it was of course obvious: levels of a video game fill this notion fairly precisely.
OpenAI's ProcGen benchmark is a good example of such an environment: you can train on some arbitrary number of procedurally-generated levels in various video games, and then test your performance on a never-before-seen set of other procedurally-generated levels.
It became pretty clear from results on ProcGen (and elsewhere) that the existing RL algorithms were probably horribly, horribly overfit. In the case of ProcGen, OpenAI showed that an agent could train on 100 or even 1000 different levels of some task, learn to do very well on them all, but still do very, very badly on a never-before-seen level. This is obviously a problem if you want your RL agents to be steps on a staircase towards Artificial General Intelligence, rather than steps on a staircase towards making artificially shitty students in a class who only learn to parrot their teacher.
So the ability to randomly generate worlds and games in the XLand space is very important for this reason. It lets DeepMind check to see if their agents have actually learned transferable skills, or if they have instead memorized actions likely to result in high reward without truly learning how to navigate the environment.
DeepMind's paper specifies (appendix 3) that they created a list of test worlds (where a world is a topology and set of object placements) and test games (where a game is a set of rewards / goals), and never trained their agents on tasks (combined worlds and games) where either of these matched the held-out test worlds or games. This goes further than ProcGen -- while ProcGen varies the world but keeps the game the same, DeepMind varies both world and game.
I believe that this is the kind of thing you need to be doing if you're actually interested in taking steps towards general intelligence, so it is interesting to see this.
DeepMind describes the process of training an agent in this task space as involving 3 levels. I'm going to conceptually split their middle level into two, for a total of 4 levels, because I think this makes it easier to understand.
This 4-level stack looks to me essentially like a complication of the inner-and-outer loop optimization, probably best described in full generality by Gwern. And although this optimization process can be used in ML, it can be found naturally arising in the world as well.
The basic mechanism behind such a two-part optimization process is to have a quickly-acting, but unavoidably biased or inaccurate inner optimizer, which is evaluated by a slowly-acting, but unbiased and accurate outer optimization process.
Within the context of natural history, for instance, you can view human intelligence and evolution as this kind of two-level process optimizing reproductive fitness. Intelligence is the inner optimizer. It is fast; it allows you to come up with ideas like crop rotation, writing, irrigation, and so on, which allow the humans that implement these ideas to have lots of reproductive success. But it is also inaccurate (from the perspective of reproductive fitness); it allows you to come up with ideas like social media and online porn, which might cause the humans that implement these ideas to have less reproductive success. The fast inner loop of human intelligence is ultimately judged by the slower, outer loop of reproductive success or natural selection.
Similar dynamics can be found in the economic sphere, with internal-to-a-corporation-change (inner loop) and bankruptcy (outer loop).
The 3 or 4 level process that DeepMind creates is a little more complex, but mostly the same kind of thing.
Here's a description of the four levels. I'll go into each of them in more detail below -- this is just for initial orientation.
1. Reinforcement learning improves the performance of an agent over a particular mixture of XLand tasks. Note that this is done for many different agents all at once. But at this part of the optimization process, the agents are only causally connected as occasional opponents to each other in competitive tasks, or cooperating partners in cooperative tasks.
2. An agent-specific teacher adjusts the distribution of XLand tasks to ensure the right difficulty. That is, as the agent trains, a teacher adjusts the tasks to be neither too hard nor too easy for the agent.
3. Periodically, two different agent-teacher pairs are compared, and pairs that do better against a measure of general competence are copied and replace pairs that do worse. That is, natural selection occurs against our actual objective -- a carefully developed notion of general competence.
4. Periodically, replace all of our agents with a "new generation" of agents, initially trained to imitate the best agents from the prior generation. That is, we replace all of our current agents, because after training for some time neural-network-based agents seem to lose some of their flexibility and ability to learn; we bootstrap the new agents from the old so they learn faster this time around.
I'm going to dive into each of these steps in much more detail below.
I want to note something before I begin, though. None of these steps involve revolutionary changes; no Kuhnian paradigm shifts are involved in this paper. What this paper shows is what you can accomplish through combining current state-of-the-art techniques, making incremental progress, using good engineering, and throwing a little bit (though not that much, really) of money at compute.
It also doesn't by any means establish an upper limit to these things; I believe the results here clearly indicate you'd make further progress by scaling up along the same lines.
The innermost loop is the one that actually changes how agents act, by adjusting each agent's neural network to increase the likelihood of actions leading to reward.
Each agent runs through the (aforementioned) distribution of dynamically generated tasks within the space of possible tasks. In the beginning, each agent's actions are random -- each just flails around in the virtual worlds. The neural-network-parameterized policy which defines what actions the agent is likely to do is initialized with random values, and so initially outputs a mostly even distribution, with every action as likely as every other. Thus, the random flailing.
But this random action is nevertheless a kind of exploration. Some actions lead to more reward, and some actions to less. The kinds of actions followed by above-average reward are made more likely (when in similar situations), and the kinds of actions followed by below-average reward are made less likely (when in similar situations). Over time, the distribution of actions output by the neural network becomes less even, and begins to put more probability mass on the better actions, and less probability mass on the worse actions.
This general technique of thus adjusting the probability of actions directly is policy gradient optimization. The generalizing power of a neural network is used to help recognize "similar situations," as is the case for all neural-network-based optimization methods. There is a lot of complexity in the architecture of their neural network, which I have not addressed above, but this is nevertheless the core of it.
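To make the core idea concrete, here is a minimal REINFORCE-style policy-gradient update on a toy bandit problem. This is not V-MPO (the algorithm the paper actually uses), and it has none of the neural-network machinery; it only illustrates the mechanism just described: actions followed by above-average reward become more probable, actions followed by below-average reward less probable.

```python
import math
import random

def softmax(prefs):
    """Turn raw action preferences into a probability distribution."""
    exps = [math.exp(p) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_bandit(true_rewards, steps=5000, lr=0.1, seed=0):
    """Minimal policy gradient (REINFORCE with a running baseline) on a bandit.

    true_rewards[i] is the mean reward of action i; the agent only sees
    noisy samples, and must shift probability mass toward the best action."""
    rng = random.Random(seed)
    prefs = [0.0] * len(true_rewards)   # policy parameters (action preferences)
    baseline = 0.0                      # running average reward
    for _ in range(steps):
        probs = softmax(prefs)
        a = rng.choices(range(len(prefs)), weights=probs)[0]
        r = true_rewards[a] + rng.gauss(0, 0.1)     # noisy observed reward
        baseline += 0.05 * (r - baseline)
        advantage = r - baseline                    # above or below average?
        for i in range(len(prefs)):
            # Gradient of log pi(a) w.r.t. preference i: 1[i == a] - probs[i]
            grad = (1.0 if i == a else 0.0) - probs[i]
            prefs[i] += lr * advantage * grad
    return softmax(prefs)

probs = reinforce_bandit([0.2, 0.9, 0.4])
```

After training, nearly all the probability mass sits on the highest-reward action, even though the update rule never modeled the environment at all.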
By many measures of human intelligence, this is a very stupid algorithm. Notably, this kind of algorithm is model free; no component of it (at least directly) predicts what the environment or what an opposing agent will do. If you think that intelligence requires prediction then you'll think that this algorithm is missing very important parts of intelligence. Instead, it is mostly learning a mapping from situations to actions, where the generator of this mapping adjusts it to produce higher reward.
Note, though, that policy gradient methods (and other model free methods) can act as if they are predicting things. OpenAI Five was trained through policy optimization, and when playing, looked as if it was anticipating what its opponents would do. But in fact, it had merely encountered many similar situations with opponents, and thereby learned what kind of actions maximized reward in them. (This is leaving entirely aside more difficult mesa-optimization complications.)
The specific policy gradient method that DeepMind used is V-MPO, which DeepMind also created. They presumably chose this algorithm because tests in the paper introducing V-MPO show that it performs well in a multi-task setting -- i.e., DeepMind used it to train a single agent on all of the Atari-57 tasks, and found that its median performance was still better than human -- although the open-ended learning paper doesn't mention this motivation.
This innermost part is necessary, well, because without it no agent would ever learn or change in any way.
The second part of the optimization process is dynamic task generation.
The principal thing that dynamic task generation does is to slowly shift the distribution of tasks on which each agent trains, as each agent improves, so that the distribution is never too hard or too easy. (Bloom's 2-sigma problem anyone?)
The way it works is actually quite easy to understand.
A candidate task is generated, by sampling from the immense space of possible worlds and possible games in those worlds -- excluding those worlds and games in the test set, of course.
Then 20 episodes of the task are played through without training, to see if the task is suitable for training. 10 of these episodes are played through using the agent that might be trained on the task. 10 of these are played through with a control "agent" which acts randomly or does nothing at all. The candidate task is accepted as an actual training task if and only if the results of these twenty episodes meet three criteria:

1. The agent must not already do too well on the task -- the task cannot be too easy.
2. The agent must do meaningfully better than the control policy -- the task must be learnable at the agent's current level.
3. The control policy must do poorly -- the task cannot be solvable by blind luck.
There are constant values that determine the cutoffs for all of these filters. These constants answer questions like: what counts as doing too well, for criterion 1? How much better must the agent do than the control policy, for criterion 2? How poorly must the control policy do, for criterion 3? It is important to note that these values are themselves unique per agent; every agent has its own dynamic task generation filters.
These constants together define what I have sometimes referred to as the "teacher" above -- but teacher implies too much complexity. There's nothing to the teacher but these few numeric constants, together with how they are applied in the three rules above.
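A minimal sketch of such a filter might look as follows. The cutoff values here are illustrative placeholders, not the paper's actual constants -- as noted above, the real constants are per-agent and are themselves adjusted by the higher levels of the optimization process:

```python
def accept_task(agent_scores, control_scores,
                too_easy_cutoff=0.9, min_gap=0.1, control_cutoff=0.1):
    """Decide whether a candidate task is suitable for training this agent.

    agent_scores / control_scores: rewards from the 10 evaluation episodes
    played by the candidate agent and by the do-nothing/random control.
    The three cutoff constants are the whole of the 'teacher'.
    """
    agent_mean = sum(agent_scores) / len(agent_scores)
    control_mean = sum(control_scores) / len(control_scores)
    not_too_easy = agent_mean < too_easy_cutoff          # criterion 1
    beats_control = agent_mean - control_mean > min_gap  # criterion 2
    control_fails = control_mean < control_cutoff        # criterion 3
    return not_too_easy and beats_control and control_fails
```

Despite its simplicity, varying just these three constants moves the whole training distribution between "too easy", "just right", and "hopelessly hard" for a given agent.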
Why is this necessary?
Well, to take a step back, a big problem in RL is how you first get some signal from which to learn.
The majority of reinforcement learning algorithms don't learn at all until they've received some variance in the rewards that they receive -- some departure from normal, in either a positive or negative direction. Until they receive some signal in this fashion, they only perform random actions.
This, of course, presents a difficulty because many environments have extremely sparse rewards. Imagine trying to play a video game like StarCraft, but only doing random actions until you (by doing random actions) win your first game. Even with an enormous number of trials, you probably never would win, even if you played until the sun grew dim and the stars disappeared from the sky. And so you would therefore never receive any signal, and would never improve in any way.
There are a number of different ways to approach this problem.
One way is to pre-train agents to imitate humans. For DeepMind's AlphaStar Starcraft-playing agent, for instance, they trained an agent to simply imitate humans first. This at least gets the agent to the point where it is able to win games against easier opponents, after which reinforcement learning in general can start. But this obviously only works in domains where you have a large number of recorded human actions to imitate, which DeepMind does not have here.
Another way to address this problem is through attempting to give agents something like curiosity. Curiosity aims to lead the agent to explore the environment, in some better-than-random-fashion, even in the absence of an extrinsic reward. You could try to implement a curiosity module by rewarding agents for encountering states where they cannot predict the future well. Or you could do something more sophisticated, like learning an ensemble of models and then exploring the places where the models disagree. This method has the advantage, as far as I can tell, of being the most explicitly human-like approach to the task. But no one has really settled on the perfect implementation of curiosity.
DeepMind, of course, settles on curriculum learning, which, as the name implies, works something like school. You first give the agent a very easy task, which it might be able to do sometimes by random chance. Then you give it a slightly more difficult task, which it is able to do acceptably on because it is no longer doing entirely random actions. And so on. There are many ways to implement curriculum learning -- but DeepMind's is implemented through the dynamic task generation and 3-criteria filtering mechanism mentioned above.
So one function of the dynamic task generation is to provide curricular learning, so that agents can start learning at all. We know that this is important: DeepMind performed an ablation where they removed dynamic task generation, and found that learning proceeded much more slowly. When dynamic task generation was removed and some other tricks to speed up the initial learning were also removed, pretty much no learning occurred at all.
Dynamic task generation and filtering is not only important because of how it helps provide curricular learning though -- it's also important for how it interacts with the next stage, population-based training.
At this level of optimization, the population of agents starts to interact through something like natural selection. The basic notion of how this works is once again relatively simple.
Periodically during training, we compare two agents that we have not recently compared. The comparison takes place along a carefully-developed measure of general competence. This measure emphasizes things like "not failing catastrophically on as many tasks as possible" and "being pretty good at a large number of things" over things like "having a very high average score" and "doing very well at a limited number of tasks".
If one agent's performance dominates another agent's performance along this measure, then the better agent replaces the worse: the weights of the better neural network are copied over, together with the agent-specific task-generation hyperparameters (with random mutations to the hyperparameters).
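In sketch form, one such comparison-and-replacement step might look like the following. I use a single scalar `fitness` where the paper uses a multi-part comparison over normalized score percentiles, and the dictionary structure and mutation scheme are illustrative:

```python
import random

def pbt_compare(pop, i, j, mutate_scale=1.2, rng=None):
    """One population-based-training step: compare members i and j,
    and overwrite the worse with a mutated copy of the better.

    Each member is a dict: {'weights': ..., 'hyperparams': {...},
    'fitness': float}, where 'fitness' stands in for the paper's
    measure of general competence."""
    rng = rng or random.Random()
    better, worse = (i, j) if pop[i]["fitness"] >= pop[j]["fitness"] else (j, i)
    # Copy the better member's network weights wholesale...
    pop[worse]["weights"] = dict(pop[better]["weights"])
    # ...and its task-generation hyperparameters, each randomly
    # nudged up or down so the population keeps exploring settings.
    pop[worse]["hyperparams"] = {
        k: v * rng.choice([mutate_scale, 1.0 / mutate_scale])
        for k, v in pop[better]["hyperparams"].items()
    }
    pop[worse]["fitness"] = pop[better]["fitness"]
    return pop
```

Note that it is the agent-and-teacher *pair* that is copied: selection acts on the combination of network weights and task-filter constants, which is what lets the filters themselves evolve.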
Why is this necessary? Isn't each agent already being trained on a broad curriculum of tasks of generally increasing difficulty, which we would expect to lead to general competence in any case? What does the evolutionary selection give us that we don't already have? What problem does this let us avoid?
There are several answers to this question.
The more narrow answer is that this allows the dynamic task generation hyperparameters themselves to shift in a direction that promotes general competence. Neither of the optimization levels beneath us includes any way of changing these parameters. But the ideal filtering parameters for producing general competence might be different at the beginning of training than in the middle. Or they might be different from agent to agent. Without something like population-based training, they would have no way of changing, and this would hurt performance.
The less narrow answer, I think, is that this ensures that agents are developing broad competence in a way the innermost loop cannot do.
Imagine a military recruit who does very well at some exercises, but only 1/2 of the total set of exercises. So their teacher -- like our dynamic task-generation hyperparameters -- gives them more, harder exercises relating to those routines, and they continue to get better at them, and get harder exercises, and so on. They could still be very bad at the other half of their exercises. They would have failed to develop broad competence even though their teacher kept increasing the difficulty of their exercises as they got better at them.
Similarly, each agent in our population will learn to get better at some distribution of tasks, but without population-based training it might not spread itself broadly across the entire span of that distribution. Like a student who advances wildly at subjects she prefers, while ignoring subjects she is not good at, our agents might not approach our desired ideal of general competence. Population-based training helps prevent this scenario by multiplying agent / teacher pairs that do well generally and non-narrowly.
At least that's the theory. DeepMind did perform experiments where they removed PBT, and the performance of the agents on the tasks they were worst at fell, as the theory would predict. As I eyeball it, though, it looks like... it was not an enormous change? But we can be sure PBT is at least doing something.
After training an RL agent on an ever-shifting basket of tasks for a while, its improvement slows down. It stops learning quickly. It might even regress. It's enormously tempting to analogize this to how humans, too, learn more slowly after some time, but that would probably be counter-productive to understanding. Regardless of why it happens, the fact remains that it does happen.
To combat this problem, DeepMind uses generational training. It works like this.
After a large number of iterations of steps (1) through (3), you stop training your agents. You spin up an entirely new batch of agents, with their neural networks initialized to random values, and start training them.
The difference is that this time, for the first few billion steps of training, you aren't just training the agents to maximize reward -- you train them to output a probability distribution over actions that matches the probability distribution of the best agent from the prior generation. In other words, your agents are trained to match what the prior agent would do in those tasks. After a few billion steps, you stop trying to imitate the prior agent -- but by then, you've boosted the performance of your student to far past where it would otherwise be. And after further reinforcement learning, the "student" agent generally exceeds the "teacher" agent with which it was initially trained.
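The imitation phase can be summarized by a distillation loss that pulls the student's action distribution toward the teacher's. Here is a minimal sketch; the actual training objective combines a term like this with the ordinary RL loss and anneals the imitation term away over the first few billion steps:

```python
import math

def kickstart_loss(student_probs, teacher_probs, eps=1e-8):
    """KL divergence KL(teacher || student) over a discrete action space.

    Minimizing this pushes the student to assign probability to actions
    the way the trained teacher does; eps guards against log(0)."""
    return sum(
        t * math.log((t + eps) / (s + eps))
        for t, s in zip(teacher_probs, student_probs)
    )
```

When the student's distribution matches the teacher's, the loss is zero; the more the student's action probabilities diverge from the teacher's, the larger it grows, so gradient descent on this term reproduces the teacher's behavior.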
Why is this important? Well, two reasons.
First: It simply improves performance, in a generic way that has nothing to do with the open-ended learning specific to this paper; it helps combat a weakness of current RL techniques. One of the initial papers on "kickstarting" agents by training them to imitate other, already-trained agents showed that the kickstarted agents eventually exceeded the performance of the un-kickstarted agents by 42%. The effect here is less dramatic, but DeepMind's ablation shows that agents which learned from prior agents eventually exceed the performance of those from which they learned, even when those from which they learned were trained for longer. RL is hard, and DeepMind wants any performance gain it can get, even if the gain is generic and has little to do with open-ended learning, and this provides it.
Second: Generational training provides a convenient way to switch lower-level objectives as generations pass, in order to promote more general learning.
For instance, for the first two of the five generations in their experiment, the objective against which PBT selects focuses even more strongly on acceptable performance across many tasks than it does in the subsequent three generations: in the first two, it optimizes only for increasing the percentage of tasks on which the agent gets any reward at all. So having generations provides a convenient switching-point for changing lower-level features of the training.
I can now review the training process.
At the lowest level, each agent runs through tasks, slowly increasing the probability of the kind of action which is followed by reward and decreasing the probability of the kind of action not followed by reward. The tasks on which each agent trains are dynamically generated to be neither too hard nor too easy; as the agent becomes more intelligent, the tasks become correspondingly harder. Periodically, two agents are compared according to a measure of broad competence. Agents, and dynamic task-generation hyper parameters corresponding to the agents, are removed if they perform poorly, and copied with mutations if they perform well. Over even longer periods, entire generations of agents are removed, and the best used to pre-train subsequent generations of agents.
This goes on for five generations. Daniel Kokotajlo estimated very approximately that this cost about half a million dollars to train. What does that half million get us, in terms of performance?
In short, something probably significantly stupider than a mouse, but which is still among the smartest artificially intelligent agents that have been created, by conventional human standards of intelligence. I’m tempted to say it’s the first artificially intelligent agent it would make sense to compare to a mouse.
Well, as mentioned, the agents can solve -- or at least get non-zero reward in -- non-trivial problems they have never seen before. Here are some of the hand-made games the agents were able to score in, despite never having seen them before, with DeepMind's names and descriptions:
Catch Em All: A cooperative game where 2 players must gather 5 objects from around the world and bring them together.
Object Permanence Black Cube: The agent starts and can see a yellow cube on the left and a black cube on the right. The agent must choose which path to take (which means the agent loses sight of the cubes) to reach the target cube (black in this case).
Stop Rolling Freeze Gadget: In a world composed of one big slope, the agent must stop the purple sphere from touching the bottom floor without holding it.
Race To Clifftop With Orb: Both players must stand on the top of a cliff edge holding the yellow sphere, without the other player standing on the cliff.
I should note that when acting in this zero-shot capacity, the agents do not perceive the reward that they receive when they accomplish their goal. That is, they must identify when they have accomplished the goal, because the numeric reward that they are given only alters them when they are training -- it isn't perceived by them during their activity. This makes all of the above at least marginally more impressive; they cannot merely fiddle with things until they receive a reward, then stop.
But we also should look into the darkness. Here are some problems the agents cannot get reward in:
Tool Use Climb 1: The agent must use the objects to reach a higher floor with the target object.
Sheep Herder: The agent must make the other player stand near a set of target objects.
Make Follow Easy: The agent must lead the other player (whose policy is to follow) to a particular floor colour.
It's hard to know how much to make of these failures.
On one hand, a human could easily execute them. On the other hand, they are very-out-of-domain of the set of training worlds. "Tool use climb", for instance, apparently involves building multiple ramps to reach a location -- the world-generation procedure explicitly excludes worlds where ramps are required, so the agent would never have seen any such world. I very strongly suspect that if given a slightly more expansive training environment, with places that required a ramp to reach them, then these and similar tasks would have been easily accomplished by the agent.
This possibility-of-further-capacity is reflected by how well the agent does on the dynamically-generated, non-human-generated held-out test set. The agent was able to accomplish some reward on one hundred percent of the tasks where it is possible to receive reward. (Some percent of the tasks were impossible.)
Given that, I very strongly expect that this paper does not expose the upper limit of intelligence of agents trained in this manner. Like GPT-N, it presents us with a curve reaching up and to the right, where our prior theories about the world are likely to determine what shape we think that curve has.
There are some specific features of this work I was a little disappointed by, or would like to see remedied in future works. Here is a partial list. This section is very mildly more technical than the previous sections.
As mentioned above, goals are specified for the agents as sets of "and"s joined by "or"s. So goals are things like "(A and B) or (C and D)" or "(A and B and C) or (D and E) or (F)". The agent cannot receive goals specified in another fashion: the neural network is not built to be able to understand a goal like "((A or B) and (C or D))". And in general the grammar that it understands is quite limited.
This makes some of the accomplishments of the agents a little less impressive. DeepMind has a section explaining how the agent is able to switch between alternatives: it is able to start to try to accomplish one branch of a disjunction, but then to switch to another when it realizes there is an easier way to be rewarded. The way that its brain has been built very specifically to understand such a disjunction makes this accomplishment much, much less impressive.
I nevertheless anticipate that future versions of this work will try to accommodate a more expansive grammar.
The neural network trained through policy optimization has two parts. The recurrent part of it rolls up all of the observations that the agent has made up until that point in the episode; it corresponds to a kind of general awareness of the situation. The non-recurrent part of it receives information about the goal and then queries the recurrent part of it to help figure out what action in that situation leads to the most reward.
During training, the recurrent part of the neural network is trained to predict the truth of all the predicates involved in the goal. This feels like cheating to me; it certainly would be impossible with a sufficiently more expressive grammar defining the goal. I'd be interested in how much the sample-efficient training of the agent depends on this auxiliary goal during training. And I'd be very, very interested in efforts to replace it.
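As a sketch, that auxiliary objective is essentially a binary cross-entropy over the truth values of the goal's predicates; this is an illustration of the idea, not the paper's exact formulation:

```python
import math

def predicate_prediction_loss(predicted, actual, eps=1e-8):
    """Binary cross-entropy over the truth of each atomic predicate in the goal.

    predicted: the network's probability that each predicate currently holds.
    actual: the environment's ground truth (0 or 1) for each predicate.
    This gives the recurrent core a dense training signal on every timestep,
    alongside the sparse 0/1 task reward; eps guards against log(0)."""
    loss = 0.0
    for p, a in zip(predicted, actual):
        loss -= a * math.log(p + eps) + (1 - a) * math.log(1 - p + eps)
    return loss / len(predicted)
```

The worry above, restated: this signal is only available because the goal grammar is simple enough that the environment can label every predicate's truth on every step, which may not survive a richer grammar.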
This is my most theory-laden / questionable criticism.
Eyeballing the numbers on page 21 of the paper, it looks like dynamic task generation is one of the most important components of the whole thing. Performance plunges by a lot when it is removed.
I think this is important, because dynamic task generation occupies much the same space that curiosity does, in terms of functionality. Both help address the problem that current RL algorithms simply cannot solve unaided: very sparsely-rewarded action spaces. In the long run, I think, curiosity is a better bet than curricular learning for actual intelligence. It feels like a core part of intelligence, one we cannot yet create. This paper very neatly sidesteps the need for it, but at some point I think it might be no longer able-to-be-sidestepped.