The effectiveness of any machine learning method is critically dependent on its training data. In the case of reinforcement learning (RL), one can rely either on limited data collected by an agent interacting with the real world, or a simulated training environment that can be used to collect as much data as needed. This latter method of training in simulation is increasingly popular, but it has a problem — the RL agent can learn what is built into the simulator, but tends to be bad at generalizing to tasks that are even slightly different than the ones simulated. And obviously building a simulator that covers all the complexity of the real-world is extremely challenging.
An approach to address this is to automatically create more diverse training environments by randomizing all the parameters of the simulator, a process called domain randomization (DR). However, DR can fail even in very simple environments. For example, in the animation below, the blue agent is trying to navigate to the green goal. The left panel shows an environment created with DR where the positions of the obstacles and goal have been randomized. Many of these DR environments were used to train the agent, which was then transferred to the simple Four Rooms environment in the middle panel. Notice that the agent can’t find the goal. This is because it has not learned to walk around walls. Even though the wall configuration from the Four Rooms example could have been generated randomly in the DR training phase, it’s unlikely. As a result, the agent has not spent enough time training on walls similar to the Four Rooms structure, and is unable to reach the goal.
Instead of just randomizing the environment parameters, one could train a second RL agent to learn how to set the environment parameters. This minimax adversary can be trained to minimize the performance of the first RL agent by finding and exploiting weaknesses in its policy, e.g. building wall configurations it has not encountered before. But again there is a problem. The right panel shows an environment built by a minimax adversary in which it is actually impossible for the agent to reach the goal. While the minimax adversary has succeeded in its task — it has minimized the performance of the original agent — it provides no opportunity for the agent to learn. Using a purely adversarial objective is not well suited to generating training environments, either.
In collaboration with UC Berkeley, we propose a new multi-agent approach for training the adversary in “Emergent Complexity and Zero-shot Transfer via Unsupervised Environment Design”, a publication recently presented at NeurIPS 2020. In this work we present an algorithm, Protagonist Antagonist Induced Regret Environment Design (PAIRED), that is based on minimax regret and prevents the adversary from creating impossible environments, while still enabling it to correct weaknesses in the agent’s policy. PAIRED incentivizes the adversary to tune the difficulty of the generated environments to be just outside the agent’s current abilities, leading to an automatic curriculum of increasingly challenging training tasks. We show that agents trained with PAIRED learn more complex behavior and generalize better to unknown test tasks. We have released open-source code for PAIRED on our GitHub repo.
To flexibly constrain the adversary, PAIRED introduces a third RL agent, which we call the antagonist agent, because it is allied with the adversarial agent, i.e., the one designing the environment. We rename our initial agent, the one navigating the environment, the protagonist. Once the adversary generates an environment, both the protagonist and antagonist play through that environment.
The adversary’s job is to maximize the antagonist’s reward while minimizing the protagonist’s reward. This means it must create environments that are feasible (because the antagonist can solve them and get a high score), but challenging to the protagonist (exploit weaknesses in its current policy). The gap between the two rewards is the regret — the adversary tries to maximize the regret, while the protagonist competes to minimize it.
The methods discussed above (domain randomization, minimax regret and PAIRED) can be analyzed using the same theoretical framework, unsupervised environment design (UED), which we describe in detail in the paper. UED draws a connection between environment design and decision theory, enabling us to show that domain randomization is equivalent to the Principle of Insufficient Reason, the minimax adversary follows the Maximin Principle, and PAIRED is optimizing minimax regret. This formalism enables us to use tools from decision theory to understand the benefits and drawbacks of each method. Below, we show how each of these ideas works for environment design:
What’s interesting about minimax regret is that it incentivizes the adversary to generate a curriculum of initially easy, then increasingly challenging environments. In most RL environments, the reward function will give a higher score for completing the task more efficiently, or in fewer timesteps. When this is true, we can show that regret incentivizes the adversary to create the easiest possible environment the protagonist can’t solve yet. To see this, let’s assume the antagonist is perfect, and always gets the highest score that it possibly can. Meanwhile, the protagonist is terrible, and gets a score of zero on everything. In that case, the regret just depends on the difficulty of the environment. Since easier environments can be completed in fewer timesteps, they allow the antagonist to get a higher score. Therefore, the regret of failing at an easy environment is greater than the regret of failing on a hard environment:
So, by maximizing regret the adversary is searching for easy environments that the protagonist fails to do. Once the protagonist learns to solve each environment, the adversary must move on to finding a slightly harder environment that the protagonist can’t solve. Thus, the adversary generates a curriculum of increasingly difficult tasks.
We can see the curriculum emerging in the learning curves below, which plot the shortest path length of a maze the agents have successfully solved. Unlike minimax or domain randomization, the PAIRED adversary creates a curriculum of increasingly longer, but possible, mazes, enabling PAIRED agents to learn more complex behavior.
But can these different training schemes help an agent generalize better to unknown test tasks? Below, we see the zero-shot transfer performance of each algorithm on a series of challenging test tasks. As the complexity of the transfer environment increases, the performance gap between PAIRED and the baselines widens. For extremely difficult tasks like Labyrinth and Maze, PAIRED is the only method that can occasionally solve the task. These results provide promising evidence that PAIRED can be used to improve generalization for deep RL.
Admittedly, these simple gridworlds do not reflect the complexities of the real world tasks that many RL methods are attempting to solve. We address this in “Adversarial Environment Generation for Learning to Navigate the Web”, which examines the performance of PAIRED when applied to more complex problems, such as teaching RL agents to navigate web pages. We propose an improved version of PAIRED, and show how it can be used to train an adversary to generate a curriculum of increasingly challenging websites:
Above, you can see websites built by the adversary in the early, middle, and late training stages, which progress from using very few elements per page to many simultaneous elements, making the tasks progressively harder. We test whether agents trained on this curriculum can generalize to standardized web navigation tasks, and achieve a 75% success rate, with a 4x improvement over the strongest curriculum learning baseline:
Deep RL is very good at fitting a simulated training environment, but how can we build simulations that cover the complexity of the real world? One solution is to automate this process. We propose Unsupervised Environment Design (UED) as a framework that describes different methods for automatically creating a distribution of training environments, and show that UED subsumes prior work like domain randomization and minimax adversarial training. We think PAIRED is a good approach for UED, because regret maximization leads to a curriculum of increasingly challenging tasks, and prepares agents to transfer successfully to unknown test tasks.
We would like to recognize the co-authors of “Emergent Complexity and Zero-shot Transfer via Unsupervised Environment Design”: Michael Dennis, Natasha Jaques, Eugene Vinitsky, Alexandre Bayen, Stuart Russell, Andrew Critch, and Sergey Levine, as well as the co-authors of “Adversarial Environment Generation for Learning to Navigate the Web”: Izzeddin Gur, Natasha Jaques, Yingjie Miao, Jongwook Choi, Kevin Malta, Manoj Tiwari, Honglak Lee, Aleksandra Faust. In addition, we thank Michael Chang, Marvin Zhang, Dale Schuurmans, Aleksandra Faust, Chase Kew, Jie Tan, Dennis Lee, Kelvin Xu, Abhishek Gupta, Adam Gleave, Rohin Shah, Daniel Filan, Lawrence Chan, Sam Toyer, Tyler Westenbroek, Igor Mordatch, Shane Gu, DJ Strouse, and Max Kleiman-Weiner for discussions that contributed to this work.