A Quick Primer on Self-Play in Deep Reinforcement Learning

Learning from yourself?

4 min read · Jun 20, 2021

“Train tirelessly to defeat the greatest enemy, yourself, and to discover the greatest master, yourself”
- Amituofo

DeepMind has created AI that plays Go, chess, and shogi at a superhuman level and that reached Grandmaster level in StarCraft II. OpenAI has made similar strides in complex strategy games, notably in Dota 2. The agents in these games all achieved mastery using deep reinforcement learning. Yet this is only part of the story. What was the secret sauce that sent these systems’ playing ability out of the atmosphere? A simple framework called self-play, where your opponent is yourself.

What is self-play?

Self-play is a framework where an agent learns to play a game by playing against itself. That is, during training the agent uses its current and past selves as sparring partners. DeepMind’s AlphaGo Zero used this technique [2]: starting from random play and given only the rules, it learned to master Go purely by playing against itself. This contrasts with the original AlphaGo, which first learned from records of expert human games before improving further through self-play.
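To make the "current and past selves as sparring partners" idea concrete, here is a minimal sketch in Python. It is not how AlphaGo Zero works (no neural network, no tree search). It assumes a toy game, one-heap Nim (take 1 to 3 stones, whoever takes the last stone wins), and a simple tabular learner. The names `NimAgent`, `play_game`, and `self_play_train` are made up for this illustration.

```python
import copy
import random


class NimAgent:
    """Tabular agent for one-heap Nim (hypothetical toy example)."""

    def __init__(self, epsilon=0.1, lr=0.1):
        self.q = {}  # (stones_left, action) -> estimated outcome for the mover
        self.epsilon = epsilon
        self.lr = lr

    def legal(self, stones):
        return [a for a in (1, 2, 3) if a <= stones]

    def act(self, stones, rng, explore=True):
        moves = self.legal(stones)
        if explore and rng.random() < self.epsilon:
            return rng.choice(moves)  # occasional random move
        return max(moves, key=lambda a: self.q.get((stones, a), 0.0))

    def update(self, history, reward):
        # Monte Carlo style update: nudge each visited pair toward the result.
        for key in history:
            old = self.q.get(key, 0.0)
            self.q[key] = old + self.lr * (reward - old)


def play_game(learner, opponent, rng, start=10):
    """Play one game; return the learner's (state, action) pairs and +1/-1."""
    stones, history = start, []
    learner_to_move = rng.random() < 0.5  # coin flip for who moves first
    while True:
        if learner_to_move:
            action = learner.act(stones, rng)
            history.append((stones, action))
        else:
            action = opponent.act(stones, rng)
        stones -= action
        if stones == 0:  # whoever just moved took the last stone and wins
            return history, (1.0 if learner_to_move else -1.0)
        learner_to_move = not learner_to_move


def self_play_train(episodes=2000, snapshot_every=200, seed=0):
    rng = random.Random(seed)
    learner = NimAgent()
    pool = [copy.deepcopy(learner)]  # frozen past selves
    for episode in range(1, episodes + 1):
        # Spar against the current self half the time, a past self otherwise.
        opponent = learner if rng.random() < 0.5 else rng.choice(pool)
        history, reward = play_game(learner, opponent, rng)
        learner.update(history, reward)
        if episode % snapshot_every == 0:
            pool.append(copy.deepcopy(learner))  # freeze a new past self
    return learner, pool


if __name__ == "__main__":
    agent, past_selves = self_play_train()
    print("snapshots in pool:", len(past_selves))
    print("greedy move with 5 stones left:", agent.act(5, random.Random(0), explore=False))
```

The part that matters is the loop in `self_play_train`: no human games go in, and the only opponents the learner ever meets are copies of itself. Keeping a pool of frozen snapshots is one common way to stop the learner from overfitting to its latest self; the 50/50 split between current and past selves is an arbitrary choice for this sketch.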

Why does this change the game?

Self-play is revolutionary; I will discuss just two reasons why. First, it removes human bias from the training data. Consider the original AlphaGo, which was bootstrapped from games played by expert humans. While this original model became exceptional at Go, its starting point was shaped by an expert human’s understanding of the game. What exactly do I mean by that? Well, by learning from expert human games, AlphaGo saw data that was limited and biased. Expert human players, although excellent at the game, bring certain assumptions to the table by virtue of being human. In other words, humans play Go in a human-like way, and that might not be the optimal way to play the game. An agent that learns only by playing against itself can explore strategies unconstrained by human bias or assumptions. This freedom has led AI to develop innovative and brilliant strategies unseen by human experts. AlphaGo Zero’s Go strategy was so creative, innovative, and foreign that professional Go player Mok Jin-seok described his experience with AlphaGo Zero as like playing against an alien.

Besides unconstrained creativity, self-play can be a mechanism for addressing the Problem Problem, an obstacle in AI research described by Leibo et al. [1]. As an agent continues to improve in an environment, its progress can slow because that very environment becomes less challenging. In response, researchers and engineers must design a more difficult environment to keep the agent improving. That is, the rate at which an agent can improve is limited by researchers’ ability to develop novel and challenging environments. This is the Problem Problem.

In the AlphaGo example, we can consider the ‘environment’ to include the game board and the opponent. Suppose that opponent is a fixed set of expert Go players: the agent’s progress is then limited by their skill. Once the algorithm is better than these experts, wins come easily and it becomes difficult for it to keep improving. It is here that we run into the Problem Problem. But self-play can be an antidote. In self-play, the environment improves as the agent improves, since the agent is playing against itself. This forms what is known as an autocurriculum [1]: the idea of an environment that changes and grows more challenging based on how the agent interacts with it. In theory, an autocurriculum is constrained only by computational resources. While self-play faces challenges of its own when attempting to solve the Problem Problem, it is a framework that has shown a ton of promise.
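A toy calculation can make the contrast between a fixed opponent and an opponent that improves with you easier to see. The sketch below uses the standard Elo expected-score formula for the win probability. Everything else is an assumption made purely for illustration: in particular, that the agent's improvement per game is proportional to p(1 - p), so it learns the most from evenly matched games and almost nothing from games it is nearly certain to win. Neither the article nor reference [1] proposes this model, and the function names and numbers are made up.

```python
def win_prob(agent_rating, opponent_rating):
    """Standard Elo expected score of the agent against the opponent."""
    return 1.0 / (1.0 + 10 ** ((opponent_rating - agent_rating) / 400.0))


def train(steps, opponent_fn, gain=40.0, start=1000.0):
    """Toy training loop; returns the agent's rating after every step.

    ASSUMPTION (illustrative only): improvement per game is proportional
    to 4 * p * (1 - p), which peaks when the game is evenly matched.
    """
    rating = start
    trace = []
    for _ in range(steps):
        p = win_prob(rating, opponent_fn(rating))
        rating += gain * 4.0 * p * (1.0 - p)
        trace.append(rating)
    return trace


# Fixed environment: the opponent stays at a constant skill level.
fixed = train(200, lambda agent_rating: 1500.0)

# Self-play: the opponent is always exactly as strong as the agent.
selfplay = train(200, lambda agent_rating: agent_rating)

if __name__ == "__main__":
    print("final rating vs fixed opponent:", round(fixed[-1]))
    print("final rating with self-play:   ", round(selfplay[-1]))
    print("last-step gain vs fixed:       ", round(fixed[-1] - fixed[-2], 2))
    print("last-step gain with self-play: ", round(selfplay[-1] - selfplay[-2], 2))
```

Under these assumptions, the agent facing the fixed opponent improves quickly until it passes that opponent, after which each game teaches it less and less. The self-play agent always faces an even match, so its per-game gain never shrinks. Real training is far messier than this, but it is the shape of the Problem Problem and of the autocurriculum that answers it.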

Wrapping it up

So far, we have touched on what self-play is and why it is important. It would not be a Pierre post if I did not speculate at least a little on the potential of self-play going forward. Here are a few questions that come to mind when I think about the future of self-play:

- Where else can self-play be implemented to show humans a novel or innovative way of doing something?

- What other approaches exist in the universe that are impossible for humans to imagine?

- Could even Sun Tzu have learned something from one of these AIs?

- In what other domains can AI teach us a better way of doing things?

References

[1]: Joel Z. Leibo, Edward Hughes, Marc Lanctot, and Thore Graepel, Autocurricula and the Emergence of Innovation from Social Interaction: A Manifesto for Multi-Agent Intelligence Research (2019), https://arxiv.org/abs/1903.00742

[2]: David Silver and Demis Hassabis, AlphaGo Zero: Starting from Scratch (2017), https://deepmind.com/blog/article/alphago-zero-starting-scratch