
Google DeepMind's Grandmaster-Level Chess Without Search

tl;dr

Google DeepMind released a paper claiming that, without search, a transformer architecture can achieve grandmaster-level play in chess. But what does that mean? I’m giving a paper club talk on the subject, so I decided to write up my own analysis of the paper.

[Paper] | [Code] | [My Slides]


Chess Without* Search

This paper got a fair amount of press, but it seems that much of the discussion surrounding it is confused. This is partially due to the title of the paper, and partially due to the hype-cycle press that surrounds AI research. It isn’t a groundbreaking chess paper, but it is an interesting paper demonstrating the capabilities of self-attention architectures and their ability to grasp complex domains.

To be clear, the model does not understand any rules of chess, and would gleefully select an illegal move if not for a human-coded filter restricting its choices to legal moves.

The model accepts a tokenized FEN notation of the current board state (rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq e3 0 1 after the opening move e4, for instance) and a given move; the model then predicts the action-value function (commonly referred to as Q in reinforcement learning) of that state and move as a bin of values, ranging from 0 to 1 (win probability), with increments determined by the number of bins. 128 bins were used for most of the paper, save for hyperparameter tests on the bin size itself. The bins are treated as one-hot encoded values, and an argmax over them yields the predicted action-value of the given action in the given state.
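To make the binning concrete, here is a minimal sketch (my own illustration, not the paper’s code) of discretizing a win probability into K uniform bins and decoding a prediction back out via argmax; `NUM_BINS`, `to_bin`, and `from_bin` are hypothetical names.

```python
import numpy as np

NUM_BINS = 128  # the paper's default; other sizes appear only in ablations

def to_bin(win_prob: float) -> int:
    """Map a win probability in [0, 1] to its one-hot bucket index."""
    return min(int(win_prob * NUM_BINS), NUM_BINS - 1)

def from_bin(logits: np.ndarray) -> float:
    """Decode model output: argmax over the K bins, then take the bucket midpoint."""
    idx = int(np.argmax(logits))
    return (idx + 0.5) / NUM_BINS

# Example: a 73% win probability lands in bucket 93 of 128.
assert to_bin(0.73) == 93
```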

To build their dataset, the authors took 10 million games from lichess and extracted every board state from every game. They then took all legal moves from each state and asked Stockfish [Wikipedia], with a bounded time limit (50ms), to evaluate each resulting position, which is converted to a win percentage. The 10 million games, expanded per board state and then per legal move, explode into a dataset of over 15 billion action-value pairs.
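As a rough sketch of how one position fans out into many action-value pairs, the loop below uses python-chess and a local Stockfish binary to score every legal move with a 50ms budget; this is my approximation of the pipeline, not the authors’ code, and `records` is just an illustrative container.

```python
import chess
import chess.engine

engine = chess.engine.SimpleEngine.popen_uci("stockfish")
board = chess.Board()  # in the paper, a state drawn from a lichess game
state_fen = board.fen()

records = []
for move in list(board.legal_moves):
    board.push(move)
    info = engine.analyse(board, chess.engine.Limit(time=0.05))  # 50ms cap
    # Centipawn score from the perspective of the side that just moved;
    # forced mates are handled separately (see the endgame discussion below).
    centipawns = info["score"].pov(not board.turn).score(mate_score=100_000)
    records.append((state_fen, move.uci(), centipawns))
    board.pop()

engine.quit()
```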

To utilize the trained model for game play, it is presented with the current board state paired with each possible legal move; the associated Q value is predicted for every state-move pair, and the move with the highest predicted value is chosen.
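A sketch of that play-time loop, under the same assumptions (the hypothetical `predict_win_prob` stands in for the trained model, taking a FEN and a move and returning the decoded Q value):

```python
import chess

def select_move(board: chess.Board, predict_win_prob) -> chess.Move:
    """Score every legal move with the model and play the argmax."""
    best_move, best_q = None, -1.0
    for move in board.legal_moves:
        q = predict_win_prob(board.fen(), move.uci())  # one forward pass per move
        if q > best_q:
            best_move, best_q = move, q
    return best_move
```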

As quoted in the paper, José Raúl Capablanca once said, “I see only one move ahead, but it is always the correct one”, a fitting description of the model’s success. It has no concept of the rules of chess and no ability to plan or contemplate its past moves; it simply estimates Q values.

Without Explicit Search

I was excited when I originally heard of the paper. Search is crucial to performant reinforcement learning in complex domains. Noam Brown (who helped create a superhuman poker model and also a superb human-level Diplomacy AI) gave an excellent breakdown of the effect of search on scaling in reinforcement learning in this podcast. Proper human-coded search (ie a well implemented MCTS or similar algorithm, combined with some heuristics to make the search more efficient where applicable) results in a dramatic increase in RL model performance, comparable to scaling both model size and training compute/data by orders of magnitude. Since scaling the model, training compute, and data is costly, search has made many of the recent headline-grabbing achievements in RL possible.

My initial naive excitement was around the idea that perhaps self-attention mechanisms have some property that lets them surpass the stability limits of prior RL models; a breakthrough that would get around the requirement of search in these complex domains. This paper does nothing to upend that requirement, however.

While the paper is titled Without Search, the authors specifically utilize the term explicit search within the paper. This distinction is important. Yes, the model is not performing a search during inference (though some might argue that limiting inference to only legal moves is, in itself, a form of search), but search was performed to build the dataset itself. The paper’s approach is an application of knowledge distillation [1] [2], wherein a larger model (here, Stockfish’s evaluation engine) is utilized to train a smaller model.

Stockfish’s engine builds out a tree of possible moves, expanding each continuation for thousands if not millions of positions (a 4-core CPU evaluates about 5 million moves per second, so ~250k moves in our 50ms), then upon completion (or, more realistically, at the 50ms cutoff) reports a score for the given position. This resulting number is all our target model trains on - nothing about chess itself. (Ackshually, Stockfish outputs a score in centipawns - hundredths of a pawn’s value - which the authors convert to a win percentage with the standard formula below:)

$$ \mathrm{win\%} = 0.5 \cdot \frac{2}{1 + e^{-0.00368208 \cdot \mathrm{centipawns}}} $$
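Transcribed into code for clarity (my transcription of the formula above, with an illustrative function name):

```python
import math

def centipawns_to_win_prob(centipawns: float) -> float:
    """Convert a Stockfish centipawn score to a win probability in [0, 1]."""
    return 0.5 * 2.0 / (1.0 + math.exp(-0.00368208 * centipawns))

# A level position (0 centipawns) maps to 0.5; +200 centipawns to roughly 0.68.
```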

Tradeoffs

Given the design choice of utilizing the model as a Q value approximator, this approach has some inherent problems: some arise from game mechanics the model cannot see, and others are a consequence of distilling Stockfish.

Since the model has no concept of how a position was reached (it has no memory or input of prior states), it can neither understand nor detect three-fold repetition. To combat this, the legal move list is coded to detect whether a move would trigger three-fold repetition, and if so the corresponding win percentage is set to 50% rather than estimated. This allows the model to take the draw if it is indeed the best move, but to avoid it otherwise. There are positions, however, where the repetition is difficult to avoid, and the model’s inability to look ahead can rush it towards such a catastrophe. Castling and en passant, both moves that also depend on the prior history of the game, are properly encoded into FEN notation, so the model can handle them.
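My reading of that workaround, sketched with python-chess (not the authors’ code); note that the check only works if `board` carries the game’s move stack, since repetition depends on history the FEN alone cannot express:

```python
import chess

def q_value_with_repetition_guard(board: chess.Board, move: chess.Move,
                                  predict_win_prob) -> float:
    """Pin a move's value to 0.5 if it would allow a threefold-repetition claim."""
    board.push(move)
    would_repeat = board.can_claim_threefold_repetition()
    board.pop()
    if would_repeat:
        return 0.5
    return predict_win_prob(board.fen(), move.uci())
```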

The authors also noted a quirk in Stockfish’s evaluation that occasionally affected endgame performance. Typically Stockfish evaluates a position as a centipawn score, but if it detects a forced mate-in-k, it instead outputs k. To adjust, the authors set the action-value to the maximum bin’s win percentage. If the position is strong enough that multiple moves lead to a checkmate, they are all valued identically, even if one move is better than another. Since the model has no history or ability to plan, it may move in odd ways, failing to complete a mate-in-k sequence and instead jumping between multiple options. There may also be moves rated in the highest value bin that do not lead to a guaranteed checkmate, which allows the opponent to escape their situation. The authors attempt to combat this by consulting Stockfish to compare moves in these situations, which introduces search into the AI.
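A small sketch of that mate-score adjustment, reusing the hypothetical `centipawns_to_win_prob` from earlier and python-chess’s score types:

```python
import chess.engine

def to_win_prob(score: chess.engine.Score) -> float:
    """Clamp forced mates to the extreme bins; otherwise convert centipawns."""
    if score.is_mate():
        # Mate-in-1 and mate-in-9 both land in the top bin, hence the
        # wandering endgame behavior described above.
        return 1.0 if score.mate() > 0 else 0.0
    return centipawns_to_win_prob(score.score())
```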

What can we learn from it?

The key take-away from this paper is not that we can build AIs that are absurdly strong at the game of chess - that has been well known for decades now. Instead, this paper suggests that modern self-attention mechanisms can serve as effective approximation functions for enormously complicated domains. We can possibly distill the decision-making processes of complex algorithms that take significant compute time into models whose inference time is flat. The success of this model - achieving a grandmaster rating (then confirming it by beating human grandmasters and other strong players) - demonstrates that such approximations can be quite effective.

Consider the benefits of the model’s flat, constant inference time. (Yes, self-attention mechanisms are quadratic in sequence length, O(n^2), but since the input in this application is always a fixed number of tokens, each evaluation takes a fixed amount of time.) Stockfish’s computation time to evaluate a position varies with the complexity of the board state and the size of the space of legal moves as it branches out. The model instead takes a fixed amount of time to evaluate the board state. While this is not significant in this instance - the hard limit of 50ms of computation time for Stockfish is too short to notice a true benefit in this regard - imagine such an evaluation function being applied to more complex, time- and compute-intensive domains.