
Learning Reinforcement Learning - Part 3

Nov 1, 2019

In my previous post we implemented and tested several basic reinforcement learning agents. In this post we take a short break from deep learning and explore the FFAI Blood Bowl gym environment. We examine the state space and action space, and we manually run through a couple of actions, exploring the changing state as we go. All of the code I used to generate the plots in this post is available in this Jupyter notebook.

Exploring the State Space

The FFAI Blood Bowl environment state consists of three components: the board, game state, and game procedure. The board is a $C \times H \times W$ tensor with $C = 28$ features and a board size of $H \times W = 17 \times 28$. Below is a plot of each feature after running a game a little bit past the setup phase.

Some of these data are binary, such as which spaces are occupied, the location of your endzone, which players have a certain skill (block, dodge), which player is currently active, and the available positions that the active player can move to. Other data are scaled between zero and one: the values of the various player attributes, the number of players contributing to a given tackle zone, and the dodge roll probabilities.
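If you would like to poke at the layers yourself, something along the following lines reproduces a grid of feature plots. This is only a sketch: the environment id 'FFAI-v1' and the assumption that observation['board'] is a mapping from layer names to 2-D arrays are mine, so adapt both to whatever your FFAI version exposes.

import gym
import ffai  # importing ffai registers the FFAI gym environments
import numpy as np
import matplotlib.pyplot as plt

environment = gym.make('FFAI-v1')  # environment id may differ between FFAI versions
observation = environment.reset()

# Assumption: the board component is a mapping from layer name to a 2-D
# array of shape (17, 28). If your version exposes a stacked (28, 17, 28)
# array instead, iterate over its first axis.
board = observation['board']

fig, axes = plt.subplots(4, 7, figsize=(20, 10))
for ax, (name, layer) in zip(axes.flat, board.items()):
    ax.imshow(np.asarray(layer))
    ax.set_title(name, fontsize=8)
    ax.axis('off')
plt.tight_layout()
plt.show()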

For example, below is a plot of our own players in dark red, the opposing players in orange, and the roll probabilities of the active player in blue.

We see that this is consistent with the movement rules from the Blood Bowl Living Rule Book v6.0, where the probability of a successful dodge decreases with the number of opposing tackle zones on the square that the player is dodging into.

While the board state contains various geometric data, the game state is a $50 \times 1$ vector containing non-geometric, higher-level, normalized information about the game. For example, the features half and round tell us where we are in the current game, own score and opp score store the number of touchdowns scored by each team, and is blitz available and is blitz let us know that the blitz action can be taken and that it is currently being used, respectively. Finally, the game procedure is a $19 \times 1$ one-hot vector encoding which part of the game we are currently in. Every new game begins in the StartGame procedure, represented by the vector $(1, 0, \ldots, 0) \in \mathbb{R}^{19}$. It then transitions into the CoinTossFlip, CoinTossKickReceive, Setup, and PlaceBall procedures before finally entering the Turn procedure, during which the bulk of the game takes place. During the turn, however, there are multiple occurrences of the PlayerAction, Block, Push, and FollowUp procedures, where the player needs to decide whom to block, which square to push the opponent into if the block succeeds, and whether or not to follow up. In other words, these procedures encode when only certain subsets of actions are allowed at a given time.
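Decoding the active procedure amounts to finding the hot index. Continuing with the observation from the earlier snippet, a minimal sketch (the key name 'procedures' is my assumption; check how your FFAI version names it):

import numpy as np

# Assumption: the 19-dimensional one-hot procedure vector sits under the
# 'procedures' key of the observation dict.
procedure_vector = np.asarray(observation['procedures'])
active_index = int(np.argmax(procedure_vector))
print('active procedure index:', active_index)  # e.g. 0 == StartGame at the very beginning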

Given the state of the board above, it turns out that we are in the PlayerAction procedure, meaning that the currently active player has a variety of actions available to them, like moving and blocking. The game state is:

>>> observation['state']
{'half': 0.0,
 'round': 0.125,
 'is sweltering heat': 0.0,
 'is very sunny': 0.0,
 'is nice': 0.0,
 'is pouring rain': 0.0,
 'is blizzard': 0.0,
 'is own turn': 1.0,
 'is kicking first half': 1.0,
 'is kicking this drive': 1.0,
 'own reserves': 0.0625,
 'own kods': 0.0,
 'own casualites': 0.0,
 'opp reserves': 0.0625,
 'opp kods': 0.0,
 'opp casualties': 0.0,
 'own score': 0.0,
 'own turns': 0.125,
 'own starting rerolls': 0.375,
 'own rerolls left': 0.5,
 'own ass coaches': 0.125,
 'own cheerleaders': 0.25,
 'own bribes': 0.0,
 'own babes': 0.0,
 'own apothecary available': 0.0,
 'own reroll available': 1.0,
 'own fame': 1,
 'opp score': 0.0,
 'opp turns': 0.125,
 'opp starting rerolls': 0.375,
 'opp rerolls left': 0.375,
 'opp ass coaches': 0.125,
 'opp cheerleaders': 0.25,
 'opp bribes': 0.0,
 'opp babes': 0.0,
 'opp apothecary available': 0.0,
 'opp reroll available': 1.0,
 'opp fame': 0,
 'is blitz available': 1.0,
 'is pass available': 0.0,
 'is handoff available': 0.0,
 'is foul available': 1.0,
 'is blitz': 0.0,
 'is quick snap': 0.0,
 'is move action': 0.0,
 'is block action': 0.0,
 'is blitz action': 0.0,
 'is pass action': 0.0,
 'is handoff action': 1.0,
 'is foul action': 0.0}

As mentioned, the game has only been simulated a little bit past setup so we are still in the first half (half == 0.0) and the second round (round == 0.125). The current game has normal weather effects and it is currently our turn, which we already knew based on the current roll probabilities. We elected to kick during setup and the current drive is a result of the kickoff reception, as opposed to inducing a fumble from an opponent’s drive.

Exploring the Action Space

Actions in FFAI are encoded as action type and position tuples, $(a, x, y)$. The action type $a$ is one of thirty-seven possible actions, such as START_GAME, USE_REROLL, SELECT_PLAYER, MOVE, and BLOCK. Of these thirty-seven actions, twenty-two use the positional input $(x, y)$. For example, the coordinates of SELECT_PLAYER indicate which player is selected for the next action. Then, once that player is selected, they can MOVE to one of potentially eight adjacent locations on the board, as long as those locations are unoccupied. The input coordinates indicate which location the player moves into. Positional input is ignored by the other actions, like START_GAME and USE_REROLL.
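Since the environment consumes actions as plain dictionaries (as we will see in the walkthrough below), a small helper makes the tuple structure explicit. The helper is my own convenience function, not part of FFAI:

def make_action(action_type, x=-1, y=-1):
    """Build an FFAI gym action dict from an (action type, position) tuple.

    Action types that ignore position (e.g. START_GAME, USE_REROLL) never
    read x and y, so -1 is used as a harmless placeholder.
    """
    return {'action-type': action_type, 'x': x, 'y': y}

# A positional action: start a move with the player standing at (14, 11).
select_player = make_action(27, x=14, y=11)  # 27 == START_MOVE in this walkthrough

# A non-positional action: spend a reroll.
use_reroll = make_action(6)                  # 6 == USE_REROLL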

Gameplay Walkthrough

To get a better idea of how the board, state, and procedure interact with the actions, we randomly set up a game and then manually step through several actions. Each step below shows a list of the currently available actions we can take, the current procedure, and a plot of the board. The board plot shows our opponent’s players in orange, our own players in red, and the current roll probabilities in blue, if available. You can generate these plots for yourself by running through this notebook.
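The action listings in each step were produced by querying the environment for its legal actions. A rough sketch of how that might look; the method names available_action_types and available_positions are my assumptions about the FFAI gym wrapper, so check your version:

def print_available_actions(environment):
    # Assumption: the wrapper exposes these two query methods. The exact
    # return types (ints, ActionType enum members, Square objects) vary
    # between FFAI versions, so adapt the formatting as needed.
    for action_type in environment.available_action_types():
        print(action_type)
        for position in environment.available_positions(action_type):
            print('\t', position)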

Action #1: START_MOVE – Immediately after setup we have a long list of available actions. Below we only show the first two action types and their available positions, but other available action types include START_BLITZ, START_PASS, START_HANDOFF (it turns out there was a touchback during the kickoff, so one of our players begins with the ball!), and START_FOUL.

ActionType.START_MOVE (id = 27)
	{'x': 15, 'y': 3}
	{'x': 14, 'y': 5}
	{'x': 14, 'y': 6}
	{'x': 14, 'y': 7}
	{'x': 14, 'y': 8}
	{'x': 17, 'y': 8}
	{'x': 21, 'y': 8}
	{'x': 14, 'y': 9}
	{'x': 14, 'y': 10}
	{'x': 14, 'y': 11}
	{'x': 15, 'y': 13}
ActionType.START_BLOCK (id = 28)
	{'x': 14, 'y': 5}
	{'x': 14, 'y': 6}
	{'x': 14, 'y': 7}
	{'x': 14, 'y': 8}
	{'x': 14, 'y': 9}
	{'x': 14, 'y': 10}
	{'x': 14, 'y': 11}
... (snip) ...
ActionType.END_TURN (id = 8)

Current procedure: Turn (index = 7)

The action types we will explore throughout this example are START_MOVE and START_BLOCK. For START_MOVE, the available positions correspond to our players who are eligible to move. Here we select START_MOVE = 27 for the player at $(x, y) = (14, 11)$. Below we specify the action tuple and step the environment.

action = {
    'action-type': 27, # START_MOVE
    'x': 14,
    'y': 11,
}
observation, reward, is_terminal, _ = environment.step(action)

Action #2: MOVE – The game procedure switches to PlayerAction, indicating that we are currently deciding on the actions of a particular player. Because our previous action was START_MOVE, the only two subsequent available action types are MOVE and END_PLAYER_TURN. The board state now includes roll probabilities for movement success.

ActionType.MOVE (id = 22)
	{'x': 15, 'y': 10}
	{'x': 13, 'y': 11}
	{'x': 15, 'y': 11}
	{'x': 13, 'y': 12}
	{'x': 14, 'y': 12}
	{'x': 15, 'y': 12}
ActionType.END_PLAYER_TURN (id = 5)

Current procedure: PlayerAction (index = 8)

We elect to move the currently selected player to $(x, y) = (13, 12)$. This isn’t necessarily a good strategy (a block is a better action here), but I’m just illustrating some of the game states and the corresponding restricted action space.

action = {
    'action-type': 22,  # MOVE
    'x': 13,
    'y': 12,
}
observation, reward, is_terminal, _ = environment.step(action)

Action #3: USE_REROLL – The probability of successfully moving our player to that position was low, and one of the defensive linemen tripped up our player. The game is now asking whether we would like to try the move again by spending a reroll, as implied by the Dodge procedure. If we don’t, then our player is downed.

ActionType.USE_REROLL (id = 6)
ActionType.DONT_USE_REROLL (id = 7)

Current procedure: Dodge (index = 17)

We elect to use one of our rerolls. The reroll action type does not accept positional input so we can enter whatever we like for x and y.

action = {
    'action-type': 6, # USE_REROLL
    'x': -1,
    'y': -1,
}
observation, reward, is_terminal, _ = environment.step(action)

Action #4: START_BLOCK – It turns out we were unsuccessful in the dodge reroll. We know this in two ways: first, we re-enter the Turn procedure, which allows a wide range of actions with the remaining players on our team. Second, the board features no longer include the player we just moved in the “standing players” feature, though the player is, of course, still present in the “own players” feature. Below, I colored the downed player at $(x, y) = (13, 12)$ in a lighter shade of red.
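We can also check this programmatically by comparing the two layers. A small sketch, reusing the dict-of-layers assumption from earlier (the [y][x] index order is likewise an assumption):

x, y = 13, 12
own = observation['board']['own players']           # layer names as described above
standing = observation['board']['standing players']
print(own[y][x])       # expected 1.0: the square still holds one of our players
print(standing[y][x])  # expected 0.0: that player is no longer standing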

ActionType.START_MOVE (id = 27)
	{'x': 15, 'y': 3}
	{'x': 14, 'y': 5}
	{'x': 14, 'y': 6}
	{'x': 14, 'y': 7}
	{'x': 14, 'y': 8}
	{'x': 17, 'y': 8}
	{'x': 21, 'y': 8}
	{'x': 14, 'y': 9}
	{'x': 14, 'y': 10}
	{'x': 13, 'y': 12}
	{'x': 15, 'y': 13}
ActionType.START_BLOCK (id = 28)
	{'x': 14, 'y': 5}
	{'x': 14, 'y': 6}
	{'x': 14, 'y': 7}
	{'x': 14, 'y': 8}
	{'x': 14, 'y': 9}
	{'x': 14, 'y': 10}
... (snip) ...
ActionType.END_TURN (id = 8)

Current procedure: Turn (index = 7)

Let’s see how blocking plays out. For this action we begin a block with the player at $(x, y) = (14, 5)$. For brevity I will omit the action definitions for the rest of this example.

Action #5: BLOCK – After selecting START_BLOCK with the player at $(x, y) = (14, 5)$ we are given a list of possible targets, which in this case is the single opponent directly opposite the blocker. The board’s roll probabilities feature now shows the probability of success for this block. (Shown in blue in the plot below.)

ActionType.BLOCK (id = 23)
	{'x': 13, 'y': 6}
ActionType.END_PLAYER_TURN (id = 5)

Current procedure: PlayerAction (index = 8)

We block our lone opponent at $(x, y) = (13, 6)$.

Actions #6 and #7: SELECT_DEFENDER_STUMBLES and PUSH – Blocking in Blood Bowl involves rolling a number of six-sided block dice and choosing the result from one of them. The possible results are (1) attacker down, (2) both down, (3) pushed, (4) pushed, (5) defender stumbles, and (6) defender down. In our block attempt we are allowed to roll two dice and choose the result we prefer.

ActionType.USE_REROLL (id = 6)
ActionType.SELECT_PUSH (id = 12)
ActionType.SELECT_DEFENDER_STUMBLES (id = 13)

Current procedure: Block (index = 9)

SELECT_DEFENDER_STUMBLES is a better result because our opponent doesn’t have the “dodge” ability and therefore will count as a defender down. Immediately following this action we enter the Push procedure and choose a square into which we push our opponent. Our blocker pushes his opponent into $(x, y) = (12, 6)$.
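As a rough sanity check on why rolling two dice is favorable here (ignoring skills and assists, which modify the result table), two of the six faces, defender stumbles and defender down, put a dodge-less defender on the ground with no risk to the blocker. Rolling two dice and keeping the better result therefore knocks the defender over with probability

$$1 - \left(\frac{4}{6}\right)^2 = \frac{20}{36} \approx 0.56.$$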

Action #8: FOLLOW_UP – Finally, after every successful push we have the option to move into the opponent’s previous location. This option is presented to us as a FOLLOW_UP action with two possible locations: our current location and the defender’s previous location. We elect to enter the opponent’s square. I indicate that the defender was downed by using a lighter shade of orange.

Now that the blocking player has completed their actions we are back to the Turn procedure where we can choose a new player to perform an action.

ActionType.START_MOVE (id = 27)
	{'x': 15, 'y': 3}
	{'x': 14, 'y': 6}
	{'x': 14, 'y': 7}
	{'x': 14, 'y': 8}
	{'x': 17, 'y': 8}
	{'x': 21, 'y': 8}
	{'x': 14, 'y': 9}
	{'x': 14, 'y': 10}
	{'x': 13, 'y': 12}
	{'x': 15, 'y': 13}
ActionType.START_BLOCK (id = 28)
	{'x': 14, 'y': 7}
	{'x': 14, 'y': 8}
	{'x': 14, 'y': 9}
	{'x': 14, 'y': 10}
... (snip) ...
ActionType.END_TURN (id = 8)

Current procedure: Turn (index = 7)

Observations

Depending on how you count them, the environment’s action space can be quite large. While there are only thirty-seven distinct action types, twenty-two of them require positional input. Given that the standard $17 \times 28$ pitch contains four hundred seventy-six locations, is it possible to discretize our action space into all of these action-position pairs? That is, could our action-value function consist of, say, more than four hundred different MOVE actions, more than four hundred different PASS actions, and so on? If we did, there would be a total of $22 \times 476 + 15 = 10487$ possible actions.

Is 10487 discrete actions considered large? Let’s compare to DeepMind’s AlphaGo [Silver, 2016] and AlphaStar [Vinyals, 2019] algorithms. Go has $19 \times 19 = 361$ board locations, and the only in-game action is the placement of a stone. The introduction to the AlphaStar paper claims there are $10^{26}$ possible actions. No number of GPUs or TPUs could possibly output state-action values for this many actions. Instead, the action space is decomposed into approximately 300 action types and 13 action parameters, where two of these parameters are the $x, y$-coordinates of the playing field. See [Vinyals, 2017] (SC2LE) for a full description of the environment and to get a better sense of how the game is represented.

This idea of action-type/parameter decomposition is also seen in [Hausknecht 2019]. Under this kind of framework, though, we have to re-work the learning algorithms a bit. I’ll dive deeper into some of these things in the next post.

Another key observation is that the one-hot procedure vector is crucial for proper action selection. We are usually limited to a subset of the action space depending on the current procedure. Storing this information as a one-hot vector is natural for whatever model we choose: suppose a unit at some point in the network is a concatenation of the procedure vector with other data from the game. If this unit is fed into a linear layer, then the “one-hottedness” acts like a switch by deactivating some of the weights. We can think of that layer as containing several sub-matrices, one for each of the nineteen procedures.
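Here is a toy illustration of that switching behavior. PyTorch is my choice purely for illustration, not necessarily what this series will settle on:

import torch
import torch.nn as nn

n_procedures = 19   # length of the one-hot procedure vector
n_state = 50        # length of the non-geometric game-state vector

layer = nn.Linear(n_procedures + n_state, 64)

procedure = torch.zeros(n_procedures)
procedure[7] = 1.0              # pretend we are in the Turn procedure (index 7)
state = torch.rand(n_state)     # stand-in for the normalized state features

output = layer(torch.cat([procedure, state]))

# Only one entry of the procedure part is nonzero, so that part of the input
# contributes exactly one column slice of the weight matrix to the output.
# In effect the layer holds nineteen per-procedure "sub-matrices", and the
# one-hot vector switches the appropriate one on.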

Finally, I am considering filtering out some of the state in order to speed up testing. The board layers “block”, “dodge”, “sure hands”, “catch”, and “pass”, while important in the game of Blood Bowl, are very sparse. As a result, the probability that these skills will be learned is very low. As for the game state, features like own/opp fame, own/opp assistant coaches, own/opp babes, and own/opp cheerleaders are high-level features that have relatively little impact on the game.
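A sketch of what that filtering might look like on the board side, again assuming the board component is a mapping from layer names to 2-D arrays (the layer names below are the ones discussed in this post; match them to your FFAI version):

import numpy as np

SPARSE_LAYERS = {'block', 'dodge', 'sure hands', 'catch', 'pass'}

def board_to_tensor(board, drop=SPARSE_LAYERS):
    """Stack the board layers into a (C, H, W) array, skipping the sparse ones."""
    kept = [np.asarray(layer) for name, layer in board.items() if name not in drop]
    return np.stack(kept)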

Unfortunately, due to the complexity of the game and its dependence on position-specific actions, I will need a separate post to explore different reinforcement learning algorithms. In my next post I will take a look at models that work well with parametrized actions, as well as some improved agent strategies beyond the three basic strategies discussed in the previous post.

