Categories
Offsites

Policy Gradient Reinforcement Learning in TensorFlow 2

In a series of recent posts, I have been reviewing the various Q based methods of deep reinforcement learning (see here, here, here, here and so on). Deep Q based reinforcement learning operates by training a neural network to learn the Q value for each action of an agent which resides in a certain state s of the environment. The policy which guides the actions of the agent in this paradigm operates by a random selection of actions at the beginning of training (the epsilon greedy method), but then the agent will select actions based on the highest Q value predicted in each state s. The Q value is simply an estimation of future rewards which will result from taking action a. An alternative to the deep Q based reinforcement learning is to forget about the Q value and instead have the neural network estimate the optimal policy directly. Reinforcement learning methods based on this idea are often called Policy Gradient methods.

This post will review the REINFORCE or Monte-Carlo version of the Policy Gradient methodology. This methodology will be used in the Open AI gym Cartpole environment. All code used and explained in this post can be found on this site’s Github repository.  


Eager to build deep learning systems in TensorFlow 2? Get the book here


Policy Gradients and their theoretical foundation

This section will review the theory of Policy Gradients, and how we can use them to train our neural network for deep reinforcement learning. This section will feature a fair bit of mathematics, but I will try to explain each step and idea carefully for those who aren’t as familiar with the mathematical ideas. We’ll also skip over a step at the end of the analysis for the sake of brevity.  

In Policy Gradient based reinforcement learning, the objective function which we are trying to maximise is the following:

$$J(theta) = mathbb{E}_{pi_theta} left[sum_{t=0}^{T-1} gamma^t r_t right]$$
 
The function above means that we are attempting to find a policy ($pi$) with parameters ($theta$) which maximises the expected value of the sum of the discounted rewards of an agent in an environment. Therefore, we need to find a way of varying the parameters of the policy $theta$ such that the expected value of the discounted rewards are maximised. In the deep reinforcement learning case, the parameters $theta$ are the parameters of the neural network.
 
Note the difference to the deep Q learning case – in deep Q based learning, the parameters we are trying to find are those that minimise the difference between the actual Q values (drawn from experiences) and the Q values predicted by the network. However, in Policy Gradient methods, the neural network directly determines the actions of the agent – usually by using a softmax output and sampling from this. 
 
Also note that, because environments are usually non-deterministic, under any given policy ($pi_theta$) we are not always going to get the same reward. Rather, we are going to be sampling from some probability function as the agent operates in the environment, and therefore we are trying to maximise the expected sum of rewards, not the 100% certain, we-will-get-this-every-time reward sum.  
 
Ok, so we want to learn the optimal $theta$. The way we generally learn parameters in deep learning is by performing some sort of gradient based search of $theta$. So we want to iteratively execute the following:
 
$$theta leftarrow theta + alpha nabla J(theta)$$
 
So the question is, how do we find $nabla J(theta)$?

Finding the Policy Gradient

First, let’s make the expectation a little more explicit. Remember, the expectation of the value of a function $f(x)$ is the summation of all the possible values due to variations in x multiplied by the probability of x, like so:

$$mathbb{E}[f(x)] = sum_x P(x)f(x)$$
 
Ok, so what does the cashing out of the expectation in $J(theta)$ look like? First, we have to define the function which produces the rewards, i.e. the rewards equivalent of $f(x)$ above. Let’s call this $R(tau)$ (where
$R(tau) = sum_{t=0}^{T-1}r_t$, ignoring discounting for the moment).  The value $tau$ is the trajectory of the agent “moving” through the environment. It can be defined as:
 
$$tau = (s_0, a_0, r_0, s_1, a_1, r_1, ldots, s_{T-1}, a_{T-1}, r_{T-1}, s_T)$$
 
The trajectory, as can be seen, is the progress of the agent through an episode of a game of length T. This trajectory is the fundamental factor which determines the sum of the rewards – hence $R(tau)$. This covers the $f(x)$ component of the expectation definition. What about the $P(x)$ component? In this case, it is equivalent to $P(tau)$ – but what does this actually look like in a reinforcement learning environment? 
 
It consists of the two components – the probabilistic policy function which yields an action $a_t$ from states $s_t$ with a certain probability, and a probability that state $s_{t+1}$ will result from taking action $a_t$ from state $s_t$. The latter probabilistic component is uncertain due to the random nature of many environments. These two components operating together will “roll out” the trajectory of the agent $tau$. 
 
$P(tau)$ looks like:
 
$$P(tau) = prod_{t=0}^{T-1} P_{pi_{theta}}(a_t|s_t)P(s_{t+1}|s_t,a_t)$$
 
If we take the first step, starting in state $s_0$ – our neural network will produce a softmax output with each action assigned a certain probability. The action is then selected by weighted random sampling subject to these probabilities – therefore, we have a probability of action $a_0$ being selected according to $P_{pi_{theta}}(a_t|s_t)$. This probability is determined by the policy $pi$ which in turn is parameterised according to $theta$ (i.e. a neural network with weights $theta$.  The next term will be $P(s_1|s_0,a_0)$ which expresses any non-determinism in the environment.
 
(Note: the vertical line in the probability functions above are conditional probabilities. $P_{pi_{theta}}(a_t|s_t)$ refers to the probability of action $a_t$ being selected, given the agent is in state $s_t$). 
 
These probabilities are multiplied out over all the steps in the episode of length T to produce the trajectory $tau$. (Note, the probability of being in the first state, $s_0$, has been excluded from this analysis for simplicity). Now we can substitute $P(tau)$ and $R(tau)$ into the original expectation and take the derivative to get to $nabla J(theta)$ which is what we need to do the gradient based optimisation. However, to get there, we first need to apply a trick or two. 
 
First, let’s take the log derivative of $P(tau)$ with respect to $theta$ i.e. $nabla_theta$ and work out what we get:
 
 
$$nabla_theta log P(tau) = nabla log left(prod_{t=0}^{T-1} P_{pi_{theta}}(a_t|s_t)P(s_{t+1}|s_t,a_t)right) $$
$$ =nabla_theta left[sum_{t=0}^{T-1} (log P_{pi_{theta}}(a_t|s_t) + log P(s_{t+1}|s_t,a_t)) right]$$
$$ =nabla_theta sum_{t=0}^{T-1}log P_{pi_{theta}}(a_t|s_t)$$
 
The reason we are taking the log will be made clear shortly. As can be observed, when the log is taken of the multiplicative operator ($prod$) this is converted to a summation (as multiplying terms within a log function is equivalent to adding them separately). In the final line, it can be seen that taking the derivative with respect to the parameters ($theta$) removes the dynamics of the environment ($log P(s_{t+1}|s_t,a_t))$) as these are independent of the neural network parameters / $theta$. 
 
Let’s go back to our original expectation function, substituting in our new trajectory based functions, and apply the derivative (again ignoring discounting for simplicity):
 
$$J(theta) = mathbb{E}[R(tau)]$$
$$ = smallint P(tau) R(tau)$$
$$nabla_theta J(theta) = nabla_theta smallint P(tau) R(tau)$$
 
So far so good. Now, we are going to utilise the following rule which is sometimes called the “log-derivative” trick:
 
$$frac{nabla_theta p(X,theta)}{p(X, theta)} = nabla_theta log p(X,theta)$$
 
We can then apply the $nabla_{theta}$ operator within the integral, and cajole our equation so that we get the $frac{nabla_{theta} P{tau}}{P(tau)}$ expression like so:
 
$$nabla_theta J(theta)=int P(tau) frac{nabla_theta P(tau)}{P(tau)} R(tau)$$
 
Then, using the log-derivative trick and applying the definition of expectation, we arrive at:
 
$$nabla_theta J(theta)=mathbb{E}left[R(tau) nabla_theta logP(tau)right]$$
 
We can them substitute our previous derivation of $nabla_{theta} log P(tau)$ into the above to arrive at:
 
$$nabla_theta J(theta) =mathbb{E}left[R(tau) nabla_theta sum_{t=0}^{T-1} log P_{pi_{theta}}(a_t|s_t)right]$$
 
This is now close to the point of being something we can work with in our learning algorithm. Let’s take it one step further by recognising that, during our learning process, we are randomly sampling trajectories from the environment, and hoping to make informed training steps. Therefore, we can recognise that, to maximise the expectation above, we need to maximise it with respect to its argument i.e. we maximise:
 
$$nabla_theta J(theta) sim R(tau) nabla_theta sum_{t=0}^{T-1} log P_{pi_{theta}}(a_t|s_t)$$
 
Recall that $R(tau)$ is equal to $R(tau) = sum_{t=0}^{T-1}r_t$ (ignoring discounting). Therefore, we have two summations that need to be multiplied out, element by element. It turns out that after doing this, we arrive at an expression like so:
 
$$nabla_theta J(theta) sim left(sum_{t=0}^{T-1} log P_{pi_{theta}}(a_t|r_t)right)left(sum_{t’= t + 1}^{T} gamma^{t’-t-1} r_{t’} right)$$
 
As can be observed, there are two main components that need to be multiplied. However, one should note the differences in the bounds of the summation terms in the equation above – these will be explained in the next section.
 

Calculating the Policy Gradient

The way we compute the gradient as expressed above in the REINFORCE method of the Policy Gradient algorithm involves sampling trajectories through the environment to estimate the expectation, as discussed previously. This REINFORCE method is therefore a kind of Monte-Carlo algorithm. Let’s consider this a bit more concretely. 
 
Let’s say we initialise the agent and let it play a trajectory $tau$ through the environment. The actions of the agent will be selected by performing weighted sampling from the softmax output of the neural network – in other words, we’ll be sampling the action according to $P_{pi_{theta}}(a_t|r_t)$. At each step in the trajectory, we can easily calculate $log P_{pi_{theta}}(a_t|r_t)$ by simply taking the log of the softmax output values from the neural network. So, for the first step in the trajectory, the neural network would take the initial states $s_0$ as input, and it would produce a vector of actions $a_0$ with pseudo-probabilities generated by the softmax operation in the final layer. 
 
What about the second part of the $nabla_theta J(theta)$ equation – $sum_{t’= t + 1}^{T} gamma^{t’-t-1} r_{t’}$? We can see that the summation term starts at $t’ = t + 1 = 1$. The summation then goes from t=1 to the total length of the trajectory T – in other words, from t=1 to the total length of the episode. Let’s say the episode length was 4 states long – this term would then look like $gamma^0 r_1 + gamma^1 r_2 + gamma^2 r_3$, where $gamma$ is the discounting factor and is < 1. 
 
Straight-forward enough. However, you may have realised that, in order to calculate the gradient $nabla_theta J(theta)$ at the first step in the trajectory/episode, we need to know the reward values of all subsequent states in the episode. Therefore, in order to execute this method of learning, we can only take gradient learning steps after the full episode has been played to completion. Only after the episode is complete can we perform the training step. 
 
We are almost ready to move onto the code part of this tutorial. However, this is a good place for a quick discussion about how we would actually implement the calculations $nabla_theta J(theta)$ equation in TensorFlow 2 / Keras. It turns out we can just use the standard cross entropy loss function to execute these calculations. Recall that cross entropy is defined as (for a deeper explanation of entropy, cross entropy, information and KL divergence, see this post):
 
$$CE = -sum p(x) log(q(x))$$
 
Which is just the summation between one function $p(x)$ multiplied by the log of another function $q(x)$ over the possible values of the argument x. If we look at the source code of the Keras implementation of cross-entropy, we can see the following:
 
Keras output of cross-entropy loss function

Keras output of cross-entropy loss function

The output tensor here is simply the softmax output of the neural network, which, for our purposes, will be a tensor of size (num_steps_in_episode, num_actions). Note that the log of output is calculated in the above. The target value, for our purposes, can be all the discounted rewards calculated at each step in the trajectory, and will be of size (num_steps_in_episode, 1). The summation of the multiplication of these terms is then calculated (reduce_sum). Gradient based training in TensorFlow 2 is generally a minimisation of the loss function, however, we want to maximise the calculation as discussed above. The good thing is, the sign of cross entropy calculation shown above is inverted – so we are good to go. 

To call this training step utilising Keras, all we have to do is execute something like the following:

network.train_on_batch(states, discounted_rewards)

Here, we supply all the states gathered over the length of the episode, and the discounted rewards at each of those steps. The Keras backend will pass the states through network, apply the softmax function, and this will become the output variable in the Keras source code snippet above. Likewise, discounted_rewards is the same as target in the source code snippet above. 

Now that we have covered all the pre-requisite knowledge required to build a REINFORCE-type method of Policy Gradient reinforcement learning, let’s have a look at how this can be coded and applied to the Cartpole environment.

Policy Gradient reinforcement learning in TensorFlow 2 and Keras

In this section, I will detail how to code a Policy Gradient reinforcement learning algorithm in TensorFlow 2 applied to the Cartpole environment. As always, the code for this tutorial can be found on this site’s Github repository.

First, we define the network which we will use to produce $P_{pi_{theta}}(a_t|r_t)$ with the state as the input:

GAMMA = 0.95

env = gym.make("CartPole-v0")
state_size = 4
num_actions = env.action_space.n

network = keras.Sequential([
    keras.layers.Dense(30, activation='relu', kernel_initializer=keras.initializers.he_normal()),
    keras.layers.Dense(30, activation='relu', kernel_initializer=keras.initializers.he_normal()),
    keras.layers.Dense(num_actions, activation='softmax')
])
network.compile(loss='categorical_crossentropy',optimizer=keras.optimizers.Adam())

As can be observed, first the environment is initialised. Next, the network is defined using the Keras Sequential API. The network consists of 3 densely connected layers. The first 2 layers have ReLU activations, and the final layer has a softmax activation to produce the pseudo-probabilities to approximate $P_{pi_{theta}}(a_t|r_t)$. Finally, the network is compiled with a cross entropy loss function and an Adam optimiser. 

The next part of the code chooses the action from the output of the model:

def get_action(network, state, num_actions):
    softmax_out = network(state.reshape((1, -1)))
    selected_action = np.random.choice(num_actions, p=softmax_out.numpy()[0])
    return selected_action

As can be seen, first the softmax output is extracted from the network by inputing the current state. The action is then selected by making a random choice from the number of possible actions, with the probabilities weighted according to the softmax values. 

The next function is the main function involved in executing the training step:

def update_network(network, rewards, states, actions, num_actions):
    reward_sum = 0
    discounted_rewards = []
    for reward in rewards[::-1]:  # reverse buffer r
        reward_sum = reward + GAMMA * reward_sum
        discounted_rewards.append(reward_sum)
    discounted_rewards.reverse()
    discounted_rewards = np.array(discounted_rewards)
    # standardise the rewards
    discounted_rewards -= np.mean(discounted_rewards)
    discounted_rewards /= np.std(discounted_rewards)
    states = np.vstack(states)
    loss = network.train_on_batch(states, discounted_rewards)
    return loss

First, the discounted rewards list is created: this is a list where each element corresponds to the summation from t + 1 to T according to $sum_{t’= t + 1}^{T} gamma^{t’-t-1} r_{t’}$. The input argument rewards is a list of all the rewards achieved at each step in the episode. The rewards[::-1] operation reverses the order of the rewards list, so the first run through the for loop will deal with last reward recorded in the episode. As can be observed, a reward sum is accumulated each time the for loop is executed. Let’s say that the episode length is equal to 4 – $r_3$ will refer to the last reward recorded in the episode. In this case, the discounted_rewards list would look like:

[$r_3$, $r_2 + gamma r_3$, $r_1 + gamma r_2 + gamma^2 r_3$, $r_0 + gamma r_1 + gamma^2 r_2 + gamma^3 r_3$]

 

This list is in reverse to the order of the actual state value list (i.e. [$s_0$, $s_1$, $s_2$, $s_3$]), so the next line after the for loop reverses the list (discounted_rewards.reverse()). 

Next, the list is converted into a numpy array, and the rewards are normalised to reduce the variance in the training. Finally, the states list is stacked into a numpy array and both this array and the discounted rewards array are passed to the Keras train_on_batch function, which was detailed earlier. 

The next part of the code is the main episode and training loop:

num_episodes = 10000000
train_writer = tf.summary.create_file_writer(STORE_PATH + f"/PGCartPole_{dt.datetime.now().strftime('%d%m%Y%H%M')}")
for episode in range(num_episodes):
    state = env.reset()
    rewards = []
    states = []
    actions = []
    while True:
        action = get_action(network, state, num_actions)
        new_state, reward, done, _ = env.step(action)
        states.append(state)
        rewards.append(reward)
        actions.append(action)

        if done:
            loss = update_network(network, rewards, states, actions, num_actions)
            tot_reward = sum(rewards)
            print(f"Episode: {episode}, Reward: {tot_reward}, avg loss: {loss:.5f}")
            with train_writer.as_default():
                tf.summary.scalar('reward', tot_reward, step=episode)
                tf.summary.scalar('avg loss', loss, step=episode)
            break

        state = new_state

As can be observed, at the beginning of each episode, three lists are created which will contain the state, reward and action values for each step in the episode / trajectory. These lists are appended to until the done flag is returned from the environment signifying that the episode is complete. At the end of the episode, the training step is performed on the network by running update_network. Finally, the rewards and loss are logged in the train_writer for viewing in TensorBoard. 

The training results can be observed below:

Training progress of Policy Gradient RL on Cartpole environment

Training progress of Policy Gradient RL in Cartpole environment

As can be observed, the rewards steadily progress until they “top out” at the maximum possible reward summation for the Cartpole environment, which is equal to 200. However, the user can verify that repeated runs of this version of Policy Gradient training has a high variance in its outcomes. Therefore, improvements in the Policy Gradient REINFORCE algorithm are required and available – these improvements will be detailed in future posts.

 


Eager to build deep learning systems in TensorFlow 2? Get the book here


 

 

The post Policy Gradient Reinforcement Learning in TensorFlow 2 appeared first on Adventures in Machine Learning.

Categories
Offsites

Atari Space Invaders and Dueling Q RL in TensorFlow 2

In previous posts (here and here) I introduced Double Q learning and the Dueling Q architecture. These followed on from posts about deep Q learning, and showed how double Q and dueling Q learning is superior to vanilla deep Q learning. However, these posts only included examples of simplistic environments like the OpenAI Cartpole environment. These types of environments are good to learn on, but more complicated environments are both more interesting and fun. They also demonstrate better the complexities of implementing deep reinforcement learning in realistic cases. In this post, I’ll use similar code to that shown in my Dueling Q TensorFlow 2 but in this case apply it to the Open AI Atari Space Invaders environment. All code for this post can be found on this site’s Github repository. Also, as mentioned in the title, the example code for this post is written using TensorFlow 2. TensorFlow 2 is now released and installation instructions can be found here.


Eager to build deep learning systems in TensorFlow 2? Get the book here


Double and Dueling Q learning recap

Double Q recap

Double Q learning was created to address two problems with vanilla deep Q learning. These are:

  1. Using the same network to both choose the best action and evaluate the quality of that action is a source of feedback / learning instability.
  2. The max function used in calculating the target Q value (see formula below), which the neural network is to learn, tends to bias the network towards high, noisy, rewards. This again hampers learning and makes it more erratic

The problematic Bellman equation is shown below: $$Q_{target} = r_{t+1} + gamma max_{{a}}Q(s_{t+1}, a;theta_t)$$ The Double Q solution to the two problems above involves creating another target network, which is initially created with weights equal to the primary network. However, during training the primary network and the target network are allowed to “drift” apart. The primary network is trained as per usual, but the target network is not. Instead, the target network weights are either periodically (but not frequently) set equal to the primary network weights, or they are only gradually “blended” with the primary network in a weighted average fashion. The benefit then comes from the fact that in Double Q learning, the Q value of the best action in the next state ($s_{t + 1}$) is extracted from the target network, not the primary network. The primary network is still used to evaluate what the best action will be, a*, by taking an argmax of the outputs from the primary network, but the Q value for this action is evaluated from the target network. This can be observed in the formulation below: $$a* = argmax Q(s_{t+1}, a; theta_t)$$ $$Q_{target} = r_{t+1} + gamma Q(s_{t+1}, a*; theta^-_t)$$ Notice the different weights involved in the formulas above – the best action, a*, is calculated from the network with $theta_t$ weights – this is the primary network weights. However the $Q_{target}$ calculation uses the target network, with weights $theta^-_t$, to estimate the Q value for this chosen action. This Double Q methodology decouples the choosing of an action from the evaluation of the Q value of such an action. This provides more stability to the learning – for more details and a demonstration of the superiority of the Double Q methodology over vanilla Deep Q learning, see this post.

Dueling Q recap

The Dueling Q architecture, discussed in detail in this post, is an improvement to the Double Q network. It uses the same methodology of a target and a primary network, with periodic updates or blending of the target network weights to the primary network weights. However, it builds two important concepts into the architecture of the network. These are the advantage and value functions:

  • Advantage function A(s, a): The advantage function is the relative benefit of choosing a certain action in state s over the other possible actions in state s
  • Value function V(s): The value function is the value of being in state s, independent of the relative benefits of the actions within that state

The Q function is the simple addition of these two functions: $$Q(s, a) = V(s) + A(s, a)$$ The motivation of splitting these two functions explicitly in the architecture is that there can be inherently good or bad states for the agent to be in, regardless of the relative benefit of any actions within that state. For instance, in a certain state, all actions may lead to the agent “dying” in a game – this is an inherently bad state to be in, and there is no need to waste computational resources trying to determine the best action in this state. The converse can also be true. Ideally, this “splitting” into the advantage function and value function should be learnt implicitly during training. However, the Dueling Q architecture makes this split explicit, which acts to improve training. The Dueling Q architecture can be observed in the figure below:  

Dueling Q architecture

Dueling Q architecture

It can be observed that in the Dueling Q architecture, there are common Convolutional Neural Network layers which perform image processing. The output from these layers is then flattened and the network then bifurcates into a Value function stream V(s) and an Advantage function stream A(s, a). The output of these separate streams are then aggregated in a special layer, before finally outputting Q values from the network. The aggregation layer does not perform a simple addition of the Value and Advantage streams – this would result in problems of identifiability (for more details on this, see the original Dueling Q post). Instead, the following aggregation function is performed: $$Q(s,a) = V(s) + A(s,a) – frac{1}{|a|}sum_{a’}A(s,a’)$$ In this post, I’ll demonstrate how to use the Dueling Q architecture to train an agent in TensorFlow 2 to play Atari Space Invaders. However, in this post I will concentrate on the extra considerations required to train the agent via an image stream from an Atari game. For more extra details, again, refer to the original Dueling Q post.

Considerations for training in an Atari environment

Training reinforcement learning agents on Atari environments is hard – it can be a very time consuming process as the environment complexity is high, especially when the agent needs to visually interpret objects direct from images. As such, each environment needs to be considered to determine legitimate ways of reducing the training burden and improving the performance. Three methods will be used in this post:

  1. Converting images to greyscale
  2. Reducing the image size
  3. Stacking frames

Converting Atari images to greyscale and reducing the image size

The first, relatively easy, step in reducing the computational training burden is to convert all the incoming Atari images from depth-3 RGB colour images to depth-1 greyscale images. This reduces the number of input CNN filters required in the first layer by 3. Another step which can be performed to reduce the size of the input CNN filters is to resize the image inputs to make them smaller. There is obviously a limit in the reduction of the image sizes before learning performance is affected, however, in this case, a halving of the image size by rescaling is possible without affecting performance too much. The original image sizes from the Atari Space Invaders game are (210, 160, 3) – after converting to greyscale and resizing by half, the new image size is (105, 80, 1). Both of these operations are easy enough to implement in TensorFlow 2:

def image_preprocess(image, new_size=(105, 80)):
    # convert to greyscale, resize and normalize the image
    image = tf.image.rgb_to_grayscale(image)
    image = tf.image.resize(image, new_size)
    image = image / 255
    return image

Stacking image frames

The next step that is commonly performed when training agents on Atari games is the practice of stacking image frames, and feeding all these frames into the input CNN layers. The purpose of this is to allow the neural network to get some sense of direction of the objects moving within the image. Consider a single, static image – examining such an image on its own will give no information about which direction any of the objects moving within this image are travelling (or their respective speeds). Therefore, for each sample fed into the neural network, a stack of frames is presented to the input – this gives the neural network both time and spatial information to work with. The input dimension to the network are not, then, of size (105, 80, 1) but rather (105, 80, NUM_FRAMES). In this case, we’ll use 3 frames to feed into the network i.e. NUM_FRAMES = 3. The specifics of how these stacked frames are stored, extracted and updated will be revealed as we step through the code in the next section. Additional steps can be taken to improve performance in complex Atari environment and similar cases. These include the skipping of frames and prioritised experience replay (PER). However, these have not been implemented in this example. A future post will discuss the benefits of PER and how to implement it.

Atari Space Invaders TensorFlow 2 implementation

The section below details the TensorFlow 2 implementation of training an agent on the Atari Space Invaders environment. In this post, comprehensive details of the Dueling Q architecture and training implementation will not be given – for a step by step discussion on these details, see my Dueling Q introductory post. However, detailed information will be given about the specific new steps required to train in the Atari environment. As stated at the beginning of the post, all code can be found on this site’s Github repository.

Model definition

First we define the Double/Dueling Q model class with its structure:

env = gym.make("SpaceInvaders-v0")
num_actions = env.action_space.n


class DQModel(keras.Model):
    def __init__(self, hidden_size: int, num_actions: int, dueling: bool):
        super(DQModel, self).__init__()
        self.dueling = dueling
        self.conv1 = keras.layers.Conv2D(16, (8, 8), (4, 4), activation='relu')
        self.conv2 = keras.layers.Conv2D(32, (4, 4), (2, 2), activation='relu')
        self.flatten = keras.layers.Flatten()
        self.adv_dense = keras.layers.Dense(hidden_size, activation='relu',
                                         kernel_initializer=keras.initializers.he_normal())
        self.adv_out = keras.layers.Dense(num_actions,
                                          kernel_initializer=keras.initializers.he_normal())
        if dueling:
            self.v_dense = keras.layers.Dense(hidden_size, activation='relu',
                                         kernel_initializer=keras.initializers.he_normal())
            self.v_out = keras.layers.Dense(1, kernel_initializer=keras.initializers.he_normal())
            self.lambda_layer = keras.layers.Lambda(lambda x: x - tf.reduce_mean(x))
            self.combine = keras.layers.Add()

    def call(self, input):
        x = self.conv1(input)
        x = self.conv2(x)
        x = self.flatten(x)
        adv = self.adv_dense(x)
        adv = self.adv_out(adv)
        if self.dueling:
            v = self.v_dense(x)
            v = self.v_out(v)
            norm_adv = self.lambda_layer(adv)
            combined = self.combine([v, norm_adv])
            return combined
        return adv

primary_network = DQModel(256, num_actions, True)
target_network = DQModel(256, num_actions, True)
primary_network.compile(optimizer=keras.optimizers.Adam(), loss='mse')
# make target_network = primary_network
for t, e in zip(target_network.trainable_variables, primary_network.trainable_variables):
    t.assign(e)

primary_network.compile(optimizer=keras.optimizers.Adam(), loss=tf.keras.losses.Huber())

In the code above, first the Space Invaders environment is created. After this, the DQModel class is defined as a keras.Model base class. In this model, you can observe that first a number of convolutional layers are created, then a flatten layer and dedicated fully connected layers to enact the value and advantage streams. This structure is then implemented in the model call function. After this model class has been defined, two versions of it are implemented corresponding to the primary_network and the target_network – as discussed above, both of these will be utilised in the Double Q component of the learning. The target_network weights are then set to be initially equal to the primary_network weights. Finally the primary_network is compiled for training using an Adam optimizer and a Huber loss function. As stated previously, for more details see this post.

The Memory class

Next we will look at the Memory class, which is to hold all the previous experiences of the agent. This class is a little more complicated in the Atari environment case, due to the necessity of dealing with stacked frames:

class Memory:
    def __init__(self, max_memory):
        self._max_memory = max_memory
        self._actions = np.zeros(max_memory, dtype=np.int32)
        self._rewards = np.zeros(max_memory, dtype=np.float32)
        self._frames = np.zeros((POST_PROCESS_IMAGE_SIZE[0], POST_PROCESS_IMAGE_SIZE[1], max_memory), dtype=np.float32)
        self._terminal = np.zeros(max_memory, dtype=np.bool)
        self._i = 0

In the class __init__ function, it can be observed that all the various memory buffers (for actions, rewards etc.) are initialized according to max_memory at the get-go. This is in opposition to a memory approach which involves appending to lists. This is performed so that it can be determined whether there will be a memory problem during training from the very beginning (as opposed to the code falling over after you’ve already been running it for 3 days!). It also increases the efficiency of the memory allocation process (as appending / growing memory dynamically is an inefficient process). You’ll also observe the creation of a counter variable, self._i. This is to record the present location of stored samples in the memory buffer, and will ensure that the memory is not overflowed. The next function within the class shows how samples are stored within the class:

def add_sample(self, frame, action, reward, terminal):
    self._actions[self._i] = action
    self._rewards[self._i] = reward
    self._frames[:, :, self._i] = frame[:, :, 0]
    self._terminal[self._i] = terminal
    if self._i % (self._max_memory - 1) == 0 and self._i != 0:
        self._i = BATCH_SIZE + NUM_FRAMES + 1
    else:
        self._i += 1

As will be shown shortly, for every step in the Atari environment, the current image frame, the action taken, the reward received and whether the state is terminal (i.e. the agent ran out of lives and the game ends) is stored in memory. Notice that nothing special as yet is being done with the stored frames – they are simply stored in order as the game progresses. The frame stacking process occurs during the sample extraction method to be covered next. One thing to notice is that once self._i reaches max_memory the index is reset back to the beginning of the memory buffer (but offset by the batch size and the number of frames). This reset means that, once the memory buffer reaches it’s maximum size, it will begin to overwrite the older samples. The next method in the class governs how random sampling from the memory buffer occurs:

def sample(self):
    if self._i < BATCH_SIZE + NUM_FRAMES + 1:
        raise ValueError("Not enough memory to extract a batch")
    else:
        rand_idxs = np.random.randint(NUM_FRAMES + 1, self._i, size=BATCH_SIZE)
        states = np.zeros((BATCH_SIZE, POST_PROCESS_IMAGE_SIZE[0], POST_PROCESS_IMAGE_SIZE[1], NUM_FRAMES),
                         dtype=np.float32)
        next_states = np.zeros((BATCH_SIZE, POST_PROCESS_IMAGE_SIZE[0], POST_PROCESS_IMAGE_SIZE[1], NUM_FRAMES),
                         dtype=np.float32)
        for i, idx in enumerate(rand_idxs):
            states[i] = self._frames[:, :, idx - 1 - NUM_FRAMES:idx - 1]
            next_states[i] = self._frames[:, :, idx - NUM_FRAMES:idx]
        return states, self._actions[rand_idxs], self._rewards[rand_idxs], next_states, self._terminal[rand_idxs]

First, a simple check is performed to ensure there are enough samples in the memory to actually extract a batch. If so, a set of random indices rand_idxs is selected. These random integers are selected from a range with a lower bound of NUM_FRAMES + 1 and an upper bound of self._i. In other words, it is possible to select any indices from the start of the memory buffer to the current filled location of the buffer – however, because NUM_FRAMES of images prior to the selected indices is extracted, indices less than NUM_FRAMES are not allowed. The number of random indices selected is equal to the batch size.

Next, some numpy arrays are initialised which will hold the current states and the next states – in this example, these are of size (32, 105, 80, 3) where 3 is the number of frames to be stacked (NUM_FRAMES). A loop is then entered into for each of the randomly selected memory indices. As can be observed, the states batch row is populated by the stored frames ranging from idx – 1 – NUM_FRAMES to idx – 1. In other words, it is the 3 frames including and prior to the randomly selected index idx – 1. Alternatively, the batch row for next_states is the 3 frames including and prior to the randomly selected index idx (think of a window of 3 frames shifted along by 1 position). These variables states and next_states are then returned from this function, along with the corresponding actions, rewards and terminal flags. The terminal flags communicate whether the game finished for during the randomly selected states. Finally, the memory class is instantiated with the memory size as the argument:

memory = Memory(200000)

The memory size should ideally be as large as possible, but considerations must be given to the amount of memory available on whatever computing platform is being used to run the training.

Miscellaneous functions

The following two functions are standard functions to choose the actions and update the target network:

def choose_action(state, primary_network, eps, step):
    if step < DELAY_TRAINING:
        return random.randint(0, num_actions - 1)
    else:
        if random.random() < eps:
            return random.randint(0, num_actions - 1)
        else:
            return np.argmax(primary_network(tf.reshape(state, (1, POST_PROCESS_IMAGE_SIZE[0],
                                                           POST_PROCESS_IMAGE_SIZE[1], NUM_FRAMES)).numpy()))


def update_network(primary_network, target_network):
    # update target network parameters slowly from primary network
    for t, e in zip(target_network.trainable_variables, primary_network.trainable_variables):
        t.assign(t * (1 - TAU) + e * TAU)

The choose_action function performs the  epsilon-greedy action selection policy, where a random action is selected if a random value falls below eps, otherwise it is selected by choosing the action with the highest Q value from the network. The update_network function slowly shifts the target network weights towards the primary network weights in accordance with the Double Q learning methodology. The next function deals with the “state stack” which is an array which holds the last NUM_FRAMES of the episode:

def process_state_stack(state_stack, state):
    for i in range(1, state_stack.shape[-1]):
        state_stack[:, :, i - 1].assign(state_stack[:, :, i])
    state_stack[:, :, -1].assign(state[:, :, 0])
    return state_stack

This function takes the existing state stack array, and the newest state to be added. It then shuffles all the existing frames within the state stack “back” one position. In other words, the most recent state, in this case, sitting in row 2 of the state stack, if shuffled back to row 1. The frame / state in row 1 is shuffled to row 0. Finally, the most recent state or frame is stored in the newly vacated row 2 of the state stack. The state stack is required so that it can be fed into the neural network in order to choose actions, and its updating can be observed in the main training loop, as will be reviewed shortly.

The Dueling Q / Double Q training function

Next up is the training function:

def train(primary_network, memory, target_network=None):
    states, actions, rewards, next_states, terminal = memory.sample()
    # predict Q(s,a) given the batch of states
    prim_qt = primary_network(states)
    # predict Q(s',a') from the evaluation network
    prim_qtp1 = primary_network(next_states)
    # copy the prim_qt tensor into the target_q tensor - we then will update one index corresponding to the max action
    target_q = prim_qt.numpy()
    updates = rewards
    valid_idxs = terminal != True
    batch_idxs = np.arange(BATCH_SIZE)
    if target_network is None:
        updates[valid_idxs] += GAMMA * np.amax(prim_qtp1.numpy()[valid_idxs, :], axis=1)
    else:
        prim_action_tp1 = np.argmax(prim_qtp1.numpy(), axis=1)
        q_from_target = target_network(next_states)
        updates[valid_idxs] += GAMMA * q_from_target.numpy()[batch_idxs[valid_idxs], prim_action_tp1[valid_idxs]]
    target_q[batch_idxs, actions] = updates
    loss = primary_network.train_on_batch(states, target_q)
    return loss

This train function is very similar to the train function reviewed in my first Dueling Q tutorial. Essentially, it first extracts batches of data from the memory buffer. Next the Q values from the current state (states) and the following states (next_states) are extracted from the primary network – these values are returned in prim_qt and prim_qtp1 respectively (where qtp1 refers to the Q values for the time t + 1). Next, the target Q values are initialized from the prim_qt values. After this, the updates variable is created – this holds the target Q values for the actions. These target values will be the Q values which the network will “step towards” during the optimization step – hence the name “target” Q values. 

The variable valid_idxs specifies those indices which don’t include terminal states – obviously for terminal states (states where the game ended), there are no future rewards to discount from, so the target value for these states is the rewards value. For other states, which do have future rewards, these need to be discounted and added to the current reward for the target Q values. If no target_network is provided, it is assumed vanilla Q learning should be used to provide the discounted target Q values. If not, double Q learning is implemented.

According to that methodology, first the a* actions are selected which are those actions with the highest Q values in the next state (t + 1). These actions are taken from the primary network, using the numpy argmax function. Next, the Q values from the target network are extracted from the next state (t + 1). Finally, the updates value is incremented for valid indices by adding the discounted future Q values from the target network, for the actions a* selected from the primary network. Finally, the network is trained using the Keras train_on_batch function.

The main Atari training loop

Now it is time to review the main training loop:

num_episodes = 1000000
eps = MAX_EPSILON
render = False
train_writer = tf.summary.create_file_writer(STORE_PATH + f"/DuelingQSI_{dt.datetime.now().strftime('%d%m%Y%H%M')}")
double_q = True
steps = 0
for i in range(num_episodes):
    state = env.reset()
    state = image_preprocess(state)
    state_stack = tf.Variable(np.repeat(state.numpy(), NUM_FRAMES).reshape((POST_PROCESS_IMAGE_SIZE[0],
                                                                            POST_PROCESS_IMAGE_SIZE[1],
                                                                            NUM_FRAMES)))
    cnt = 1
    avg_loss = 0
    tot_reward = 0
    if i % GIF_RECORDING_FREQ == 0:
        frame_list = []
    while True:
        if render:
            env.render()
        action = choose_action(state_stack, primary_network, eps, steps)
        next_state, reward, done, info = env.step(action)
        tot_reward += reward
        if i % GIF_RECORDING_FREQ == 0:
            frame_list.append(tf.cast(tf.image.resize(next_state, (480, 320)), tf.uint8).numpy())
        next_state = image_preprocess(next_state)
        state_stack = process_state_stack(state_stack, next_state)
        # store in memory
        memory.add_sample(next_state, action, reward, done)

        if steps > DELAY_TRAINING:
            loss = train(primary_network, memory, target_network if double_q else None)
            update_network(primary_network, target_network)
        else:
            loss = -1
        avg_loss += loss

        # linearly decay the eps value
        if steps > DELAY_TRAINING:
            eps = MAX_EPSILON - ((steps - DELAY_TRAINING) / EPSILON_MIN_ITER) * 
                  (MAX_EPSILON - MIN_EPSILON) if steps < EPSILON_MIN_ITER else 
                MIN_EPSILON
        steps += 1

        if done:
            if steps > DELAY_TRAINING:
                avg_loss /= cnt
                print(f"Episode: {i}, Reward: {tot_reward}, avg loss: {avg_loss:.5f}, eps: {eps:.3f}")
                with train_writer.as_default():
                    tf.summary.scalar('reward', tot_reward, step=i)
                    tf.summary.scalar('avg loss', avg_loss, step=i)
            else:
                print(f"Pre-training...Episode: {i}")
            if i % GIF_RECORDING_FREQ == 0:
                record_gif(frame_list, i)
            break

        cnt += 1

This training loop is very similar to the training loop in my Dueling Q tutorial, so for a detailed review, please see that post. The main differences relate to how the frame stacking is handled. First, you’ll notice at the start of the loop that the environment is reset, and the first state / image is extracted. This state or image is pre-processed and then repeated NUM_FRAMES times and reshaped to create the first state or frame stack, of size (105, 80, 3) in this example. Another point to note is that a gif recording function has been created which is called every GIF_RECORDING_FREQ episodes. This function involves simply outputting every frame to a gif so that the training progress can be monitored by observing actual gameplay. As such, there is a frame list which is filled whenever each GIF_RECORDING_FREQ episode comes around, and this frame list is passed to the gif recording function. Check out the code for this tutorial for more details. Finally, it can be observed that after every state, the state stack is processed by shuffling along each recorded frame / state in that stack. 

Space Invader Atari training results

The image below shows how the training progresses through each episode with respect to the total reward received for each episode:    

Atari Space Invaders - Dueling Q training reward

Atari Space Invaders – Dueling Q training reward

As can be observed from the plot above, the reward steadily increases over 1500 episodes of game play. Note – if you wish to replicate this training on your own, you will need GPU processing support in order to reduce the training timeframes to a reasonable level. In this case, I utilised the Google Cloud Compute Engine and a single GPU. The gifs below show the progress of the agent in gameplay between episode 50 and episode 1450:

Atari Space Invaders - gameplay episode 50

Atari Space Invaders – gameplay episode 50

 

Atari Space Invaders - gameplay episode 1450

Atari Space Invaders – gameplay episode 1450

As can be observed, after 50 epsiodes the agent still moves around randomly and is quickly killed, achieving a score of only 60 points. However, after 1450 episodes, the agent can be seen to be playing the game much more effectively, even having learnt to destroy the occasional purple “master ship” flying overhead to gain extra points. 

This post has demonstrated how to effectively train agents to operate in Atari environments such as Space Invaders. In particular it has demonstrated how to use the Dueling Q reinforcement learning algorithm to train the agent. A future post will demonstrate how to make the training even more efficient using the Prioritised Experience Replay (PER) approach. 


Eager to build deep learning systems in TensorFlow 2? Get the book here

The post Atari Space Invaders and Dueling Q RL in TensorFlow 2 appeared first on Adventures in Machine Learning.