Categories: Misc

Stuttgart Supercomputing Center Shifts into AI Gear

Stuttgart’s supercomputer center has been cruising down the autobahn of high performance computing like a well-torqued coupe, and now it’s making a pitstop for some AI fuel. Germany’s High-Performance Computing Center Stuttgart (HLRS), one of Europe’s largest supercomputing centers, has tripled the size of its staff and increased its revenues from industry collaborations 20x since… Read article >

The post Stuttgart Supercomputing Center Shifts into AI Gear appeared first on The Official NVIDIA Blog.

Categories: Misc

[Question] Sampling Images from a normal distribution for VAE

Hello there,

I am currently working on a VAE using tensorflow-probability. I would like to later train it on celeb_a, but right now I am using mnist to test everything.

My model looks like this, inspired by this example:
```
prior = tfd.Independent(tfd.Normal(loc=tf.zeros(encoded_size), scale=1),
                        reinterpreted_batch_ndims=1)

inputs = tfk.Input(shape=input_shape)
x = tfkl.Lambda(lambda x: tf.cast(x, tf.float32) - 0.5)(inputs)
x = tfkl.Conv2D(base_depth, 5, strides=1, padding='same', activation=tf.nn.leaky_relu)(x)
x = tfkl.Conv2D(base_depth, 5, strides=2, padding='same', activation=tf.nn.leaky_relu)(x)
x = tfkl.Conv2D(2 * base_depth, 5, strides=1, padding='same', activation=tf.nn.leaky_relu)(x)
x = tfkl.Conv2D(2 * base_depth, 5, strides=2, padding='same', activation=tf.nn.leaky_relu)(x)
x = tfkl.Conv2D(4 * encoded_size, 7, strides=1, padding='valid', activation=tf.nn.leaky_relu)(x)
x = tfkl.Flatten()(x)
x = tfkl.Dense(tfpl.IndependentNormal.params_size(encoded_size))(x)
x = tfpl.IndependentNormal(encoded_size,
                           activity_regularizer=tfpl.KLDivergenceRegularizer(prior))(x)

encoder = tfk.Model(inputs, x, name='encoder')
encoder.summary()

inputs = tfk.Input(shape=(encoded_size,))
x = tfkl.Reshape([1, 1, encoded_size])(inputs)
x = tfkl.Conv2DTranspose(2 * base_depth, 7, strides=1, padding='valid', activation=tf.nn.leaky_relu)(x)
x = tfkl.Conv2DTranspose(2 * base_depth, 5, strides=1, padding='same', activation=tf.nn.leaky_relu)(x)
x = tfkl.Conv2DTranspose(2 * base_depth, 5, strides=2, padding='same', activation=tf.nn.leaky_relu)(x)
x = tfkl.Conv2DTranspose(base_depth, 5, strides=1, padding='same', activation=tf.nn.leaky_relu)(x)
x = tfkl.Conv2DTranspose(base_depth, 5, strides=2, padding='same', activation=tf.nn.leaky_relu)(x)
x = tfkl.Conv2DTranspose(base_depth, 5, strides=1, padding='same', activation=tf.nn.leaky_relu)(x)
mu = tfkl.Conv2D(filters=1, kernel_size=5, strides=1, padding='same', activation=None)(x)
mu = tfkl.Flatten()(mu)
sigma = tfkl.Conv2D(filters=1, kernel_size=5, strides=1, padding='same', activation=None)(x)
sigma = tf.exp(sigma)
sigma = tfkl.Flatten()(sigma)
x = tf.concat((mu, sigma), axis=1)
x = tfkl.LeakyReLU()(x)
x = tfpl.IndependentNormal(input_shape)(x)

decoder = tfk.Model(inputs, x)
decoder.summary()

negloglik = lambda x, rv_x: -rv_x.log_prob(x)

# NOTE: the vae model itself is not defined in the post; presumably something like
# vae = tfk.Model(inputs=encoder.inputs, outputs=decoder(encoder.outputs[0]))
vae.compile(optimizer=tf.optimizers.Adam(learning_rate=1e-4),
            loss=negloglik)

# mnist_digits are normalized between 0.0 and 1.0
history = vae.fit(mnist_digits, mnist_digits, epochs=100, batch_size=300)
```

My problem here is that the loss function stops decreasing at around ~470, and the images sampled from the returned distribution look like random noise. When using a Bernoulli distribution instead of the normal distribution in the decoder, the loss steadily decreases and the sampled images look like they should. I can’t use a Bernoulli distribution for RGB though, which I have to when I want to train the model on celeb_a. I also can’t just use a deterministic decoder, as I want to later decompose the ELBO (loss term – KL divergence) as seen in this.

Can someone explain to me why the normal distribution just “doesn’t work”? How can I improve it so that it actually learns a distribution that I can sample?

submitted by /u/tadachs


Categories: Misc

How to use a consecutive sequence of one-channel images to predict the next frame label with Conv1D and LSTM?

Hi,

I am quite new to temporal forecast with images and LSTM. I
really appreciate your help.

The input is a sequence of images, where each image is 28*28; the number of images in the sequence is left as None, the first argument of input_shape, so that it acts as the batch size.

Suppose 4 consecutive seconds of images are fed into the NN; the expected output would then be the label for second No. 5.

But I am having a hard time making Conv1D and LSTM work together and ending up with one numerical label.

model = Sequential()
model.add(Conv1D(40, 2, strides=2, padding='same', activation='relu', input_shape=(None, 28, 28)))
model.add(Reshape((None, 576)))  # or model.add(Flatten())
model.add(LSTM(10, activation='relu', stateful=True, return_sequences=True))
  1. Is the batch size set properly?
  2. How do I link Conv1D and LSTM together? I mean the data dimensionality: is it
    necessary to get the numerical labels from Conv1D and then pass them to the
    LSTM, or should the data be passed directly from Conv1D to the LSTM, with one
    numerical label then computed from the LSTM result as the final output of the NN?
  3. Also, is a TimeDistributed() layer needed?

Thank you so much!

submitted by /u/boydbuilding


Categories: Misc

Warhammer 40,000 The New Edition – Trailer (Remastered 8K 60FPS) Resolution increased using neural networks to 8K 60FPS


submitted by /u/stepanmetior

Categories: Misc

Free browser extension for ML community that thousands of machine learning engineers/data scientists use everyday! Drop a comment for any questions/feature requests you may have!


submitted by /u/MLtinkerer

Categories: Misc

A TensorFlow tip to Optimize your Training



Originally posted here.

💡 #TensorFlowTip

Use .prefetch to reduce the step time of training and extracting data (a minimal sketch follows the list below):

  • overlap the preprocessing and model execution
  • while the model executes training step n, the input pipeline is reading the data for step n+1
  • reduces the idle time for both the GPU and CPU
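As a sketch of where .prefetch sits in a tf.data pipeline (the dataset contents and map function here are placeholders; on older TensorFlow versions use tf.data.experimental.AUTOTUNE instead of tf.data.AUTOTUNE):

import tensorflow as tf

# toy pipeline standing in for a real input pipeline
dataset = tf.data.Dataset.from_tensor_slices(tf.random.uniform([1024, 28, 28]))
dataset = dataset.map(lambda x: x / 255.0)  # preprocessing that can overlap with training
dataset = dataset.batch(32)
dataset = dataset.prefetch(tf.data.AUTOTUNE)  # pre-loads batch n+1 while batch n trains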

See the speedup with .prefetch in the image below. Try it for yourself here.


Speedup with .prefetch

submitted by /u/Rishit-dagli


Categories: Offsites

Metrics and summaries in TensorFlow 2

In this relatively short post, I’m going to show you how to deal with metrics and summaries in TensorFlow 2. Metrics, which can be used to monitor various important variables during the training of deep learning networks (such as accuracy or various losses), were somewhat unwieldy in TensorFlow 1.X. Thankfully, in the new TensorFlow 2.0, they are much easier to use. Summary logging, for visualization of training in the TensorBoard interface, has also undergone some changes in TensorFlow 2 that I will be demonstrating. Please note – at the time of writing, only the alpha version of TensorFlow 2 is available, but it is probably safe to assume that the syntax and forms demonstrated in this tutorial will remain the same in TensorFlow 2.0. To install the alpha version, use the following command:
pip install tensorflow==2.0.0-alpha0
In this tutorial, I’ll be using a generic MNIST Convolutional Neural Network example, but utilizing full TensorFlow 2 design paradigms. To learn more about CNNs, see this tutorial – to understand more about TensorFlow 2 paradigms, see this tutorial. All the code for this tutorial can be found as a Google Colaboratory file on my Github repository.

Eager to build deep learning systems? Get the book here

 

TensorFlow 2 metrics

Metrics in TensorFlow 2 can be found in the TensorFlow Keras distribution – tf.keras.metrics. Metrics, along with the rest of TensorFlow 2, are now computed in an Eager fashion. In TensorFlow 1.X, metrics were gathered and computed using the declarative, tf.Session style. All that is required now is to declare a metric as a Python variable, use the method update_state() to add values to the metric’s internal state, result() to compute the metric from that state, and finally reset_states() to reset all the states of the metric. The code below shows a simple implementation of a Mean metric:
mean_metric = tf.keras.metrics.Mean()
mean_metric.update_state(2.0)
mean_metric.update_state(3.0)
mean_metric.update_state(4.0)
print(mean_metric.result().numpy())
This will print the average result -> 3.0. As can be observed, there is an internal memory for the metric, which can be appended to using update_state(). The Mean metric operation is executed when result() is called. Finally, to reset the memory of the metric, we can use reset_states() as follows:
mean_metric.reset_states()
print(mean_metric.result().numpy())
This will print the default response of an empty metric – 0.0.
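The same pattern applies to the other metrics in tf.keras.metrics. As a quick sketch (the label and logit values below are made up purely for illustration), a SparseCategoricalAccuracy metric accumulates label/prediction pairs in exactly the same way:

acc_metric = tf.keras.metrics.SparseCategoricalAccuracy()
# labels are scalar class indices; predictions are per-class scores / logits
acc_metric.update_state([1, 2], [[0.1, 0.9, 0.0], [0.3, 0.2, 0.5]])
print(acc_metric.result().numpy())  # -> 1.0, as both predictions match their labels
acc_metric.reset_states()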

TensorFlow 2 summaries

Metrics fit hand-in-glove with summaries in TensorFlow 2. In order to log summaries in TensorFlow 2, the developer uses the with Python context manager. First, one creates a summary_writer object like so:
summary_writer = tf.summary.create_file_writer('/log')
To log something to the summary writer, the developer must first enclose the “space” within your code which does the logging with a Python with statement. The logging looks like so:
with summary_writer.as_default():
  tf.summary.scalar('mean', mean_metric.result(), step=1)
The with context can surround the full training loop, or just the area of the code where you are storing the summaries. As can be observed, the logged scalar value is set by using the metric result() method. The step value needs to be provided to the summary – this allows TensorBoard to plot the variation of various values, images etc. between training steps. The step number can be tracked manually, but the easiest way is to use the iterations property of whatever optimizer you are using. This will be demonstrated in the example below.

TensorFlow 2 metrics and summaries – CNN example

In this example, I’ll show how to use metrics and summaries in the context of a CNN MNIST classification example. In this example, I’ll use a custom training loop, rather than a Keras fit loop. In the next section, I’ll show you how to implement custom metrics even within the Keras fit functionality. As usual for any machine learning task, the first step is to prepare the training and validation data. In this case, we’ll be using the prepackaged Keras MNIST dataset, then converting the numpy data arrays into a TensorFlow dataset (for more on TensorFlow datasets, see here and here). This looks like the following:
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
BATCH_SIZE=64
# first the training set
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(BATCH_SIZE).shuffle(10000)
train_dataset = train_dataset.map(lambda x, y: (tf.cast(x, tf.float32) / 255.0, y))
train_dataset = train_dataset.map(lambda x, y: (tf.expand_dims(x, -1), y))
train_dataset = train_dataset.repeat()
# now the validation set
valid_dataset = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(5000).shuffle(10000)
valid_dataset = valid_dataset.map(lambda x, y: (tf.cast(x, tf.float32) / 255.0, y))
valid_dataset = valid_dataset.map(lambda x, y: (tf.expand_dims(x, -1), y))
valid_dataset = valid_dataset.repeat()
In the lines above, some preprocessing is applied to the image data to normalize it (divide the pixel values by 255, make the tensors 4D for consumption into CNN layers). Next I define the CNN model, using the Keras sequential paradigm:
model = tf.keras.Sequential()
model.add(tf.keras.layers.Conv2D(32, 2, 1, activation='relu', input_shape=(28, 28, 1)))
model.add(tf.keras.layers.MaxPool2D(2))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Conv2D(32, 2, 1, activation='relu'))
model.add(tf.keras.layers.MaxPool2D(2))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(10))
The model declaration above is all standard Keras – for more on the sequential model type of Keras, see here. Next, we create a custom training loop function in TensorFlow. It is now best practice to encapsulate core parts of your code in Python functions – this is so that the @tf.function decorator can be applied easily to the function. This signals to TensorFlow to perform Just In Time (JIT) compilation of the relevant code into a graph, which allows the performance benefits of a static graph as per TensorFlow 1.X. Otherwise, the code will execute eagerly, which is not a big deal, but if one is building production or performance dependent code it is better to decorate with @tf.function. Here’s the training loop and optimization/loss function definitions:
optimizer = tf.keras.optimizers.Adam()
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
def train(ds_train, optimizer, loss_fn, model, num_batches, log_freq=10):
  avg_loss = tf.keras.metrics.Mean()
  avg_acc = tf.keras.metrics.SparseCategoricalAccuracy()
  batch_idx = 0
  for batch_idx, (images, labels) in enumerate(ds_train):
    with tf.GradientTape() as tape:
      logits = model(images)
      loss_value = loss_fn(labels, logits)
    grads = tape.gradient(loss_value, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    avg_loss.update_state(loss_value)
    avg_acc.update_state(labels, logits)
    if batch_idx % log_freq == 0:
      print(f"Batch {batch_idx}, average loss is {avg_loss.result().numpy()}, average accuracy is {avg_acc.result().numpy()}")
      tf.summary.scalar('loss', avg_loss.result(), step=optimizer.iterations)
      tf.summary.scalar('acc', avg_acc.result(), step=optimizer.iterations)
      avg_loss.reset_states()
      avg_acc.reset_states()
    if batch_idx > num_batches:
      break
As can be observed, I have created two metrics for use in this training loop – avg_loss and avg_acc. These are Mean and SparseCategoricalAccuracy metrics, respectively. The Mean metric has been discussed previously. The SparseCategoricalAccuracy metric takes, as input, the training labels and logits (raw, unactivated outputs from your model). Because it is a sparse categorical accuracy measure, it can take the training labels in scalar integer form, rather than one-hot encoded label vectors. Calling result() on this metric will calculate the average accuracy of all the labels/logits pairs passed during the update_state() calls – see the avg_acc.update_state(labels, logits) call in the loop above. Every log_freq number of batches, the results of the metrics are printed and also passed as summary scalars. After the metrics are logged in the summaries, their states are reset. You will notice that I have not provided a with context for these summaries – this is applied in the outer epoch loop, shown below:
import datetime as dt

num_epochs = 10
summary_writer = tf.summary.create_file_writer('./log/{}'.format(dt.datetime.now().strftime("%Y-%m-%d-%H-%M-%S")))
for i in range(num_epochs):
  print(f"Epoch {i + 1} of {num_epochs}")
  with summary_writer.as_default():
    train(train_dataset, optimizer, loss_fn, model, 10000//BATCH_SIZE)
As can be observed, the summary_writer.as_default() is supplied as context to the whole train function. So far so good. However, this is utilizing a “manual” TensorFlow training loop, which is no longer the easiest way to train in TensorFlow 2, given the tight Keras integration. In the next example, I’ll show you how to include run-of-the-mill metrics in the Keras API, but also custom metrics.

TensorFlow 2 Keras metrics and summaries

To include normal metrics such as accuracy in Keras is straightforward – one supplies a list of metrics to be logged in the compile statement like so:
metric_model.compile(optimizer=tf.optimizers.Adam(),
                     loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                     metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])
However, if one wishes to log more complicated or custom metrics, it becomes difficult to see how to set this up in Keras. One easy way of doing so is by creating a custom Keras layer whose sole purpose is to add a metric to the model / training. In the example below, I have created a custom layer which adds the standard deviation of the kernel weights as a metric:
class MetricLayer(tf.keras.layers.Layer):
  def __init__(self, layer_to_log):
    super(MetricLayer, self).__init__()
    self.layer_to_log = layer_to_log
    
  def call(self, input):
    self.add_metric(tf.keras.backend.std(self.layer_to_log.variables[0]),
                    name=f'std_of_{self.layer_to_log.name}_kernel',
                    aggregation='mean')
    return input
A few things to notice about the creation of the custom layer above. First, notice that the layer is defined as a Python class object which inherits from the keras.layers.Layer object. The only variable passed to the initialization of this custom class is the layer with the kernel weights which we wish to log. The call method tells Keras / TensorFlow what to do when the layer is called in a feed forward pass. In this case, the input is passed straight through to the output – it is, in essence, a dummy layer. However, you’ll notice within the call a metric is added. The value of the metric is the standard deviation of layer_to_log.variables[0]. For a CNN layer, the zero index [0] of the layer variables is the kernel weights. A name is provided to the metric for ease of viewing during training, and finally the aggregation method of the metric is specified – in this case, a ‘mean’ aggregation of the standard deviations. To include this layer, one can just add it as a sequential element in the Keras model. In the below I take the existing CNN model created in the previous example, and create a new model with the custom metric layer appended to the end:
metric_model = tf.keras.Sequential()
metric_model.add(model)
metric_model.add(MetricLayer(model.layers[0]))
As can be observed in the above, the first layer of the previous model is passed to the custom MetricLayer. Running the fit training method on this model will now generate the SparseCategoricalAccuracy metric along with the custom standard deviation metric from the first layer. To monitor in TensorBoard, one must also include the TensorBoard callback. All of this looks like the following:
metric_model.compile(optimizer=tf.optimizers.Adam(),
                     loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                     metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])

callbacks = [
  # Write TensorBoard logs to `./logs` directory
  tf.keras.callbacks.TensorBoard(log_dir='./log/{}'.format(dt.datetime.now().strftime("%Y-%m-%d-%H-%M-%S")), update_freq='batch')
]

metric_model.fit(train_dataset, steps_per_epoch=10000//BATCH_SIZE, epochs=5,
                 validation_data=valid_dataset, validation_steps=5,
                 callbacks=callbacks)
The code above will perform the training and ensure all the metrics (including the metric added in the custom metric layer) are output to TensorBoard via the TensorBoard callback. This concludes my quick introduction to metrics and summaries in TensorFlow 2. Watch out for future posts and updates of existing posts as the transition to TensorFlow 2 develops.  

Eager to build deep learning systems? Get the book here

 

The post Metrics and summaries in TensorFlow 2 appeared first on Adventures in Machine Learning.

Categories: Offsites

An introduction to Global Average Pooling in convolutional neural networks

For those familiar with convolutional neural networks (if you’re not, check out this post), you will know that, for many architectures, the final set of layers are often of the fully connected variety. This is like bolting a standard neural network classifier onto the end of an image processor. The convolutional neural network starts with a series of convolutional (and, potentially, pooling) layers which create feature maps which represent different components of the input images. The fully connected layers at the end then “interpret” the output of these features maps and make category predictions. However, as with many things in the fast moving world of deep learning research, this practice is starting to fall by the wayside in favor of something called Global Average Pooling (GAP). In this post, I’ll introduce the benefits of Global Average Pooling and apply it on the Cats vs Dogs image classification task using TensorFlow 2. In the process, I’ll compare its performance to the standard fully connected layer paradigm. The code for this tutorial can be found in a Jupyter Notebook on this site’s Github repository, ready for use in Google Colaboratory.


Eager to build deep learning systems? Get the book here


Global Average Pooling

Global Average Pooling is an operation that calculates the average output of each feature map in the previous layer. This fairly simple operation reduces the data significantly and prepares the model for the final classification layer. It also has no trainable parameters – just like Max Pooling (see here for more details). The diagram below shows how it is commonly used in a convolutional neural network:

Global Average Pooling in a CNN architecture


As can be observed, the final layers consist simply of a Global Average Pooling layer and a final softmax output layer. In the architecture above, there are 64 averaging calculations corresponding to the 64, 7 x 7 channels at the output of the second convolutional layer. The GAP layer transforms the dimensions from (7, 7, 64) to (1, 1, 64) by performing the averaging across the 7 x 7 channel values (a quick shape check in code follows the list below). Global Average Pooling has the following advantages over the fully connected final layers paradigm:

  • The removal of a large number of trainable parameters from the model. Fully connected or dense layers have lots of parameters. A 7 x 7 x 64 CNN output being flattened and fed into a 500 node dense layer yields 1.56 million weights which need to be trained. Removing these layers speeds up the training of your model.
  • The elimination of all these trainable parameters also reduces the tendency of over-fitting, which needs to be managed in fully connected layers by the use of dropout.
  • The authors argue in the original paper that removing the fully connected classification layers forces the feature maps to be more closely related to the classification categories – so that each feature map becomes a kind of “category confidence map”.
  • Finally, the authors also argue that, due to the averaging operation over the feature maps, this makes the model more robust to spatial translations in the data. In other words, as long as the requisite feature is included / or activated in the feature map somewhere, it will still be “picked up” by the averaging operation.
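To confirm the shape transformation described above, here is a minimal sketch (the random tensor simply stands in for a batch of 7 x 7 x 64 feature maps; note that the Keras layer also squeezes the spatial dimensions, returning shape (1, 64) rather than (1, 1, 64)):

import tensorflow as tf

# stand-in for the output of the final convolutional layer
feature_maps = tf.random.uniform((1, 7, 7, 64))
gap = tf.keras.layers.GlobalAveragePooling2D()
print(gap(feature_maps).shape)  # -> (1, 64): one average per feature map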

To test out these ideas in practice, in the next section I’ll show you an example comparing the benefits of the Global Average Pooling with the historical paradigm. This example problem will be the Cats vs Dogs image classification task and I’ll be using TensorFlow 2 to build the models. At the time of writing, only TensorFlow 2 Alpha is available, and the reader can follow this link to find out how to install it.

Global Average Pooling with TensorFlow 2 and Cats vs Dogs

To download the Cats vs Dogs data for this example, you can use the following code:

import datetime as dt

import tensorflow as tf
from tensorflow.keras import layers
import tensorflow_datasets as tfds

split = (80, 10, 10)
splits = tfds.Split.TRAIN.subsplit(weighted=split)

(cat_train, cat_valid, cat_test), info = tfds.load('cats_vs_dogs', split=list(splits), with_info=True, as_supervised=True)

The code above utilizes the TensorFlow Datasets repository which allows you to import common machine learning datasets into TF Dataset objects.  For more on using Dataset objects in TensorFlow 2, check out this post. A few things to note. First, the split tuple (80, 10, 10) signifies the (training, validation, test) split as percentages of the dataset. This is then passed to the tensorflow_datasets split object which tells the dataset loader how to break up the data. Finally, the tfds.load() function is invoked. The first argument is a string specifying the dataset name to load. Following arguments relate to whether a split should be used, whether to return an argument with information about the dataset (info) and whether the dataset is intended to be used in a supervised learning problem, with labels being included. In order to examine the images in the data set, the following code can be run:

import matplotlib.pylab as plt

for image, label in cat_train.take(2):
  plt.figure()
  plt.imshow(image)

This produces a couple of sample images from the dataset. As can be observed, the images are of varying sizes. This will need to be rectified so that the images have a consistent size to feed into our model. As usual, the image pixel values (which range from 0 to 255) need to be normalized – in this case, to between 0 and 1. The function below performs these tasks:

IMAGE_SIZE = 100
def pre_process_image(image, label):
  image = tf.cast(image, tf.float32)
  image = image / 255.0
  image = tf.image.resize(image, (IMAGE_SIZE, IMAGE_SIZE))
  return image, label

In this example, we’ll be resizing the images to 100 x 100 using tf.image.resize. To get state of the art levels of accuracy, you would probably want a larger image size, say 200 x 200, but in this case I’ve chosen speed over accuracy for demonstration purposes. As can be observed, the image values are also cast into the tf.float32 datatype and normalized by dividing by 255. Next we apply this function to the datasets, and also shuffle and batch where appropriate:

TRAIN_BATCH_SIZE = 64
cat_train = cat_train.map(pre_process_image).shuffle(1000).repeat().batch(TRAIN_BATCH_SIZE)
cat_valid = cat_valid.map(pre_process_image).repeat().batch(1000)

For more on TensorFlow datasets, see this post. Now it is time to build the model – in this example, we’ll be using the Keras API in TensorFlow 2. In this example, I’ll be using a common “head” model, which consists of layers of standard convolutional operations – convolution and max pooling, with batch normalization and ReLU activations:

head = tf.keras.Sequential()
head.add(layers.Conv2D(32, (3, 3), input_shape=(IMAGE_SIZE, IMAGE_SIZE, 3)))
head.add(layers.BatchNormalization())
head.add(layers.Activation('relu'))
head.add(layers.MaxPooling2D(pool_size=(2, 2)))

head.add(layers.Conv2D(32, (3, 3)))
head.add(layers.BatchNormalization())
head.add(layers.Activation('relu'))
head.add(layers.MaxPooling2D(pool_size=(2, 2)))

head.add(layers.Conv2D(64, (3, 3)))
head.add(layers.BatchNormalization())
head.add(layers.Activation('relu'))
head.add(layers.MaxPooling2D(pool_size=(2, 2)))

Next, we need to add the “back-end” of the network to perform the classification.

Standard fully connected classifier results

In the first instance, I’ll show the results of a standard fully connected classifier, without dropout. Because, for this example, there are only two possible classes – “cat” or “dog” – the final output layer is a dense / fully connected layer with a single node and a sigmoid activation.

standard_classifier = tf.keras.Sequential()
standard_classifier.add(layers.Flatten())
standard_classifier.add(layers.BatchNormalization())
standard_classifier.add(layers.Dense(100))
standard_classifier.add(layers.Activation('relu'))
standard_classifier.add(layers.BatchNormalization())
standard_classifier.add(layers.Dense(100))
standard_classifier.add(layers.Activation('relu'))
standard_classifier.add(layers.Dense(1))
standard_classifier.add(layers.Activation('sigmoid'))

As can be observed, in this case, the output classification layers includes 2 x 100 node dense layers. To combine the head model and this standard classifier, the following commands can be run:

standard_model = tf.keras.Sequential([
    head, 
    standard_classifier
])

Finally, the model is compiled, a TensorBoard callback is created for visualization purposes, and the Keras fit command is executed:

standard_model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss='binary_crossentropy',
              metrics=['accuracy'])

callbacks = [tf.keras.callbacks.TensorBoard(log_dir='./log/{}'.format(dt.datetime.now().strftime("%Y-%m-%d-%H-%M-%S")))]

standard_model.fit(cat_train, steps_per_epoch = 23262//TRAIN_BATCH_SIZE, epochs=10, validation_data=cat_valid, validation_steps=10, callbacks=callbacks)

Note that the loss used is binary crossentropy, due to the binary classes for this example. The training progress over 7 epochs can be seen in the figure below:


Standard classifier accuracy (red – training, blue – validation)


Standard classifier loss (red – training, blue – validation)

As can be observed, with a standard fully connected classifier back-end to the model (without dropout), the training accuracy reaches high values but it overfits with respect to the validation dataset. The validation dataset accuracy stagnates around 80% and the loss begins to increase – a sure sign of overfitting.

Global Average Pooling results

The next step is to test the results of the Global Average Pooling in TensorFlow 2. To build the GAP layer and associated model, the following code is added:

average_pool = tf.keras.Sequential()
# note: AveragePooling2D() is used here with its default 2 x 2 pool size;
# a GlobalAveragePooling2D layer would average over the full spatial dimensions
average_pool.add(layers.AveragePooling2D())
average_pool.add(layers.Flatten())
average_pool.add(layers.Dense(1, activation='sigmoid'))

pool_model = tf.keras.Sequential([
    head, 
    average_pool
])

The accuracy results for this model, along with the results of the standard fully connected classifier model, are shown below:


Global average pooling accuracy vs standard fully connected classifier model (pink – GAP training, green – GAP validation, blue – FC classifier validation)

As can be observed from the graph above, the Global Average Pooling model has a higher validation accuracy by the 7th epoch than the fully connected model. The training accuracy is lower than the FC model, but this is clearly due to overfitting being reduced in the GAP model. A final comparison including the case of the FC model with a dropout layer inserted is shown below:

standard_classifier_with_do = tf.keras.Sequential()
standard_classifier_with_do.add(layers.Flatten())
standard_classifier_with_do.add(layers.BatchNormalization())
standard_classifier_with_do.add(layers.Dense(100))
standard_classifier_with_do.add(layers.Activation('relu'))
standard_classifier_with_do.add(layers.Dropout(0.5))
standard_classifier_with_do.add(layers.BatchNormalization())
standard_classifier_with_do.add(layers.Dense(100))
standard_classifier_with_do.add(layers.Activation('relu'))
standard_classifier_with_do.add(layers.Dense(1))
standard_classifier_with_do.add(layers.Activation('sigmoid'))

Global average pooling validation accuracy vs FC classifier with and without dropout (green – GAP model, blue – FC model without DO, orange – FC model with DO)

As can be seen, of the three model options sharing the same convolutional front end, the GAP model has the best validation accuracy after 7 epochs of training (the x-axis in the graph above is the number of batches). Dropout improves the validation accuracy of the FC model, but the GAP model is still narrowly out in front. Further tuning could be performed on the fully connected models and results may improve. However, one would expect Global Average Pooling to be at least equivalent to a FC model with dropout – even though it has hundreds of thousands fewer parameters. I hope this short tutorial gives you a good understanding of Global Average Pooling and its benefits. You may want to consider it in the architecture of your next image classifier design.


Eager to build deep learning systems? Get the book here

The post An introduction to Global Average Pooling in convolutional neural networks appeared first on Adventures in Machine Learning.

Categories: Offsites

Transfer learning in TensorFlow 2 tutorial

In this post, I’m going to cover the very important deep learning concept called transfer learning. Transfer learning is the process whereby one uses neural network models trained in a related domain to accelerate the development of accurate models in your more specific domain of interest. For instance, a deep learning practitioner can use one of the state-of-the-art image classification models, already trained, as a starting point for their own, more specialized, image classification task. In this tutorial, I’ll be showing you how to perform transfer learning using an advanced, pre-trained image classification model – ResNet50 – to improve a more specific image classification task – the cats vs dogs classification problem. In particular, I’ll be showing you how to do this using TensorFlow 2. The code for this tutorial, in a Google Colaboratory notebook format, can be found on this site’s Github repository here. This code borrows some components from the official TensorFlow tutorial.


Eager to build deep learning systems? Get the book here


What are the benefits of transfer learning?

Transfer learning has many benefits:

  1. It speeds up learning: For state of the art results in deep learning, one often needs to build very deep networks with many layers. In order to train such networks, one needs lots of data, computational power and time. These three things are often not readily available.
  2. It needs less data: As will be shown, transfer learning usually only adds a few extra layers to the pre-trained model, and the weights in the pre-trained model are generally fixed. Therefore, during the fine-tuning of the model, only those few extra layers, or a small subset of the total number of layers, are subjected to training. This requires much less data to get good results.
  3. You can leverage the expert tuning of state-of-the-art models: As anyone who has been involved in building deep learning systems can tell you, it requires a lot of patience and tuning of the models to get the best results. By utilizing pre-trained, state-of-the-art models, you can skip a lot of this arduous work and rely on the efforts of experts in the field.

For these reasons, if you are performing some image recognition task, it may be worth using some of the pre-trained, state-of-the-art image classification models, like ResNet, DenseNet, InceptionNet and so on. How does one use these pre-trained models?

How to create a transfer learning model

To create a transfer learning model, all that is required is to take the pre-trained layers and “bolt on” your own network. This could be either at the beginning or end of the pre-trained model. Usually, one disables the pre-trained layer weights and only trains the “bolted on” layers which have been added. For image classification transfer learning, one usually takes the convolutional neural network (CNN) layers from the pre-trained model and adds one or more densely connected “classification” layers at the end (for more on convolutional neural networks, see this tutorial).  The pre-trained CNN layers act as feature extractors / maps, and the classification layer/s at the end can be “taught” to “interpret” these image features. The transfer learning model architecture that will be used in this example is shown below:  

 


ResNet50 transfer learning architecture

The full ResNet50 model shown in the image above, in addition to a Global Average Pooling (GAP) layer, contains a 1000 node dense / fully connected layer which acts as a “classifier” of the 2048 (4 x 4) feature maps output from the ResNet CNN layers. For more on Global Average Pooling, see my tutorial. In this transfer learning task, we’ll be removing these last two layers (GAP and Dense layer) and replacing these with our own GAP and dense layer (in this example, we have a binary classification task – hence the output size is only 1). The GAP layer has no trainable parameters, but the dense layer obviously does – these will be the only parameters trained in this example. All of this is performed quite easily in TensorFlow 2, as will be shown in the next section.

Transfer learning in TensorFlow 2

In this example, we’ll be using the pre-trained ResNet50 model and transfer learning to perform the cats vs dogs image classification task. I’ll also train a smaller CNN from scratch to show the benefits of transfer learning. To access the image dataset, we’ll be using the tensorflow_datasets package which contains a number of common machine learning datasets. To load the data, the following commands can be run:

import tensorflow as tf
from tensorflow.keras import layers
import tensorflow_datasets as tfds

split = (80, 10, 10)
splits = tfds.Split.TRAIN.subsplit(weighted=split)

(cat_train, cat_valid, cat_test), info = tfds.load('cats_vs_dogs', split=list(splits), with_info=True, as_supervised=True)

A few things to note about the code snippet above. First, the split tuple (80, 10, 10) signifies the (training, validation, test) split as percentages of the dataset. This is then passed to the tensorflow_datasets split object which tells the dataset loader how to break up the data. Finally, the tfds.load() function is invoked. The first argument is a string specifying the dataset name to load. Following arguments relate to whether a split should be used, whether to return an argument with information about the dataset (info) and whether the dataset is intended to be used in a supervised learning problem, with labels being included. The variables cat_train, cat_valid and cat_test are TensorFlow Dataset objects – to learn more about these, check out my previous post. In order to examine the images in the data set, the following code can be run:

import matplotlib.pylab as plt

for image, label in cat_train.take(2):
  plt.figure()
  plt.imshow(image)

This produces a couple of sample images from the dataset. As can be observed, the images are of varying sizes. This will need to be rectified so that the images have a consistent size to feed into our model. As usual, the image pixel values (which range from 0 to 255) need to be normalized – in this case, to between 0 and 1. The function below performs these tasks:

IMAGE_SIZE = 100
def pre_process_image(image, label):
  image = tf.cast(image, tf.float32)
  image = image / 255.0
  image = tf.image.resize(image, (IMAGE_SIZE, IMAGE_SIZE))
  return image, label

In this example, we’ll be resizing the images to 100 x 100 using tf.image.resize. To get state of the art levels of accuracy, you would probably want a larger image size, say 200 x 200, but in this case I’ve chosen speed over accuracy for demonstration purposes. As can be observed, the image values are also cast into the tf.float32 datatype and normalized by dividing by 255. Next we apply this function to the datasets, and also shuffle and batch where appropriate:

TRAIN_BATCH_SIZE = 64
cat_train = cat_train.map(pre_process_image).shuffle(1000).repeat().batch(TRAIN_BATCH_SIZE)
cat_valid = cat_valid.map(pre_process_image).repeat().batch(1000)

First, we’ll build a smaller CNN image classifier which will be trained from scratch.

A smaller CNN model

In the code below, a 3 x CNN layer head, a GAP layer and a final densely connected output layer is created. The Keras API, which is the encouraged approach for TensorFlow 2, is used in the model definition below. For more on Keras, see this and this tutorial.

head = tf.keras.Sequential()
head.add(layers.Conv2D(32, (3, 3), input_shape=(IMAGE_SIZE, IMAGE_SIZE, 3)))
head.add(layers.BatchNormalization())
head.add(layers.Activation('relu'))
head.add(layers.MaxPooling2D(pool_size=(2, 2)))

head.add(layers.Conv2D(32, (3, 3)))
head.add(layers.BatchNormalization())
head.add(layers.Activation('relu'))
head.add(layers.MaxPooling2D(pool_size=(2, 2)))

head.add(layers.Conv2D(64, (3, 3)))
head.add(layers.BatchNormalization())
head.add(layers.Activation('relu'))
head.add(layers.MaxPooling2D(pool_size=(2, 2)))

average_pool = tf.keras.Sequential()
average_pool.add(layers.AveragePooling2D())
average_pool.add(layers.Flatten())
average_pool.add(layers.Dense(1, activation='sigmoid'))

standard_model = tf.keras.Sequential([
    head, 
    average_pool
])

To train the model we run:

standard_model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss='binary_crossentropy',
              metrics=['accuracy'])

callbacks = [tf.keras.callbacks.TensorBoard(log_dir='./log/standard_model', update_freq='batch')]

standard_model.fit(cat_train, steps_per_epoch = 23262//TRAIN_BATCH_SIZE, epochs=7, 
               validation_data=cat_valid, validation_steps=10, callbacks=callbacks)

Note that the loss function is ‘binary cross-entropy’, due to the fact that the cats vs dogs image classification task is a binary classification problem (i.e. 0 = cat, 1 = dog or vice-versa). Running the code above, after 7 epochs, gives a training accuracy of around 89% and a validation accuracy of around 85%. Next we’ll see how this compares to the transfer learning case.

ResNet50 transfer learning example

To download the ResNet50 model, you can utilize the tf.keras.applications object to download the ResNet50 model in Keras format with trained parameters. To do so, run the following code:

IMG_SHAPE = (IMAGE_SIZE, IMAGE_SIZE, 3)
res_net = tf.keras.applications.ResNet50(weights='imagenet', include_top=False, input_shape=IMG_SHAPE)

The weights argument ‘imagenet’ denotes that the weights to be used are those generated by training on the ImageNet dataset. The include_top argument states that we only want the CNN feature map part of the ResNet50 model – not its final GAP and densely connected layers. Finally, we specify the input shape that we want the model to accept. Next, we need to disable the training of the parameters within this Keras model. This is performed really easily:

res_net.trainable = False

Next we create a Global Average Pooling layer, along with a final densely connected output layer with sigmoid activation. Then the model is combined using the Keras sequential framework where Keras models can be chained together:

global_average_layer = layers.GlobalAveragePooling2D()
output_layer = layers.Dense(1, activation='sigmoid')
tl_model = tf.keras.Sequential([
  res_net,
  global_average_layer,
  output_layer
])

That’s all that’s required – TensorFlow 2 and Keras make many deep learning tasks quite easy. Running tl_model.summary() gives the following output:

Layer (type)                 Output Shape              Param #   
=================================================================
resnet50 (Model)             (None, 4, 4, 2048)        23587712  
_________________________________________________________________
global_average_pooling2d (Gl (None, 2048)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 2049      
=================================================================
Total params: 23,589,761
Trainable params: 2,049
Non-trainable params: 23,587,712
_________________________________________________________________

As can be observed, while the total number of parameters is large (i.e. 23 million) the number of trainable parameters, corresponding to the weights of the final output layer, is only 2,049. 

To train the model we run:

tl_model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss='binary_crossentropy',
              metrics=['accuracy'])

callbacks = [tf.keras.callbacks.TensorBoard(log_dir='./log/transer_learning_model', update_freq='batch')]

tl_model.fit(cat_train, steps_per_epoch = 23262//TRAIN_BATCH_SIZE, epochs=7, 
             validation_data=cat_valid, validation_steps=10, callbacks=callbacks)

Comparing the models

The graphs below from TensorBoard show the relative performance of the small CNN model trained from scratch and the ResNet50 transfer learning model:


Transfer learning training accuracy comparison (blue – ResNet50, pink – smaller CNN model)


Transfer learning validation accuracy comparison (red – ResNet50 model, green – smaller CNN model)

The results above show that the ResNet50 model reaches higher levels of both training and validation accuracy much quicker than the smaller CNN model that was trained from scratch. This illustrates the benefit of using these powerful pre-trained models as a starting point for your more domain specific deep learning tasks. I hope this post has been a help and given you a good understanding of the benefits of transfer learning, and also how to implement it easily in TensorFlow 2.  


Eager to build deep learning systems? Get the book here

The post Transfer learning in TensorFlow 2 tutorial appeared first on Adventures in Machine Learning.

Categories: Offsites

Double Q reinforcement learning in TensorFlow 2

In previous posts (here and here), deep Q reinforcement learning was introduced. In these posts, examples were presented where neural networks were used to train an agent to act within an environment to maximize rewards. The neural network was trained using something called Q-learning. However, deep Q learning (DQN) has a flaw – it can be unstable due to biased estimates of future rewards, and this slows learning. In this post, I’ll introduce Double Q learning which can solve this bias problem and produce better Q-learning outcomes. We’ll be running a Double Q network on a modified version of the Cartpole reinforcement learning environment. We’ll also be developing the network in TensorFlow 2 – at the time of writing, TensorFlow 2 is in beta and installation instructions can be found here. The code examined in this post can be found here.


Eager to build deep learning systems in TensorFlow 2? Get the book here


A recap of deep Q learning

As mentioned above, you can go here and here to review deep Q learning. However, a quick recap is in order. The goal of the neural network in deep Q learning is to learn the function $Q(s_t, a_t; \theta_t)$. At a given time in the game / episode, the agent will be in a state $s_t$. This state is fed into the network, and various Q values will be returned for each of the possible actions $a_t$ from state $s_t$. The $\theta_t$ refers to the parameters of the neural network (i.e. all the weight and bias values).

The agent chooses an action based on an epsilon-greedy policy $\pi$. This policy is a combination of randomly selected actions combined with the output of the deep Q neural network – with the probability of a randomly selected action decreasing over the training time. When the deep Q network is used to select an action, it does so by taking the maximum Q value returned over all the actions, for state $s_t$. For example, if an agent is in state 1, and this state has 4 possible actions which the agent can perform, it will output 4 Q values. The action which has the highest Q value is the action which will be selected. This can be expressed as:

$$a = \operatorname{argmax}_a Q(s_t, a; \theta_t)$$

Where the argmax is performed over all the actions / output nodes of the neural network. That’s how actions are chosen in deep Q learning. How does training occur? It occurs by utilising the Q-learning / Bellman equation. The equation looks like this:

$$Q_{target} = r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a; \theta_t)$$

How does this read? For a given action a from state $s_{t}$, we want to train the network to predict the following:

  • The immediate reward for taking this action $r_{t+1}$, plus
  • The discounted reward for the best possible action in the subsequent state ($s_{t+1}$)

If we are successful in training the network to predict these values, the agent will consistently choose the action which gives the best immediate reward ($r_{t+1}$) plus the discounted future rewards of future states, $\gamma \max_{a} Q(s_{t+1}, a; \theta_t)$. The $\gamma$ term is the discount factor, which places less value on future rewards than on present rewards (but usually only marginally).

In deep Q learning, the game is repeatedly played and the states, actions and rewards are stored in memory as a list of tuples or an array – ($s_t$, $a$, $r_t$, $s_{t+1}$). Then, for each training step, a random batch of these tuples is extracted from memory and the $Q_{target}(s_t, a_t)$ is calculated and compared to the value produced from the current network $Q(s_t, a_t)$ – the mean squared difference between these two values is used as the loss function to train the neural network.
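As a minimal sketch of this memory mechanism (the buffer size and function names here are illustrative assumptions, not code from those earlier posts):

import random
from collections import deque

import numpy as np

memory = deque(maxlen=50000)  # holds (state, action, reward, next_state) tuples

def remember(state, action, reward, next_state):
    memory.append((state, action, reward, next_state))

def sample_batch(batch_size=32):
    # draw a random batch of past transitions for one training step
    batch = random.sample(memory, batch_size)
    states, actions, rewards, next_states = map(np.array, zip(*batch))
    return states, actions, rewards, next_states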

That’s a fairly brief recap of deep Q learning – for a more extended treatment see here and here. The next section will explain the problems with standard deep Q learning.

The problem with deep Q learning

The problem of deep Q learning has to do with the way it sets the target values:

$$Q_{target} = r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a; \theta_t)$$

Namely, the issue is with the $\max$ value. This part of the equation is supposed to estimate the value of the rewards for future actions if action $a$ is taken from the current state $s_t$. That’s a bit of a mouthful, but just consider it as trying to estimate the optimal future rewards $r_{future}$ if action $a$ is taken.

The problem is that in many environments, there is random noise. Therefore, as an agent explores an environment, it is not directly observing $r_{future}$, but rather something like $r + \epsilon$, where $\epsilon$ is the noise. In such an environment, after repeated playing of the game, we would hope that the network would learn to make unbiased estimates of the expected value of the rewards, $E[r]$. If it can do this, we are in a good spot – the network should pick out the best actions for current and future rewards, despite the presence of noise.

This is where the $\max$ operation is a problem – it produces biased estimates of the future rewards, not the unbiased estimates we require for optimal results. An example will help explain this better. Consider the environment below. The agent starts in state A and at each state can move left or right. The states C, D and F are terminal states – the game ends once these points are reached. The r values are the rewards the agent receives when transitioning from state to state.


Deep Q network bias illustration

All the rewards are deterministic except for the rewards when transitioning from states B to C and B to D. The rewards for these transitions are randomly drawn from a normal distribution with a mean of 1 and a standard deviation of 4.

We know the expected reward, $E[r]$, from taking either action (B to C or B to D) is 1 – however, there is a lot of noise associated with these rewards. Regardless, on average, the agent should ideally learn to always move to the left from A, towards E and finally F, where r always equals 2.

Let’s consider the $Q_{target}$ expression for these cases. Let’s set $\gamma$ to be 0.95. The $Q_{target}$ expression for moving to the left from A is: $Q_{target} = 0 + 0.95 \times \max([0, 2]) = 1.9$. The two action options from E are to either move right (r = 0) or left (r = 2). The maximum of these is obviously 2, and hence we get the result 1.9.

What about in the opposite direction, moving right from A? In this case, it is $Q_{target} = 0 + 0.95 \times \max([0, N(1, 4), N(1, 4)])$, with the zero corresponding to moving back towards A. We can explore the long term value of this “moving right” action by using the following code snippet:

import numpy as np

Ra = np.zeros((10000,))
Rc = np.random.normal(1, 4, 10000)
Rd = np.random.normal(1, 4, 10000)

comb = np.vstack((Ra, Rc, Rd)).transpose()

max_result = np.max(comb, axis=1)

print(np.mean(Rc))
print(np.mean(Rd))
print(np.mean(max_result))

Here a 10,000 iteration trial is created of what the $\max$ term will yield in the long term of running a deep Q agent in this environment. Ra is the reward for moving back to the left towards A (always zero, hence np.zeros()). Rc and Rd are both normal distributions, with mean 1 and standard deviation of 4. Combining all these options together and taking the maximum for each trial gives us what the trial-by-trial $\max$ term will be (max_result). Finally, the expected values (i.e. the means) of each quantity are printed. As expected, the means of Rc and Rd are approximately equal to 1 – the mean which we set for their distributions. However, the expected value / mean from the $\max$ term is actually around 3!

You can see the problem here. Because the $\max$ term is always taking the maximum value from the random draws of the rewards, it tends to be positively biased and does not give a true indication of the expected value of the rewards for a move in this direction (i.e. 1). As such, an agent using the deep Q learning methodology will not choose the optimal action from A (i.e. move left) but will rather tend to move right!

Therefore, in noisy environments, it can be seen that deep Q learning will tend to overestimate rewards. Eventually, deep Q learning will converge to a reasonable solution, but it is potentially much slower than it needs to be. A further problem occurs in deep Q learning which can cause instability in the training process. Consider that in deep Q learning the same network both chooses the best action and determines the value of choosing said actions. There is a feedback loop here which can exacerbate the previously mentioned reward overestimation problem, and further slow down the learning process. This is clearly not ideal, and this is why Double Q learning was developed.

An introduction to Double Q reinforcement learning

The paper that introduced Double Q learning initially proposed the creation of two separate networks which predicted $Q^A$ and $Q^B$ respectively. These networks were trained on the same environment / problem, but were each randomly updated. So, say, 50% of the time, $Q^A$ was updated based on a certain random set of training tuples, and 50% of the time $Q^B$ was updated on a different random set of training tuples. Importantly, the update or target equation for network A had an estimate of the future rewards from network B – not itself. This new approach does two things:

  1. The A and B networks are trained on different training samples – this acts to remove the overestimation bias, as, on average, if network A sees a high noisy reward for a certain action, it is likely that network B will see a lower reward – hence the noise effects will cancel
  2. There is a decoupling between the choice of the best action and the evaluation of the best action

The algorithm from the original paper is as follows:

Original Double Q algorithm


As can be observed, first an action is chosen from either $Q^A(s_t, \cdot)$ or $Q^B(s_t, \cdot)$ and the rewards, next state, action etc. are stored in the memory. Then either UPDATE(A) or UPDATE(B) is chosen randomly. Next, for the state $s_{t+1}$ (or $s'$ in the above) the predicted Q values for all actions from this state are taken from network A or B, and the action with the highest predicted Q value is chosen, $a^*$. Note that, within UPDATE(A), this action is chosen from the output of the $Q^A$ network.

Next, you’ll see something interesting. Consider the update equation for $Q^A$ above – I’ll represent it in more familiar, neural network based notation below:

$$Q^A_{target} = r_{t+1} + \gamma Q^B(s_{t+1}, a^*)$$

Notice that, while the best action $a^*$ from the next state ($s_{t+1}$) is chosen from network A, the discounted reward for taking that future action is extracted from network B. This removes any bias associated with the $\operatorname{argmax}$ from network A, and also decouples the choice of actions from the evaluation of the value of such actions (i.e. it breaks the feedback loop). This is the heart of Double Q reinforcement learning.
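A minimal sketch of this target calculation for a batch of transitions might look like the following (the network definitions and the random transition data here are purely illustrative):

import numpy as np
from tensorflow import keras

GAMMA = 0.95
num_actions = 2

def make_net():
    return keras.Sequential([
        keras.layers.Dense(30, activation='relu', input_shape=(4,)),
        keras.layers.Dense(num_actions)
    ])

network_a, network_b = make_net(), make_net()

# dummy batch of transitions
next_states = np.random.uniform(size=(32, 4)).astype(np.float32)
rewards = np.random.uniform(size=(32,)).astype(np.float32)

# choose the best next action using network A ...
a_star = np.argmax(network_a.predict(next_states), axis=1)
# ... but evaluate that action using network B (the double Q decoupling)
q_b_next = network_b.predict(next_states)
q_target = rewards + GAMMA * q_b_next[np.arange(32), a_star]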

The Double DQN network

The same author of the original Double Q algorithm shown above proposed an update of the algorithm in this paper. This updated algorithm can still legitimately be called a Double Q algorithm, but the author called it Double DQN (or DDQN) to disambiguate. The main difference in this algorithm is the removal of the randomized back-propagation based updating of two networks A and B. There are still two networks involved, but instead of training both of them, only a primary network is actually trained via back-propagation. The other network, often called the target network, is periodically copied from the primary network. The update operation for the primary network in the Double DQN network looks like the following:

$$Q_{target} = r_{t+1} + \gamma Q(s_{t+1}, \text{argmax}_a\, Q(s_{t+1}, a; \theta_t); \theta^-_t)$$

Alternatively, keeping in line with the previous representation:

$$a^* = \text{argmax}_a\, Q(s_{t+1}, a; \theta_t)$$

$$Q_{target} = r_{t+1} + \gamma Q(s_{t+1}, a^*; \theta^-_t)$$

Notice that, as per the previous algorithm, the action $a^*$ with the highest Q value for the next state ($s_{t+1}$) is extracted from the primary network, which has weights $\theta_t$. This primary network is also often called the "online" network – it is the network from which action decisions are taken. However, notice that, when determining $Q_{target}$, the discounted Q value is taken from the target network with weights $\theta^-_t$. Therefore, the actions for the agent to take are extracted from the online network, but the evaluation of the future rewards is taken from the target network. So far, this is similar to the UPDATE(A) step shown in the previous Double Q algorithm.

The difference in this algorithm is that the target network weights ($\theta^-_t$) are not trained via back-propagation – rather, they are periodically copied from the online network. This reduces the computational overhead of training two networks by back-propagation. The copying can either be a periodic "hard copy", where the weights are copied from the online network to the target network without modification, or a more frequent "soft copy", where the existing target weight values are blended with the online network values. In the example which will soon follow, soft copying is performed every training iteration, under the following rule:

$$\theta^- = \theta^- (1 - \tau) + \theta \tau$$

Here $\tau$ is a small constant (e.g. 0.05).
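As a standalone sketch (assuming two Keras models of identical architecture which have already been built, as in the example that follows), the two copying schemes might look like this:

def hard_copy(primary_network, target_network):
    # periodic full overwrite of the target weights
    target_network.set_weights(primary_network.get_weights())

def soft_copy(primary_network, target_network, tau=0.05):
    # every step, blend the target weights slightly towards the primary weights
    for t, e in zip(target_network.trainable_variables,
                    primary_network.trainable_variables):
        t.assign(t * (1 - tau) + e * tau)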

This DDQN algorithm achieves the decoupling between action choice and action evaluation, and it has been shown to remove the overestimation bias of deep Q learning. In the next section, I'll present a code walkthrough of a training algorithm which contains options for both a standard deep Q network and a Double DQN.

A Double Q network example in TensorFlow 2

In this example, I’ll present code which trains a double Q network on the Cartpole reinforcement learning environment. This environment is implemented in OpenAI gym, so you’ll need to have that package installed before attempting to run or replicate. The code for this example can be found on this site’s Github repo.

First, we declare some constants and create the environment:

import datetime as dt
import math
import random

import gym
import numpy as np
import tensorflow as tf
from tensorflow import keras

STORE_PATH = '/Users/andrewthomas/Adventures in ML/TensorFlowBook/TensorBoard'
MAX_EPSILON = 1
MIN_EPSILON = 0.01
LAMBDA = 0.0005
GAMMA = 0.95
BATCH_SIZE = 32
TAU = 0.08
RANDOM_REWARD_STD = 1.0

env = gym.make("CartPole-v0")
state_size = 4
num_actions = env.action_space.n

Notice the epsilon-greedy policy parameters (MIN_EPSILON, MAX_EPSILON, LAMBDA) which dictate how long the exploration period of training should last. GAMMA is the discount rate for future rewards. The final constant, RANDOM_REWARD_STD, will be explained in more detail later.
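As a quick sanity check on the exploration schedule implied by these constants (the decay formula itself appears in the training loop below):

for steps in [0, 1000, 5000, 10000]:
    eps = MIN_EPSILON + (MAX_EPSILON - MIN_EPSILON) * math.exp(-LAMBDA * steps)
    print(steps, round(eps, 3))
# prints: 0 1.0 / 1000 0.61 / 5000 0.091 / 10000 0.017

So the agent acts essentially at random for the first few hundred steps, and is almost fully greedy by around 10,000 steps.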

It can be observed that the CartPole environment has a state size of 4, and the number of available actions is extracted directly from the environment (there are only 2 of them). Next, the primary (or online) network and the target network are created using the Keras Sequential API:

primary_network = keras.Sequential([
    keras.layers.Dense(30, activation='relu', kernel_initializer=keras.initializers.he_normal()),
    keras.layers.Dense(30, activation='relu', kernel_initializer=keras.initializers.he_normal()),
    keras.layers.Dense(num_actions)
])

target_network = keras.Sequential([
    keras.layers.Dense(30, activation='relu', kernel_initializer=keras.initializers.he_normal()),
    keras.layers.Dense(30, activation='relu', kernel_initializer=keras.initializers.he_normal()),
    keras.layers.Dense(num_actions)
])

primary_network.compile(optimizer=keras.optimizers.Adam(), loss='mse')

The code above consists of fairly standard Keras model definitions, with dense layers, ReLU activations and He normal initializations (for further information, see these posts: Keras, ReLU activations and initialization). Notice that only the primary network is compiled, as it is the only network which will be trained via the Adam optimizer.

class Memory:
    def __init__(self, max_memory):
        self._max_memory = max_memory
        self._samples = []

    def add_sample(self, sample):
        self._samples.append(sample)
        if len(self._samples) > self._max_memory:
            self._samples.pop(0)

    def sample(self, no_samples):
        if no_samples > len(self._samples):
            return random.sample(self._samples, len(self._samples))
        else:
            return random.sample(self._samples, no_samples)

    @property
    def num_samples(self):
        return len(self._samples)


memory = Memory(50000)

Next, a generic Memory class is created. This holds all the ($s_t$, $a_t$, $r_{t+1}$, $s_{t+1}$) tuples which are stored during training, and includes functionality to extract random samples for training. In this example, we'll be using a Memory instance with a maximum buffer size of 50,000 samples.
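A quick illustration of how the buffer behaves (a throwaway instance with made-up tuples, separate from the real 50,000-row memory above):

demo_memory = Memory(3)
for i in range(5):
    demo_memory.add_sample((f"s{i}", 0, 1.0, f"s{i + 1}"))
print(demo_memory.num_samples)  # 3 - the two oldest samples have been evicted
print(demo_memory.sample(2))    # 2 tuples drawn at random from the remaining 3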

def choose_action(state, primary_network, eps):
    if random.random() < eps:
        return random.randint(0, num_actions - 1)
    else:
        return np.argmax(primary_network(state.reshape(1, -1)))

The function above executes the epsilon-greedy action policy. As explained in previous posts on deep Q learning, the epsilon value is slowly reduced, so that action selection moves from the purely random selection of actions to actions selected by the primary network.
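The two extremes of this policy can be seen in isolation (a usage sketch with a dummy state, not part of the repo code):

dummy_state = np.zeros(state_size)
choose_action(dummy_state, primary_network, 1.0)  # always a random action
choose_action(dummy_state, primary_network, 0.0)  # always the network's argmax action

A final training function still needs to be reviewed, but first we'll examine the main training loop: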

num_episodes = 1000
eps = MAX_EPSILON
render = False
train_writer = tf.summary.create_file_writer(STORE_PATH + f"/DoubleQ_{dt.datetime.now().strftime('%d%m%Y%H%M')}")
double_q = False
steps = 0
for i in range(num_episodes):
    state = env.reset()
    cnt = 0
    avg_loss = 0
    while True:
        if render:
            env.render()
        action = choose_action(state, primary_network, eps)
        next_state, reward, done, info = env.step(action)
        reward = np.random.normal(1.0, RANDOM_REWARD_STD)
        if done:
            next_state = None
        # store in memory
        memory.add_sample((state, action, reward, next_state))

        loss = train(primary_network, memory, target_network if double_q else None)
        avg_loss += loss

        state = next_state

        # exponentially decay the eps value
        steps += 1
        eps = MIN_EPSILON + (MAX_EPSILON - MIN_EPSILON) * math.exp(-LAMBDA * steps)

        if done:
            avg_loss /= max(cnt, 1)  # guard against division by zero if the episode ends on the first step
            print(f"Episode: {i}, Reward: {cnt}, avg loss: {avg_loss:.3f}, eps: {eps:.3f}")
            with train_writer.as_default():
                tf.summary.scalar('reward', cnt, step=i)
                tf.summary.scalar('avg loss', avg_loss, step=i)
            break

        cnt += 1

Starting from the num_episodes loop, we can observe that the environment is first reset and the current state of the agent is returned. A while True loop is then entered, which is only exited when the environment signals that the episode has been completed. The code will render the Cartpole environment if the relevant variable has been set to True.

The next line shows the action selection, where the current state, the primary network and the epsilon value are fed into the previously examined choose_action function. The chosen action is then fed into the environment by calling env.step(). This command returns the next state the agent has entered ($s_{t+1}$), the reward ($r_{t+1}$) and the done Boolean which signifies whether the episode has been completed.

The Cartpole environment is completely deterministic, with no randomness involved except in the initialization of the environment. Because Double Q learning is superior to deep Q learning especially when there is randomness in the environment, on the next line the Cartpole environment is externally transformed into a stochastic one. Normally, the reward from the Cartpole environment is a deterministic value of 1.0 for every step the pole stays upright. Here, however, the reward is replaced with a sample from a normal distribution, with mean 1.0 and standard deviation equal to the constant RANDOM_REWARD_STD.

In the first run, RANDOM_REWARD_STD is set to 0.0 to transform the environment back into a deterministic case; this will be changed for the second example run.
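Note that numpy makes this switch between the stochastic and deterministic cases clean – a normal distribution with zero standard deviation simply returns its mean:

np.random.normal(1.0, 1.0)  # a noisy reward, e.g. 0.3, 1.8 or even -0.4
np.random.normal(1.0, 0.0)  # exactly 1.0 - the original CartPole reward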

After this, the sample tuple is added to the memory and the primary network is trained.

Notice that the target_network is only passed to the training function if the double_q variable is set to True. If double_q is set to False, the training function defaults to standard deep Q learning. Finally, the state is updated, and if the environment has signalled that the episode has ended, some logging is performed and the while loop is exited.

It is now time to review the train function, which is where most of the work takes place:

def train(primary_network, memory, target_network=None):
    if memory.num_samples < BATCH_SIZE * 3:
        return 0
    batch = memory.sample(BATCH_SIZE)
    states = np.array([val[0] for val in batch])
    actions = np.array([val[1] for val in batch])
    rewards = np.array([val[2] for val in batch])
    next_states = np.array([(np.zeros(state_size)
                             if val[3] is None else val[3]) for val in batch])
    # predict Q(s,a) given the batch of states
    prim_qt = primary_network(states)
    # predict Q(s',a') from the evaluation network
    prim_qtp1 = primary_network(next_states)
    # copy the prim_qt into the target_q tensor - we then will update one index corresponding to the max action
    target_q = prim_qt.numpy()
    updates = rewards
    valid_idxs = np.array(next_states).sum(axis=1) != 0
    batch_idxs = np.arange(BATCH_SIZE)
    if target_network is None:
        updates[valid_idxs] += GAMMA * np.amax(prim_qtp1.numpy()[valid_idxs, :], axis=1)
    else:
        prim_action_tp1 = np.argmax(prim_qtp1.numpy(), axis=1)
        q_from_target = target_network(next_states)
        updates[valid_idxs] += GAMMA * q_from_target.numpy()[batch_idxs[valid_idxs], prim_action_tp1[valid_idxs]]
    target_q[batch_idxs, actions] = updates
    loss = primary_network.train_on_batch(states, target_q)
    if target_network is not None:
        # update target network parameters slowly from primary network
        for t, e in zip(target_network.trainable_variables, primary_network.trainable_variables):
            t.assign(t * (1 - TAU) + e * TAU)
    return loss

The first line bypasses the function if the memory does not yet contain more than 3 x the batch size of samples – this ensures that no training of the primary network takes place until there is a reasonable number of samples in the memory.

Next, a batch is extracted from the memory – this is a list of tuples. The individual state, action and reward values are then extracted and converted to numpy arrays using Python list comprehensions. Note that the next_state values are set to zeros if the raw next_state value is None – this only happens when the episode has terminated.

Next, the sampled states ($s_t$) are passed through the primary network – this returns the values $Q(s_t, a; \theta_t)$. The next line extracts the Q values from the primary network for the next states ($s_{t+1}$). Next, we want to start constructing our target_q values ($Q_{target}$). These are the "labels" which will be supplied to the primary network to train towards.

Note that the target_q values are the same as the prim_qt ($Q(s_t, a; \theta_t)$) values except for the index corresponding to the action chosen. So, for instance, let's say a single sample of the prim_qt values is [0.5, -0.5], but the action chosen from $s_t$ was 0. We only want to update the 0.5 value during training; the remaining values in target_q remain equal to prim_qt (i.e. [update, -0.5]). Therefore, in the next line, we create target_q by simply converting prim_qt from a tensor into its numpy equivalent. This is basically a copy of the values from prim_qt to target_q. We also convert to numpy because it is easier to deal with indexing in numpy than in TensorFlow at this stage.
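The per-action update can be demonstrated in isolation (made-up numbers, extending the [0.5, -0.5] example above):

prim_qt = np.array([[0.5, -0.5],
                    [0.2, 0.9]])
actions = np.array([0, 1])       # the action taken from each sampled state
updates = np.array([1.3, 0.7])   # r + discounted future value, per sample
target_q = prim_qt.copy()
target_q[np.arange(2), actions] = updates
# target_q is now [[1.3, -0.5], [0.2, 0.7]] - only the taken actions changed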

To effect these updates, we create a new variable, updates. The first step is to set the update values to the sampled rewards – the $r_{t+1}$ values are the same regardless of whether we are performing deep Q learning or Double Q learning. In the following lines, these update values will be added to in order to capture the discounted future reward terms. The next line creates an array called valid_idxs. This array holds all those samples in the batch where next_state is not zero. When next_state is zero, the episode has terminated, and in those cases only the first term of the equation below remains ($r_{t+1}$):

$$Q_{target} = r_{t+1} + \gamma Q(s_{t+1}, a^*; \theta^-_t)$$

Seeing as updates already includes the first term, any further additions to updates need to exclude these terminated indexes.

The next line, batch_idxs, is simply a numpy arange which counts out the number of samples within the batch. This is included to ensure that the numpy indexing / broadcasting to follow works properly.

The next line switches depending on whether Double Q learning has been enabled or not. If target_network is None, then standard deep Q learning ensues. In that case, the following term is calculated and added to updates (which already includes the reward term):

$$\gamma \max_{a} Q(s_{t+1}, a; \theta_t)$$

Alternatively, if target_network is not None, then Double Q learning is performed. The first line:

prim_action_tp1 = np.argmax(prim_qtp1.numpy(), axis=1)

calculates the following equation shown earlier:

$$a^* = \text{argmax}_a\, Q(s_{t+1}, a; \theta_t)$$

The next line extracts the Q values from the target network for state $s_{t+1}$ and assigns them to the variable q_from_target. Finally, the following term is added to updates:

$$\gamma Q(s_{t+1}, a^*; \theta^-_t)$$

Notice that the numpy indexing extracts from q_from_target all the valid batch samples and, within those samples, the Q values of the highest-value actions drawn from the primary network (i.e. $a^*$).
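In isolation, this masked gather looks like the following (again with illustrative numbers):

q_from_target = np.array([[0.1, 0.4],
                          [0.8, 0.2],
                          [0.6, 0.3]])
prim_action_tp1 = np.array([1, 0, 0])       # a* per sample, from the primary network
valid_idxs = np.array([True, False, True])  # the second sample hit a terminal state
batch_idxs = np.arange(3)
q_from_target[batch_idxs[valid_idxs], prim_action_tp1[valid_idxs]]
# array([0.4, 0.6]) - the a* values for the non-terminal samples only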

Finally, the target_q values corresponding to the actions taken from state $s_t$ are updated with the updates array.

Following this, the primary network is trained on this batch of data using the Keras train_on_batch function. The last step in the function involves copying the primary (or online) network weights into the target network. This can be varied so that the copy only occurs every X training steps (especially when performing a "hard copy"). However, as stated previously, in this example we'll be doing a "soft copy", so every training step moves the target network weights slightly towards the primary network weights. As can be observed, for every trainable variable in both the primary and target networks, the target network variables are assigned new values via the previously presented formula:

$$\theta^- = \theta^- (1 - \tau) + \theta \tau$$

That (rather lengthy) explanation concludes the discussion of how Double Q learning can be implemented in TensorFlow 2. Now it is time to examine the results of the training.

Double Q results for a deterministic case

In the first case, we are going to examine the deterministic training case when RANDOM_REWARD_STD is set to 0.0. The TensorBoard graph below shows the results:

Double Q deterministic case (blue – Double Q, red – deep Q)

As can be observed, in both the Double Q and deep Q training cases, the networks converge on "correctly" solving the Cartpole problem, with eventual consistent rewards of 180-200 per episode (200 is the maximum reward available per episode in the Cartpole environment). The Double Q case reaches the "solved" state slightly faster than the deep Q implementation. This is likely due to the better stability gained from decoupling the choice and evaluation of actions, but it is not a conclusive result in this rather simple, deterministic environment.

However, what happens when we increase the randomness by raising RANDOM_REWARD_STD above 0?

Double Q results for a stochastic case

The results below show the case when RANDOM_REWARD_STD is increased to 1.0 – in this case, the rewards are drawn from a random normal distribution of mean 1.0 and standard deviation of 1.0:

Double Q stochastic case (blue – Double Q, red – deep Q)

As can be seen, in this case the Double Q network significantly outperforms the deep Q training methodology. This demonstrates the effect of the overestimation bias in deep Q training, and the advantage of using Double Q learning in your reinforcement learning tasks.

I hope this post was helpful in increasing your understanding of both deep Q and Double Q reinforcement learning. Keep an eye out for future posts on reinforcement learning.


Eager to build deep learning systems in TensorFlow 2? Get the book here



The post Double Q reinforcement learning in TensorFlow 2 appeared first on Adventures in Machine Learning.