Categories
Misc

Power Your Big Data Analytics with the Latest NVIDIA GPUs in the Cloud

To make it easier to leverage NVIDIA accelerated compute, we’ve added support for launching RAPIDS + Dask on the latest NVIDIA A100 GPUs in the cloud.

Dask is an accessible and powerful solution for natively scaling Python analytics. Using familiar interfaces, it allows data scientists familiar with PyData tools to scale big data workloads easily. Dask is such a powerful tool that we have adopted it throughout a variety of projects at NVIDIA. When paired with RAPIDS, data practitioners can distribute big data workloads across massive NVIDIA GPU clusters.

To make it easier to leverage NVIDIA accelerated compute, we’ve added support for launching RAPIDS + Dask on the latest NVIDIA A100 GPUs in the cloud, allowing users and enterprises to get the most out of their data.

Spin-Up NVIDIA GPU Clusters Quickly with Dask Cloud Provider

While Dask makes scaling analytics workloads easy, distributing workloads in Cloud environments can be tricky. Dask-CloudProvider is a package that provides native Cloud integration, making it simple to get started on Amazon Web Services, Google Cloud Platform, or Microsoft Azure. Using native Cloud tools, data scientists, machine learning engineers, and DevOps engineers can stand-up infrastructure and start running workloads in no time.

RAPIDS builds upon Dask-CloudProvider to make spinning-up the most powerful NVIDIA GPU instances easy with raw virtual machines. While AWS, GCP, and Azure have great managed services for data scientists, these implementations can take time to adopt new GPU architectures. With Dask-CloudProvider and RAPIDS, users and enterprises can leverage the latest NVIDIA A100 GPUs, providing 20x more performance than the previous generation. With 40GB of GPU memory each and 600GB/s NVLINK connection, NVIDIA A100 GPUs are a supercharged workhorse for enterprise-scale data science workloads. Dask-CloudProvider and RAPIDS provide an easy way to get started with A100s without having to configure raw VMs from scratch.

RAPIDS strives to make NVIDIA accelerated data science accessible to a broader data-driven audience. With Dask, RAPIDS allows data scientists to solve enterprise-scale problems in less time and with less pain. For a deeper understanding of the latest RAPIDS features and integrations, read more here.

Categories
Offsites

Leveraging Machine Learning for Game Development

Over the years, online multiplayer games have exploded in popularity, captivating millions of players across the world. This popularity has also exponentially increased demands on game designers, as players expect games to be well-crafted and balanced — after all, it’s no fun to play a game where a single strategy beats all the rest.

In order to create a positive gameplay experience, game designers typically tune the balance of a game iteratively:

  1. Stress-test through thousands of play-testing sessions from test users
  2. Incorporate feedback and re-design the game
  3. Repeat 1 & 2 until both the play-testers and game designers are satisfied

This process is not only time-consuming but also imperfect — the more complex the game, the easier it is for subtle flaws to slip through the cracks. When games often have many different roles that can be played, with dozens of interconnecting skills, it makes it all the more difficult to hit the right balance.

Today, we present an approach that leverages machine learning (ML) to adjust game balance by training models to serve as play-testers, and demonstrate this approach on the digital card game prototype Chimera, which we’ve previously shown as a testbed for ML-generated art. By running millions of simulations using trained agents to collect data, this ML-based game testing approach enables game designers to more efficiently make a game more fun, balanced, and aligned with their original vision.

Chimera
We developed Chimera as a game prototype that would heavily lean on machine learning during its development process. For the game itself, we purposefully designed the rules to expand the possibility space, making it difficult to build a traditional hand-crafted AI to play the game.

The gameplay of Chimera revolves around the titular chimeras, creature mash-ups that players aim to strengthen and evolve. The objective of the game is to defeat the opponent’s chimera. These are the key points in the game design:

  • Players may play:
    • creatures, which can attack (through their attack stat) or be attacked (against their health stat), or
    • spells, which produce special effects.
  • Creatures are summoned into limited-capacity biomes, which are placed physically on the board space. Each creature has a preferred biome and will take repeated damage if placed on an incorrect biome or a biome that is over capacity.
  • A player controls a single chimera, which starts off in a basic “egg” state and can be evolved and strengthened by absorbing creatures. To do this, the player must also acquire a certain amount of link energy, which is generated from various gameplay mechanics.
  • The game ends when a player has successfully brought the health of the opponent’s chimera to 0.

Learning to Play Chimera
As an imperfect information card game with a large state space, we expected Chimera to be a difficult game for an ML model to learn, especially as we were aiming for a relatively simple model. We used an approach inspired by those used by earlier game-playing agents like AlphaGo, in which a convolutional neural network (CNN) is trained to predict the probability of a win when given an arbitrary game state. After training an initial model on games where random moves were chosen, we set the agent to play against itself, iteratively collecting game data, that was then used to train a new agent. With each iteration, the quality of the training data improved, as did the agent’s ability to play the game.

The ML agent’s performance against our best hand-crafted AI as training progressed. The initial ML agent (version 0) picked moves randomly.

For the actual game state representation that the model would receive as input, we found that passing an “image” encoding to the CNN resulted in the best performance, beating all benchmark procedural agents and other types of networks (e.g. fully connected). The chosen model architecture is small enough to run on a CPU in reasonable time, which allowed us to download the model weights and run the agent live in a Chimera game client using Unity Barracuda.

An example game state representation used to train the neural network.
In addition to making decisions for the game AI, we also used the model to display the estimated win probability for a player over the course of the game.

Balancing Chimera
This approach enabled us to simulate millions more games than real players would be capable of playing in the same time span. After collecting data from the games played by the best-performing agents, we analyzed the results to find imbalances between the two of the player decks we had designed.

First, the Evasion Link Gen deck was composed of spells and creatures with abilities that generated extra link energy used to evolve a player’s chimera. It also contained spells that enabled creatures to evade attacks. In contrast, the Damage-Heal deck contained creatures of variable strength with spells that focused on healing and inflicting minor damage. Although we had designed these decks to be of equal strength, the Evasion Link Gen deck was winning 60% of the time when played against the Damage-Heal deck.

When we collected various stats related to biomes, creatures, spells, and chimera evolutions, two things immediately jumped out at us:

  1. There was a clear advantage in evolving a chimera — the agent won a majority of the games where it evolved its chimera more than the opponent did. Yet, the average number of evolves per game did not meet our expectations. To make it more of a core game mechanic, we wanted to increase the overall average number of evolves while keeping its usage strategic.
  2. The T-Rex creature was overpowered. Its appearances correlated strongly with wins, and the model would always play the T-Rex regardless of penalties for summoning into an incorrect or overcrowded biome.

From these insights, we made some adjustments to the game. To emphasize chimera evolution as a core mechanism in the game, we decreased the amount of link energy required to evolve a chimera from 3 to 1. We also added a “cool-off” period to the T-Rex creature, doubling the time it took to recover from any of its actions.

Repeating our ‘self-play’ training procedure with the updated rules, we observed that these changes pushed the game in the desired direction — the average number of evolves per game increased, and the T-Rex’s dominance faded.

One example comparison of the T-Rex’s influence before and after balancing. The charts present the number of games won (or lost) when a deck initiates a particular spell interaction (e.g., using the “Dodge” spell to benefit a T-Rex). Left: Before the changes, the T-Rex had a strong influence in every metric examined — highest survival rate, most likely to be summoned ignoring penalties, most absorbed creature during wins. Right: After the changes, the T-Rex was much less overpowered.

By weakening the T-Rex, we successfully reduced the Evasion Link Gen deck’s reliance on an overpowered creature. Even so, the win ratio between the decks remained at 60/40 rather than 50/50. A closer look at the individual game logs revealed that the gameplay was often less strategic than we would have liked. Searching through our gathered data again, we found several more areas to introduce changes in.

To start, we increased the starting health of both players as well as the amount of health that healing spells could replenish. This was to encourage longer games that would allow a more diverse set of strategies to flourish. In particular, this enabled the Damage-Heal deck to survive long enough to take advantage of its healing strategy. To encourage proper summoning and strategic biome placement, we increased the existing penalties on playing creatures into incorrect or overcrowded biomes. And finally, we decreased the gap between the strongest and weakest creatures through minor attribute adjustments.

New adjustments in place, we arrived at the final game balance stats for these two decks:

Deck Avg # evolves per game    
(before → after)    
Win % (1M games)
(before → after)
Evasion Link Gen     1.54 → 2.16     59.1% → 49.8%
Damage Heal 0.86 → 1.76     40.9% → 50.2%

Conclusion
Normally, identifying imbalances in a newly prototyped game can take months of playtesting. With this approach, we were able to not only discover potential imbalances but also introduce tweaks to mitigate them in a span of days. We found that a relatively simple neural network was sufficient to reach high level performance against humans and traditional game AI. These agents could be leveraged in further ways, such as for coaching new players or discovering unexpected strategies. We hope this work will inspire more exploration in the possibilities of machine learning for game development.

Acknowledgements
This project was conducted in collaboration with many people. Thanks to Ryan Poplin, Maxwell Hannaman, Taylor Steil, Adam Prins, Michal Todorovic, Xuefan Zhou, Aaron Cammarata, Andeep Toor, Trung Le, Erin Hoffman-John, and Colin Boswell. Thanks to everyone who contributed through playtesting, advising on game design, and giving valuable feedback.

Categories
Misc

Build Your Own AI-Powered Q&A Service

You can now build your own AI-powered Q&A service with the step-by-step instructions provided in this four-part blog series.

Conversational AI, the ability for machines to understand and respond to human queries, is being widely adopted across industries as enterprises see the value of this technology through solutions like chatbots and virtual assistants to better support their customers while lowering the cost of customer service. 

You can now build your own AI-powered Q&A service with the step-by-step instructions provided in this four-part blog series. All the software resources you will need, from the deep learning frameworks to pre-trained models to inference engines are available from the NVIDIA NGC catalog – a hub of GPU-optimized software.

The blog series walks through:

  1. Part 1: Leveraging pre-trained models to build custom models with your training dataset 
  2. Part 2: Optimizing the custom model to provide lower latency and higher throughput 
  3. Part 3: Running inference on your custom models 
  4. Part 4: Deploying the virtual assistant in the cloud 

While the blog instructions use GPU-powered cloud instances, these instructions will work for on-prem systems as well.

Build your virtual assistant today with these instructions or join us at NVIDIA GTC for free on April 13th for our session “Accelerating AI Workflows at GTC” to learn step-by-step how to build a conversational AI solution using artifacts from the NGC catalog.

Categories
Misc

torch time series, final episode: Attention

This is the final post in a four-part introduction to time-series forecasting with torch. These posts have been the story of a quest for multiple-step prediction, and by now, we’ve seen three different approaches: forecasting in a loop, incorporating a multi-layer perceptron (MLP), and sequence-to-sequence models. Here’s a quick recap.

  • As one should when one sets out for an adventurous journey, we started with an in-depth study of the tools at our disposal: recurrent neural networks (RNNs). We trained a model to predict the very next observation in line, and then, thought of a clever hack: How about we use this for multi-step prediction, feeding back individual predictions in a loop? The result , it turned out, was quite acceptable.

  • Then, the adventure really started. We built our first model “natively” for multi-step prediction, relieving the RNN a bit of its workload and involving a second player, a tiny-ish MLP. Now, it was the MLP’s task to project RNN output to several time points in the future. Although results were pretty satisfactory, we didn’t stop there.

  • Instead, we applied to numerical time series a technique commonly used in natural language processing (NLP): sequence-to-sequence (seq2seq) prediction. While forecast performance was not much different from the previous case, we found the technique to be more intuitively appealing, since it reflects the causal relationship between successive forecasts.

Today we’ll enrich the seq2seq approach by adding a new component: the attention module. Originally introduced around 20141, attention mechanisms have gained enormous traction, so much so that a recent paper title starts out “Attention is Not All You Need”2.

The idea is the following.

In the classic encoder-decoder setup, the decoder gets “primed” with an encoder summary just a single time: the time it starts its forecasting loop. From then on, it’s on its own. With attention, however, it gets to see the complete sequence of encoder outputs again every time it forecasts a new value. What’s more, every time, it gets to zoom in on those outputs that seem relevant for the current prediction step.

This is a particularly useful strategy in translation: In generating the next word, a model will need to know what part of the source sentence to focus on. How much the technique helps with numerical sequences, in contrast, will likely depend on the features of the series in question.

Data input

As before, we work with vic_elec, but this time, we partly deviate from the way we used to employ it. With the original, bi-hourly dataset, training the current model takes a long time, longer than readers will want to wait when experimenting. So instead, we aggregate observations by day. In order to have enough data, we train on years 2012 and 2013, reserving 2014 for validation as well as post-training inspection.

vic_elec_daily <- vic_elec %>%
  select(Time, Demand) %>%
  index_by(Date = date(Time)) %>%
  summarise(
    Demand = sum(Demand) / 1e3) 

elec_train <- vic_elec_daily %>% 
  filter(year(Date) %in% c(2012, 2013)) %>%
  as_tibble() %>%
  select(Demand) %>%
  as.matrix()

elec_valid <- vic_elec_daily %>% 
  filter(year(Date) == 2014) %>%
  as_tibble() %>%
  select(Demand) %>%
  as.matrix()

elec_test <- vic_elec_daily %>% 
  filter(year(Date) %in% c(2014), month(Date) %in% 1:4) %>%
  as_tibble() %>%
  select(Demand) %>%
  as.matrix()

train_mean <- mean(elec_train)
train_sd <- sd(elec_train)

We’ll attempt to forecast demand up to fourteen days ahead. How long, then, should be the input sequences? This is a matter of experimentation; all the more so now that we’re adding in the attention mechanism. (I suspect that it might not handle very long sequences so well).

Below, we go with fourteen days for input length, too, but that may not necessarily be the best possible choice for this series.

n_timesteps <- 7 * 2
n_forecast <- 7 * 2

elec_dataset <- dataset(
  name = "elec_dataset",
  
  initialize = function(x, n_timesteps, sample_frac = 1) {
    
    self$n_timesteps <- n_timesteps
    self$x <- torch_tensor((x - train_mean) / train_sd)
    
    n <- length(self$x) - self$n_timesteps - 1
    
    self$starts <- sort(sample.int(
      n = n,
      size = n * sample_frac
    ))
    
  },
  
  .getitem = function(i) {
    
    start <- self$starts[i]
    end <- start + self$n_timesteps - 1
    lag <- 1
    
    list(
      x = self$x[start:end],
      y = self$x[(start+lag):(end+lag)]$squeeze(2)
    )
    
  },
  
  .length = function() {
    length(self$starts) 
  }
)

batch_size <- 32

train_ds <- elec_dataset(elec_train, n_timesteps, sample_frac = 0.5)
train_dl <- train_ds %>% dataloader(batch_size = batch_size, shuffle = TRUE)

valid_ds <- elec_dataset(elec_valid, n_timesteps, sample_frac = 0.5)
valid_dl <- valid_ds %>% dataloader(batch_size = batch_size)

test_ds <- elec_dataset(elec_test, n_timesteps)
test_dl <- test_ds %>% dataloader(batch_size = 1)

Model

Model-wise, we again encounter the three modules familiar from the previous post: encoder, decoder, and top-level seq2seq module. However, there is an additional component: the attention module, used by the decoder to obtain attention weights.

Encoder

The encoder still works the same way. It wraps an RNN, and returns the final state.

encoder_module <- nn_module(
  
  initialize = function(type, input_size, hidden_size, num_layers = 1, dropout = 0) {
    
    self$type <- type
    
    self$rnn <- if (self$type == "gru") {
      nn_gru(
        input_size = input_size,
        hidden_size = hidden_size,
        num_layers = num_layers,
        dropout = dropout,
        batch_first = TRUE
      )
    } else {
      nn_lstm(
        input_size = input_size,
        hidden_size = hidden_size,
        num_layers = num_layers,
        dropout = dropout,
        batch_first = TRUE
      )
    }
    
  },
  
  forward = function(x) {
    
    x <- self$rnn(x)
    
    # return last states for all layers
    # per layer, a single tensor for GRU, a list of 2 tensors for LSTM
    x <- x[[2]]
    x
    
  }
)

Attention module

In basic seq2seq, whenever it had to generate a new value, the decoder took into account two things: its prior state, and the previous output generated. In an attention-enriched setup, the decoder additionally receives the complete output from the encoder. In deciding what subset of that output should matter, it gets help from a new agent, the attention module.

This, then, is the attention module’s raison d’être: Given current decoder state and well as complete encoder outputs, obtain a weighting of those outputs indicative of how relevant they are to what the decoder is currently up to. This procedure results in the so-called attention weights: a normalized score, for each time step in the encoding, that quantify their respective importance.

Attention may be implemented in a number of different ways. Here, we show two implementation options, one additive, and one multiplicative.

Additive attention

In additive attention, encoder outputs and decoder state are commonly either added or concatenated (we choose to do the latter, below). The resulting tensor is run through a linear layer, and a softmax is applied for normalization.

attention_module_additive <- nn_module(
  
  initialize = function(hidden_dim, attention_size) {
    
    self$attention <- nn_linear(2 * hidden_dim, attention_size)
    
  },
  
  forward = function(state, encoder_outputs) {
    
    # function argument shapes
    # encoder_outputs: (bs, timesteps, hidden_dim)
    # state: (1, bs, hidden_dim)
    
    # multiplex state to allow for concatenation (dimensions 1 and 2 must agree)
    seq_len <- dim(encoder_outputs)[2]
    # resulting shape: (bs, timesteps, hidden_dim)
    state_rep <- state$permute(c(2, 1, 3))$repeat_interleave(seq_len, 2)
    
    # concatenate along feature dimension
    concat <- torch_cat(list(state_rep, encoder_outputs), dim = 3)
    
    # run through linear layer with tanh
    # resulting shape: (bs, timesteps, attention_size)
    scores <- self$attention(concat) %>% 
      torch_tanh()
    
    # sum over attention dimension and normalize
    # resulting shape: (bs, timesteps) 
    attention_weights <- scores %>%
      torch_sum(dim = 3) %>%
      nnf_softmax(dim = 2)
    
    # a normalized score for every source token
    attention_weights
  }
)

Multiplicative attention

In multiplicative attention, scores are obtained by computing dot products between decoder state and all of the encoder outputs. Here too, a softmax is then used for normalization.

attention_module_multiplicative <- nn_module(
  
  initialize = function() {
    
    NULL
    
  },
  
  forward = function(state, encoder_outputs) {
    
    # function argument shapes
    # encoder_outputs: (bs, timesteps, hidden_dim)
    # state: (1, bs, hidden_dim)

    # allow for matrix multiplication with encoder_outputs
    state <- state$permute(c(2, 3, 1))
 
    # prepare for scaling by number of features
    d <- torch_tensor(dim(encoder_outputs)[3], dtype = torch_float())
       
    # scaled dot products between state and outputs
    # resulting shape: (bs, timesteps, 1)
    scores <- torch_bmm(encoder_outputs, state) %>%
      torch_div(torch_sqrt(d))
    
    # normalize
    # resulting shape: (bs, timesteps) 
    attention_weights <- scores$squeeze(3) %>%
      nnf_softmax(dim = 2)
    
    # a normalized score for every source token
    attention_weights
  }
)

Decoder

Once attention weights have been computed, their actual application is handled by the decoder. Concretely, the method in question, weighted_encoder_outputs(), computes a product of weights and encoder outputs, making sure that each output will have appropriate impact.

The rest of the action then happens in forward(). A concatenation of weighted encoder outputs (often called “context”) and current input is run through an RNN. Then, an ensemble of RNN output, context, and input is passed to an MLP. Finally, both RNN state and current prediction are returned.

decoder_module <- nn_module(
  
  initialize = function(type, input_size, hidden_size, attention_type, attention_size = 8, num_layers = 1) {
    
    self$type <- type
    
    self$rnn <- if (self$type == "gru") {
      nn_gru(
        input_size = input_size,
        hidden_size = hidden_size,
        num_layers = num_layers,
        batch_first = TRUE
      )
    } else {
      nn_lstm(
        input_size = input_size,
        hidden_size = hidden_size,
        num_layers = num_layers,
        batch_first = TRUE
      )
    }
    
    self$linear <- nn_linear(2 * hidden_size + 1, 1)
    
    self$attention <- if (attention_type == "multiplicative") attention_module_multiplicative()
      else attention_module_additive(hidden_size, attention_size)
    
  },
  
  weighted_encoder_outputs = function(state, encoder_outputs) {

    # encoder_outputs is (bs, timesteps, hidden_dim)
    # state is (1, bs, hidden_dim)
    # resulting shape: (bs * timesteps)
    attention_weights <- self$attention(state, encoder_outputs)
    
    # resulting shape: (bs, 1, seq_len)
    attention_weights <- attention_weights$unsqueeze(2)
    
    # resulting shape: (bs, 1, hidden_size)
    weighted_encoder_outputs <- torch_bmm(attention_weights, encoder_outputs)
    
    weighted_encoder_outputs
    
  },
  
  forward = function(x, state, encoder_outputs) {
 
    # encoder_outputs is (bs, timesteps, hidden_dim)
    # state is (1, bs, hidden_dim)
    
    # resulting shape: (bs, 1, hidden_size)
    context <- self$weighted_encoder_outputs(state, encoder_outputs)
    
    # concatenate input and context
    # NOTE: this repeating is done to compensate for the absence of an embedding module
    # that, in NLP, would give x a higher proportion in the concatenation
    x_rep <- x$repeat_interleave(dim(context)[3], 3) 
    rnn_input <- torch_cat(list(x_rep, context), dim = 3)
    
    # resulting shapes: (bs, 1, hidden_size) and (1, bs, hidden_size)
    rnn_out <- self$rnn(rnn_input, state)
    rnn_output <- rnn_out[[1]]
    next_hidden <- rnn_out[[2]]
    
    mlp_input <- torch_cat(list(rnn_output$squeeze(2), context$squeeze(2), x$squeeze(2)), dim = 2)
    
    output <- self$linear(mlp_input)
    
    # shapes: (bs, 1) and (1, bs, hidden_size)
    list(output, next_hidden)
  }
  
)

seq2seq module

The seq2seq module is basically unchanged (apart from the fact that now, it allows for attention module configuration). For a detailed explanation of what happens here, please consult the previous post.

seq2seq_module <- nn_module(
  
  initialize = function(type, input_size, hidden_size, attention_type, attention_size, n_forecast, 
                        num_layers = 1, encoder_dropout = 0) {
    
    self$encoder <- encoder_module(type = type, input_size = input_size, hidden_size = hidden_size,
                                   num_layers, encoder_dropout)
    self$decoder <- decoder_module(type = type, input_size = 2 * hidden_size, hidden_size = hidden_size,
                                   attention_type = attention_type, attention_size = attention_size, num_layers)
    self$n_forecast <- n_forecast
    
  },
  
  forward = function(x, y, teacher_forcing_ratio) {
    
    outputs <- torch_zeros(dim(x)[1], self$n_forecast)$to(device = device)
    encoded <- self$encoder(x)
    encoder_outputs <- encoded[[1]]
    hidden <- encoded[[2]]
    # list of (batch_size, 1), (1, batch_size, hidden_size)
    out <- self$decoder(x[ , n_timesteps, , drop = FALSE], hidden, encoder_outputs)
    # (batch_size, 1)
    pred <- out[[1]]
    # (1, batch_size, hidden_size)
    state <- out[[2]]
    outputs[ , 1] <- pred$squeeze(2)
    
    for (t in 2:self$n_forecast) {
      
      teacher_forcing <- runif(1) < teacher_forcing_ratio
      input <- if (teacher_forcing == TRUE) pred$unsqueeze(3) else y[ , t - 1]
      out <- self$decoder(pred$unsqueeze(3), state, encoder_outputs)
      pred <- out[[1]]
      state <- out[[2]]
      outputs[ , t] <- pred$squeeze(2)
      
    }
    
    outputs
  }
  
)

When instantiating the top-level model, we now have an additional choice: that between additive and multiplicative attention. In the “accuracy” sense of performance, my tests did not show any differences. However, the multiplicative variant is a lot faster.

net <- seq2seq_module("gru", input_size = 1, hidden_size = 32, attention_type = "multiplicative",
                      attention_size = 8, n_forecast = n_timesteps)

# training RNNs on the GPU currently prints a warning that may clutter 
# the console
# see https://github.com/mlverse/torch/issues/461
# alternatively, use 
# device <- "cpu"
device <- torch_device(if (cuda_is_available()) "cuda" else "cpu")

net <- net$to(device = device)

Training

Just like last time, in model training, we get to choose the degree of teacher forcing. Below, we go with a fraction of 0.0, that is, no forcing at all.

optimizer <- optim_adam(net$parameters, lr = 0.001)

num_epochs <- 100

train_batch <- function(b, teacher_forcing_ratio) {
  
  optimizer$zero_grad()
  output <- net(b$x$to(device = device), b$y$to(device = device), teacher_forcing_ratio)
  target <- b$y$to(device = device)
  
  loss <- nnf_mse_loss(output, target)
  loss$backward()
  optimizer$step()
  
  loss$item()
  
}

valid_batch <- function(b, teacher_forcing_ratio = 0) {
  
  output <- net(b$x$to(device = device), b$y$to(device = device), teacher_forcing_ratio)
  target <- b$y$to(device = device)
  
  loss <- nnf_mse_loss(output, target)
  
  loss$item()
  
}

for (epoch in 1:num_epochs) {
  
  net$train()
  train_loss <- c()
  
  coro::loop(for (b in train_dl) {
    loss <-train_batch(b, teacher_forcing_ratio = 0.3)
    train_loss <- c(train_loss, loss)
  })
  
  cat(sprintf("nEpoch %d, training: loss: %3.5f n", epoch, mean(train_loss)))
  
  net$eval()
  valid_loss <- c()
  
  coro::loop(for (b in valid_dl) {
    loss <- valid_batch(b)
    valid_loss <- c(valid_loss, loss)
  })
  
  cat(sprintf("nEpoch %d, validation: loss: %3.5f n", epoch, mean(valid_loss)))
}
# Epoch 1, training: loss: 0.83752 
# Epoch 1, validation: loss: 0.83167

# Epoch 2, training: loss: 0.72803 
# Epoch 2, validation: loss: 0.80804 

# ...
# ...

# Epoch 99, training: loss: 0.10385 
# Epoch 99, validation: loss: 0.21259 

# Epoch 100, training: loss: 0.10396 
# Epoch 100, validation: loss: 0.20975 

Evaluation

For visual inspection, we pick a few forecasts from the test set.

net$eval()

test_preds <- vector(mode = "list", length = length(test_dl))

i <- 1

vic_elec_test <- vic_elec_daily %>%
  filter(year(Date) == 2014, month(Date) %in% 1:4)


coro::loop(for (b in test_dl) {
  
  input <- b$x
  output <- net(b$x$to(device = device), b$y$to(device = device), teacher_forcing_ratio = 0)
  preds <- as.numeric(output)
  
  test_preds[[i]] <- preds
  i <<- i + 1
  
})

test_pred1 <- test_preds[[1]]
test_pred1 <- c(rep(NA, n_timesteps), test_pred1, rep(NA, nrow(vic_elec_test) - n_timesteps - n_forecast))

test_pred2 <- test_preds[[21]]
test_pred2 <- c(rep(NA, n_timesteps + 20), test_pred2, rep(NA, nrow(vic_elec_test) - 20 - n_timesteps - n_forecast))

test_pred3 <- test_preds[[41]]
test_pred3 <- c(rep(NA, n_timesteps + 40), test_pred3, rep(NA, nrow(vic_elec_test) - 40 - n_timesteps - n_forecast))

test_pred4 <- test_preds[[61]]
test_pred4 <- c(rep(NA, n_timesteps + 60), test_pred4, rep(NA, nrow(vic_elec_test) - 60 - n_timesteps - n_forecast))

test_pred5 <- test_preds[[81]]
test_pred5 <- c(rep(NA, n_timesteps + 80), test_pred5, rep(NA, nrow(vic_elec_test) - 80 - n_timesteps - n_forecast))


preds_ts <- vic_elec_test %>%
  select(Demand, Date) %>%
  add_column(
    ex_1 = test_pred1 * train_sd + train_mean,
    ex_2 = test_pred2 * train_sd + train_mean,
    ex_3 = test_pred3 * train_sd + train_mean,
    ex_4 = test_pred4 * train_sd + train_mean,
    ex_5 = test_pred5 * train_sd + train_mean) %>%
  pivot_longer(-Date) %>%
  update_tsibble(key = name)


preds_ts %>%
  autoplot() +
  scale_color_hue(h = c(80, 300), l = 70) +
  theme_minimal()
A sample of two-weeks-ahead predictions for the test set, 2014.

(#fig:unnamed-chunk-11)A sample of two-weeks-ahead predictions for the test set, 2014.

We can’t directly compare performance here to that of previous models in our series, as we’ve pragmatically redefined the task. The main goal, however, has been to introduce the concept of attention. Specifically, how to manually implement the technique – something that, once you’ve understood the concept, you may never have to do in practice. Instead, you would likely make use of existing tools that come with torch (multi-head attention and transformer modules), tools we may introduce in a future “season” of this series.

Thanks for reading!

Photo by David Clode on Unsplash

Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. 2014. “Neural Machine Translation by Jointly Learning to Align and Translate.” CoRR abs/1409.0473. http://arxiv.org/abs/1409.0473.

Dong, Yihe, Jean-Baptiste Cordonnier, and Andreas Loukas. 2021. “Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth.” arXiv E-Prints, March, arXiv:2103.03404. http://arxiv.org/abs/2103.03404.

Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. “Attention Is All You Need.” arXiv E-Prints, June, arXiv:1706.03762. http://arxiv.org/abs/1706.03762.

Vinyals, Oriol, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey E. Hinton. 2014. “Grammar as a Foreign Language.” CoRR abs/1412.7449. http://arxiv.org/abs/1412.7449.

Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. 2015. “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.” CoRR abs/1502.03044. http://arxiv.org/abs/1502.03044.

  1. For important papers see: Bahdanau, Cho, and Bengio (2014), Vinyals et al. (2014), Xu et al. (2015).↩︎

  2. Dong, Cordonnier, and Loukas (2021), thus replying to Vaswani et al. (2017), whose 2017 paper “went viral” for stating the opposite.↩︎

Categories
Offsites

torch time series, final episode: Attention

This is the final post in a four-part introduction to time-series forecasting with torch. These posts have been the story of a quest for multiple-step prediction, and by now, we’ve seen three different approaches: forecasting in a loop, incorporating a multi-layer perceptron (MLP), and sequence-to-sequence models. Here’s a quick recap.

  • As one should when one sets out for an adventurous journey, we started with an in-depth study of the tools at our disposal: recurrent neural networks (RNNs). We trained a model to predict the very next observation in line, and then, thought of a clever hack: How about we use this for multi-step prediction, feeding back individual predictions in a loop? The result , it turned out, was quite acceptable.

  • Then, the adventure really started. We built our first model “natively” for multi-step prediction, relieving the RNN a bit of its workload and involving a second player, a tiny-ish MLP. Now, it was the MLP’s task to project RNN output to several time points in the future. Although results were pretty satisfactory, we didn’t stop there.

  • Instead, we applied to numerical time series a technique commonly used in natural language processing (NLP): sequence-to-sequence (seq2seq) prediction. While forecast performance was not much different from the previous case, we found the technique to be more intuitively appealing, since it reflects the causal relationship between successive forecasts.

Today we’ll enrich the seq2seq approach by adding a new component: the attention module. Originally introduced around 20141, attention mechanisms have gained enormous traction, so much so that a recent paper title starts out “Attention is Not All You Need”2.

The idea is the following.

In the classic encoder-decoder setup, the decoder gets “primed” with an encoder summary just a single time: the time it starts its forecasting loop. From then on, it’s on its own. With attention, however, it gets to see the complete sequence of encoder outputs again every time it forecasts a new value. What’s more, every time, it gets to zoom in on those outputs that seem relevant for the current prediction step.

This is a particularly useful strategy in translation: In generating the next word, a model will need to know what part of the source sentence to focus on. How much the technique helps with numerical sequences, in contrast, will likely depend on the features of the series in question.

Data input

As before, we work with vic_elec, but this time, we partly deviate from the way we used to employ it. With the original, bi-hourly dataset, training the current model takes a long time, longer than readers will want to wait when experimenting. So instead, we aggregate observations by day. In order to have enough data, we train on years 2012 and 2013, reserving 2014 for validation as well as post-training inspection.

vic_elec_daily <- vic_elec %>%
  select(Time, Demand) %>%
  index_by(Date = date(Time)) %>%
  summarise(
    Demand = sum(Demand) / 1e3) 

elec_train <- vic_elec_daily %>% 
  filter(year(Date) %in% c(2012, 2013)) %>%
  as_tibble() %>%
  select(Demand) %>%
  as.matrix()

elec_valid <- vic_elec_daily %>% 
  filter(year(Date) == 2014) %>%
  as_tibble() %>%
  select(Demand) %>%
  as.matrix()

elec_test <- vic_elec_daily %>% 
  filter(year(Date) %in% c(2014), month(Date) %in% 1:4) %>%
  as_tibble() %>%
  select(Demand) %>%
  as.matrix()

train_mean <- mean(elec_train)
train_sd <- sd(elec_train)

We’ll attempt to forecast demand up to fourteen days ahead. How long, then, should be the input sequences? This is a matter of experimentation; all the more so now that we’re adding in the attention mechanism. (I suspect that it might not handle very long sequences so well).

Below, we go with fourteen days for input length, too, but that may not necessarily be the best possible choice for this series.

n_timesteps <- 7 * 2
n_forecast <- 7 * 2

elec_dataset <- dataset(
  name = "elec_dataset",
  
  initialize = function(x, n_timesteps, sample_frac = 1) {
    
    self$n_timesteps <- n_timesteps
    self$x <- torch_tensor((x - train_mean) / train_sd)
    
    n <- length(self$x) - self$n_timesteps - 1
    
    self$starts <- sort(sample.int(
      n = n,
      size = n * sample_frac
    ))
    
  },
  
  .getitem = function(i) {
    
    start <- self$starts[i]
    end <- start + self$n_timesteps - 1
    lag <- 1
    
    list(
      x = self$x[start:end],
      y = self$x[(start+lag):(end+lag)]$squeeze(2)
    )
    
  },
  
  .length = function() {
    length(self$starts) 
  }
)

batch_size <- 32

train_ds <- elec_dataset(elec_train, n_timesteps, sample_frac = 0.5)
train_dl <- train_ds %>% dataloader(batch_size = batch_size, shuffle = TRUE)

valid_ds <- elec_dataset(elec_valid, n_timesteps, sample_frac = 0.5)
valid_dl <- valid_ds %>% dataloader(batch_size = batch_size)

test_ds <- elec_dataset(elec_test, n_timesteps)
test_dl <- test_ds %>% dataloader(batch_size = 1)

Model

Model-wise, we again encounter the three modules familiar from the previous post: encoder, decoder, and top-level seq2seq module. However, there is an additional component: the attention module, used by the decoder to obtain attention weights.

Encoder

The encoder still works the same way. It wraps an RNN, and returns the final state.

encoder_module <- nn_module(
  
  initialize = function(type, input_size, hidden_size, num_layers = 1, dropout = 0) {
    
    self$type <- type
    
    self$rnn <- if (self$type == "gru") {
      nn_gru(
        input_size = input_size,
        hidden_size = hidden_size,
        num_layers = num_layers,
        dropout = dropout,
        batch_first = TRUE
      )
    } else {
      nn_lstm(
        input_size = input_size,
        hidden_size = hidden_size,
        num_layers = num_layers,
        dropout = dropout,
        batch_first = TRUE
      )
    }
    
  },
  
  forward = function(x) {
    
    x <- self$rnn(x)
    
    # return last states for all layers
    # per layer, a single tensor for GRU, a list of 2 tensors for LSTM
    x <- x[[2]]
    x
    
  }
)

Attention module

In basic seq2seq, whenever it had to generate a new value, the decoder took into account two things: its prior state, and the previous output generated. In an attention-enriched setup, the decoder additionally receives the complete output from the encoder. In deciding what subset of that output should matter, it gets help from a new agent, the attention module.

This, then, is the attention module’s raison d’être: Given current decoder state and well as complete encoder outputs, obtain a weighting of those outputs indicative of how relevant they are to what the decoder is currently up to. This procedure results in the so-called attention weights: a normalized score, for each time step in the encoding, that quantify their respective importance.

Attention may be implemented in a number of different ways. Here, we show two implementation options, one additive, and one multiplicative.

Additive attention

In additive attention, encoder outputs and decoder state are commonly either added or concatenated (we choose to do the latter, below). The resulting tensor is run through a linear layer, and a softmax is applied for normalization.

attention_module_additive <- nn_module(
  
  initialize = function(hidden_dim, attention_size) {
    
    self$attention <- nn_linear(2 * hidden_dim, attention_size)
    
  },
  
  forward = function(state, encoder_outputs) {
    
    # function argument shapes
    # encoder_outputs: (bs, timesteps, hidden_dim)
    # state: (1, bs, hidden_dim)
    
    # multiplex state to allow for concatenation (dimensions 1 and 2 must agree)
    seq_len <- dim(encoder_outputs)[2]
    # resulting shape: (bs, timesteps, hidden_dim)
    state_rep <- state$permute(c(2, 1, 3))$repeat_interleave(seq_len, 2)
    
    # concatenate along feature dimension
    concat <- torch_cat(list(state_rep, encoder_outputs), dim = 3)
    
    # run through linear layer with tanh
    # resulting shape: (bs, timesteps, attention_size)
    scores <- self$attention(concat) %>% 
      torch_tanh()
    
    # sum over attention dimension and normalize
    # resulting shape: (bs, timesteps) 
    attention_weights <- scores %>%
      torch_sum(dim = 3) %>%
      nnf_softmax(dim = 2)
    
    # a normalized score for every source token
    attention_weights
  }
)

Multiplicative attention

In multiplicative attention, scores are obtained by computing dot products between decoder state and all of the encoder outputs. Here too, a softmax is then used for normalization.

attention_module_multiplicative <- nn_module(
  
  initialize = function() {
    
    NULL
    
  },
  
  forward = function(state, encoder_outputs) {
    
    # function argument shapes
    # encoder_outputs: (bs, timesteps, hidden_dim)
    # state: (1, bs, hidden_dim)

    # allow for matrix multiplication with encoder_outputs
    state <- state$permute(c(2, 3, 1))
 
    # prepare for scaling by number of features
    d <- torch_tensor(dim(encoder_outputs)[3], dtype = torch_float())
       
    # scaled dot products between state and outputs
    # resulting shape: (bs, timesteps, 1)
    scores <- torch_bmm(encoder_outputs, state) %>%
      torch_div(torch_sqrt(d))
    
    # normalize
    # resulting shape: (bs, timesteps) 
    attention_weights <- scores$squeeze(3) %>%
      nnf_softmax(dim = 2)
    
    # a normalized score for every source token
    attention_weights
  }
)

Decoder

Once attention weights have been computed, their actual application is handled by the decoder. Concretely, the method in question, weighted_encoder_outputs(), computes a product of weights and encoder outputs, making sure that each output will have appropriate impact.

The rest of the action then happens in forward(). A concatenation of weighted encoder outputs (often called “context”) and current input is run through an RNN. Then, an ensemble of RNN output, context, and input is passed to an MLP. Finally, both RNN state and current prediction are returned.

decoder_module <- nn_module(
  
  initialize = function(type, input_size, hidden_size, attention_type, attention_size = 8, num_layers = 1) {
    
    self$type <- type
    
    self$rnn <- if (self$type == "gru") {
      nn_gru(
        input_size = input_size,
        hidden_size = hidden_size,
        num_layers = num_layers,
        batch_first = TRUE
      )
    } else {
      nn_lstm(
        input_size = input_size,
        hidden_size = hidden_size,
        num_layers = num_layers,
        batch_first = TRUE
      )
    }
    
    self$linear <- nn_linear(2 * hidden_size + 1, 1)
    
    self$attention <- if (attention_type == "multiplicative") attention_module_multiplicative()
      else attention_module_additive(hidden_size, attention_size)
    
  },
  
  weighted_encoder_outputs = function(state, encoder_outputs) {

    # encoder_outputs is (bs, timesteps, hidden_dim)
    # state is (1, bs, hidden_dim)
    # resulting shape: (bs * timesteps)
    attention_weights <- self$attention(state, encoder_outputs)
    
    # resulting shape: (bs, 1, seq_len)
    attention_weights <- attention_weights$unsqueeze(2)
    
    # resulting shape: (bs, 1, hidden_size)
    weighted_encoder_outputs <- torch_bmm(attention_weights, encoder_outputs)
    
    weighted_encoder_outputs
    
  },
  
  forward = function(x, state, encoder_outputs) {
 
    # encoder_outputs is (bs, timesteps, hidden_dim)
    # state is (1, bs, hidden_dim)
    
    # resulting shape: (bs, 1, hidden_size)
    context <- self$weighted_encoder_outputs(state, encoder_outputs)
    
    # concatenate input and context
    # NOTE: this repeating is done to compensate for the absence of an embedding module
    # that, in NLP, would give x a higher proportion in the concatenation
    x_rep <- x$repeat_interleave(dim(context)[3], 3) 
    rnn_input <- torch_cat(list(x_rep, context), dim = 3)
    
    # resulting shapes: (bs, 1, hidden_size) and (1, bs, hidden_size)
    rnn_out <- self$rnn(rnn_input, state)
    rnn_output <- rnn_out[[1]]
    next_hidden <- rnn_out[[2]]
    
    mlp_input <- torch_cat(list(rnn_output$squeeze(2), context$squeeze(2), x$squeeze(2)), dim = 2)
    
    output <- self$linear(mlp_input)
    
    # shapes: (bs, 1) and (1, bs, hidden_size)
    list(output, next_hidden)
  }
  
)

seq2seq module

The seq2seq module is basically unchanged (apart from the fact that now, it allows for attention module configuration). For a detailed explanation of what happens here, please consult the previous post.

seq2seq_module <- nn_module(
  
  initialize = function(type, input_size, hidden_size, attention_type, attention_size, n_forecast, 
                        num_layers = 1, encoder_dropout = 0) {
    
    self$encoder <- encoder_module(type = type, input_size = input_size, hidden_size = hidden_size,
                                   num_layers, encoder_dropout)
    self$decoder <- decoder_module(type = type, input_size = 2 * hidden_size, hidden_size = hidden_size,
                                   attention_type = attention_type, attention_size = attention_size, num_layers)
    self$n_forecast <- n_forecast
    
  },
  
  forward = function(x, y, teacher_forcing_ratio) {
    
    outputs <- torch_zeros(dim(x)[1], self$n_forecast)$to(device = device)
    encoded <- self$encoder(x)
    encoder_outputs <- encoded[[1]]
    hidden <- encoded[[2]]
    # list of (batch_size, 1), (1, batch_size, hidden_size)
    out <- self$decoder(x[ , n_timesteps, , drop = FALSE], hidden, encoder_outputs)
    # (batch_size, 1)
    pred <- out[[1]]
    # (1, batch_size, hidden_size)
    state <- out[[2]]
    outputs[ , 1] <- pred$squeeze(2)
    
    for (t in 2:self$n_forecast) {
      
      teacher_forcing <- runif(1) < teacher_forcing_ratio
      input <- if (teacher_forcing == TRUE) pred$unsqueeze(3) else y[ , t - 1]
      out <- self$decoder(pred$unsqueeze(3), state, encoder_outputs)
      pred <- out[[1]]
      state <- out[[2]]
      outputs[ , t] <- pred$squeeze(2)
      
    }
    
    outputs
  }
  
)

When instantiating the top-level model, we now have an additional choice: that between additive and multiplicative attention. In the “accuracy” sense of performance, my tests did not show any differences. However, the multiplicative variant is a lot faster.

net <- seq2seq_module("gru", input_size = 1, hidden_size = 32, attention_type = "multiplicative",
                      attention_size = 8, n_forecast = n_timesteps)

# training RNNs on the GPU currently prints a warning that may clutter 
# the console
# see https://github.com/mlverse/torch/issues/461
# alternatively, use 
# device <- "cpu"
device <- torch_device(if (cuda_is_available()) "cuda" else "cpu")

net <- net$to(device = device)

Training

Just like last time, in model training, we get to choose the degree of teacher forcing. Below, we go with a fraction of 0.0, that is, no forcing at all.

optimizer <- optim_adam(net$parameters, lr = 0.001)

num_epochs <- 100

train_batch <- function(b, teacher_forcing_ratio) {
  
  optimizer$zero_grad()
  output <- net(b$x$to(device = device), b$y$to(device = device), teacher_forcing_ratio)
  target <- b$y$to(device = device)
  
  loss <- nnf_mse_loss(output, target)
  loss$backward()
  optimizer$step()
  
  loss$item()
  
}

valid_batch <- function(b, teacher_forcing_ratio = 0) {
  
  output <- net(b$x$to(device = device), b$y$to(device = device), teacher_forcing_ratio)
  target <- b$y$to(device = device)
  
  loss <- nnf_mse_loss(output, target)
  
  loss$item()
  
}

for (epoch in 1:num_epochs) {
  
  net$train()
  train_loss <- c()
  
  coro::loop(for (b in train_dl) {
    loss <-train_batch(b, teacher_forcing_ratio = 0.3)
    train_loss <- c(train_loss, loss)
  })
  
  cat(sprintf("nEpoch %d, training: loss: %3.5f n", epoch, mean(train_loss)))
  
  net$eval()
  valid_loss <- c()
  
  coro::loop(for (b in valid_dl) {
    loss <- valid_batch(b)
    valid_loss <- c(valid_loss, loss)
  })
  
  cat(sprintf("nEpoch %d, validation: loss: %3.5f n", epoch, mean(valid_loss)))
}
# Epoch 1, training: loss: 0.83752 
# Epoch 1, validation: loss: 0.83167

# Epoch 2, training: loss: 0.72803 
# Epoch 2, validation: loss: 0.80804 

# ...
# ...

# Epoch 99, training: loss: 0.10385 
# Epoch 99, validation: loss: 0.21259 

# Epoch 100, training: loss: 0.10396 
# Epoch 100, validation: loss: 0.20975 

Evaluation

For visual inspection, we pick a few forecasts from the test set.

net$eval()

test_preds <- vector(mode = "list", length = length(test_dl))

i <- 1

vic_elec_test <- vic_elec_daily %>%
  filter(year(Date) == 2014, month(Date) %in% 1:4)


coro::loop(for (b in test_dl) {
  
  input <- b$x
  output <- net(b$x$to(device = device), b$y$to(device = device), teacher_forcing_ratio = 0)
  preds <- as.numeric(output)
  
  test_preds[[i]] <- preds
  i <<- i + 1
  
})

test_pred1 <- test_preds[[1]]
test_pred1 <- c(rep(NA, n_timesteps), test_pred1, rep(NA, nrow(vic_elec_test) - n_timesteps - n_forecast))

test_pred2 <- test_preds[[21]]
test_pred2 <- c(rep(NA, n_timesteps + 20), test_pred2, rep(NA, nrow(vic_elec_test) - 20 - n_timesteps - n_forecast))

test_pred3 <- test_preds[[41]]
test_pred3 <- c(rep(NA, n_timesteps + 40), test_pred3, rep(NA, nrow(vic_elec_test) - 40 - n_timesteps - n_forecast))

test_pred4 <- test_preds[[61]]
test_pred4 <- c(rep(NA, n_timesteps + 60), test_pred4, rep(NA, nrow(vic_elec_test) - 60 - n_timesteps - n_forecast))

test_pred5 <- test_preds[[81]]
test_pred5 <- c(rep(NA, n_timesteps + 80), test_pred5, rep(NA, nrow(vic_elec_test) - 80 - n_timesteps - n_forecast))


preds_ts <- vic_elec_test %>%
  select(Demand, Date) %>%
  add_column(
    ex_1 = test_pred1 * train_sd + train_mean,
    ex_2 = test_pred2 * train_sd + train_mean,
    ex_3 = test_pred3 * train_sd + train_mean,
    ex_4 = test_pred4 * train_sd + train_mean,
    ex_5 = test_pred5 * train_sd + train_mean) %>%
  pivot_longer(-Date) %>%
  update_tsibble(key = name)


preds_ts %>%
  autoplot() +
  scale_color_hue(h = c(80, 300), l = 70) +
  theme_minimal()
A sample of two-weeks-ahead predictions for the test set, 2014.

(#fig:unnamed-chunk-11)A sample of two-weeks-ahead predictions for the test set, 2014.

We can’t directly compare performance here to that of previous models in our series, as we’ve pragmatically redefined the task. The main goal, however, has been to introduce the concept of attention. Specifically, how to manually implement the technique – something that, once you’ve understood the concept, you may never have to do in practice. Instead, you would likely make use of existing tools that come with torch (multi-head attention and transformer modules), tools we may introduce in a future “season” of this series.

Thanks for reading!

Photo by David Clode on Unsplash

Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. 2014. “Neural Machine Translation by Jointly Learning to Align and Translate.” CoRR abs/1409.0473. http://arxiv.org/abs/1409.0473.

Dong, Yihe, Jean-Baptiste Cordonnier, and Andreas Loukas. 2021. “Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth.” arXiv E-Prints, March, arXiv:2103.03404. http://arxiv.org/abs/2103.03404.

Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. “Attention Is All You Need.” arXiv E-Prints, June, arXiv:1706.03762. http://arxiv.org/abs/1706.03762.

Vinyals, Oriol, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey E. Hinton. 2014. “Grammar as a Foreign Language.” CoRR abs/1412.7449. http://arxiv.org/abs/1412.7449.

Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. 2015. “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.” CoRR abs/1502.03044. http://arxiv.org/abs/1502.03044.

  1. For important papers see: Bahdanau, Cho, and Bengio (2014), Vinyals et al. (2014), Xu et al. (2015).↩︎

  2. Dong, Cordonnier, and Loukas (2021), thus replying to Vaswani et al. (2017), whose 2017 paper “went viral” for stating the opposite.↩︎

Categories
Misc

Training a Neural Network to Win The World’s Oldest Board Game

Training a Neural Network to Win The World’s Oldest Board Game submitted by /u/_Ari___
[visit reddit] [comments]
Categories
Misc

Current Sensing with embedded TF Lite

Current Sensing with embedded TF Lite submitted by /u/LargeCardinal
[visit reddit] [comments]
Categories
Misc

NVIDIA A100 GPUs Available on Google Cloud’s Compute Engine

By Dave Salvator, Senior Manager, Product Marketing at NVIDIA NVIDIA and Google Cloud are making it possible for applications to push the boundaries of accelerated AI across a wide array of applications. With its new A2 VM, announced today, Google Cloud provides customers the largest configuration of 16 NVIDIA A100 GPUs in a single VM. Also […]

By Dave Salvator, Senior Manager, Product Marketing at NVIDIA

NVIDIA and Google Cloud are making it possible for applications to push the boundaries of accelerated AI across a wide array of applications. With its new A2 VM, announced today, Google Cloud provides customers the largest configuration of 16 NVIDIA A100 GPUs in a single VM. Also available are smaller GPU configurations including 1, 2, 4, and 8 GPUs per VM for added flexibility. The single VM offering features NVIDIA’s NVLink Fabric to deliver greater multi-GPU scalability for ultrafast deep learning training.  

Additionally, the A2 is supported by many Google Managed Services for developers:

  • Google Kubernetes Engine (GKE)
  • Google Cloud AI Platform training 
  • Dataproc – fully managed service for SPARK and DASK 
  • Support for A100 with pre-built VM images including versions for CUDA, TensorFlow, PyTorch, and RAPIDS

Developers can take advantage of NVIDIA NGC, a repository for containerized, pre-configured software to accelerate A100 development workflows and building solutions in Google Cloud.

Google Cloud was the first to announce their A100 private alpha program and now has referenceable customers including Cash App, which uses NVIDIA A100 GPUs to power mobile payment innovation and research. Google Cloud AI and NVIDIA delivered a 66% speedup to the processing time needed for Cash App to complete a critical machine learning workflow.

Head over to Google Cloud’s GPU Instances page to learn more about the new A2 instance. 

Categories
Misc

Get Ready for the Future – Boost Your Skills with Hands-on Training at GTC

The NVIDIA Deep Learning Institute is offering nine instructor-led workshops at this year’s GTC on a wide range of advanced software development topics in AI, accelerated computing, and data science.

The NVIDIA Deep Learning Institute is offering nine instructor-led workshops at this year’s GTC from April 12-16 on a wide range of advanced software development topics in AI, accelerated computing, and data science. Developers, data scientists, researchers, and students can get practical experience powered by GPUs in the cloud. Upon completion, participants earn an NVIDIA DLI certificate to demonstrate subject matter competency and support career growth.

The full-day workshops offer a comprehensive learning experience that includes hands-on exercises and guidance from expert instructors certified by DLI. These immersive learning experiences are offered live online in many time zones to reach developers worldwide and available in English, Traditional Chinese, Japanese and Korean.

The $249 registration fee includes interactive learning materials, guided instruction, and access to fully configured GPU-accelerated development servers for hands-on exercises. 

Register here to DLI workshops at GTC or to view the full schedule. DLI workshops offered at GTC include:

  • Fundamentals of Deep Learning — Build the confidence to take on a deep learning project by learning how to train a deep neural network model, work with common data types and model architectures, use transfer learning between models, and more.
  • Fundamentals of Accelerated Computing with CUDA C/C++ — Find out how to accelerate CPU-only applications by exposing their latent parallelism on GPUs, using essential techniques like CUDA memory management to optimize accelerated applications.
  • Fundamentals of Deep Learning for Multi-GPUs — Scale deep learning training to multiple GPUs, significantly shortening the time required to train neural networks, making solving complex problems with deep learning feasible.
  • Accelerating CUDA C++ Applications with Multiple GPUs — Learn how to write CUDA C++ applications that efficiently and correctly utilize all available GPUs in a single node, dramatically improving the performance of applications and making the most cost-effective use of systems with multiple GPUs.
  • Fundamentals of Accelerated Data Science with RAPIDS — Learn how to manipulate large data sets directly on the GPU using Dask and cuDF.  Apply GPU-accelerated machine learning algorithms including XGBoost, cuGRAPH and cuML to perform data analysis at massive scale.
  • Building Transformer-Based Natural Language Processing Applications — Learn about NLP topics like Word2Vec and recurrent neural network-based embeddings, as well as Transformer architecture features and how to improve them. Use pre-trained NLP models for text classification, named-entity recognition and question answering, and deploy refined models for live applications.
  • Applications of AI for Anomaly Detection — Discover how to implement multiple AI-based solutions to identify network intrusions, using accelerated XGBoost, deep learning-based autoencoders and generative adversarial networks.
  • Deep Learning for Autonomous Vehicles–Perception — Learn how to design, train, and deploy deep neural networks and optimize perception components for autonomous vehicles using the NVIDIA DRIVE development platform.
  • Deep Learning for Intelligent Video Analytics — Learn how to build object detection and tracking models to analyze data from large-scale video streams using NVIDIA DeepStream technology. Complete hands-on tasks to build, train, and deploy deep learning models to analyze parking lot camera feeds of a hardware-accelerated traffic management system.

GTC is free to attend and you can register here to access over 1,400 sessions, including short DLI training labs.

Register to reserve your spot to a full-day DLI instructor-led workshop. Space is limited so register early.

Categories
Misc

Trying to run GPT-2-master, getting error

Hi there! im very new at this so if i’ve left anything out please let me know!
so far ive downloaded the file, used cd to get into it and then have done this:

conda create -n py36 python=3.6 anaconda pip3 install tensorflow==1.12.0 pip3 install -r requirements.txt python3 download_model.py 124M 

and upon running

python3 src/interactive_conditional_samples.py --top_k 40 

Anaconda returns

C:ProgramDataAnaconda3envspy36libsite-packagestensorflowpythonframeworkdtypes.py:523: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_qint8 = np.dtype([("qint8", np.int8, 1)]) C:ProgramDataAnaconda3envspy36libsite-packagestensorflowpythonframeworkdtypes.py:524: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_quint8 = np.dtype([("quint8", np.uint8, 1)]) C:ProgramDataAnaconda3envspy36libsite-packagestensorflowpythonframeworkdtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_qint16 = np.dtype([("qint16", np.int16, 1)]) C:ProgramDataAnaconda3envspy36libsite-packagestensorflowpythonframeworkdtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_quint16 = np.dtype([("quint16", np.uint16, 1)]) C:ProgramDataAnaconda3envspy36libsite-packagestensorflowpythonframeworkdtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. _np_qint32 = np.dtype([("qint32", np.int32, 1)]) C:ProgramDataAnaconda3envspy36libsite-packagestensorflowpythonframeworkdtypes.py:532: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. np_resource = np.dtype([("resource", np.ubyte, 1)]) 2021-03-15 16:39:25.892397: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 Traceback (most recent call last): File "src/interactive_conditional_samples.py", line 91, in <module> fire.Fire(interact_model) File "C:ProgramDataAnaconda3envspy36libsite-packagesfirecore.py", line 141, in Fire component_trace = _Fire(component, args, parsed_flag_args, context, name) File "C:ProgramDataAnaconda3envspy36libsite-packagesfirecore.py", line 471, in _Fire target=component.__name__) File "C:ProgramDataAnaconda3envspy36libsite-packagesfirecore.py", line 681, in _CallAndUpdateTrace component = fn(*varargs, **kwargs) File "src/interactive_conditional_samples.py", line 65, in interact_model temperature=temperature, top_k=top_k, top_p=top_p File "C:UserslouisOneDriveDesktopgpt-2-testsrcsample.py", line 74, in sample_sequence past, prev, output = body(None, context, context) File "C:UserslouisOneDriveDesktopgpt-2-testsrcsample.py", line 66, in body logits = top_p_logits(logits, p=top_p) File "C:UserslouisOneDriveDesktopgpt-2-testsrcsample.py", line 28, in top_p_logits sorted_logits = tf.sort(logits, direction='DESCENDING', axis=-1) AttributeError: module 'tensorflow' has no attribute 'sort' 

I have no idea what any of this means. if someone could help me out it would mean a lot! 🙂

submitted by /u/LouisHendrich
[visit reddit] [comments]