Categories
Misc

Optimizing System Latency with NVIDIA Reflex SDK – Available Now

Measuring and optimizing system latency is one of the hardest challenges during game development and the NVIDIA Reflex SDK helps developers solve that issue.

Measuring and optimizing system latency is one of the hardest challenges during game development and the NVIDIA Reflex SDK helps developers solve that issue. NVIDIA Reflex is an easy to integrate SDK that provides API to both measure and reduce system latency – giving players a more responsive experience.  Epic, Bungie, Respawn, Activision Blizzard, and Riot have integrated the NVIDIA Reflex Low Latency mode  into their titles, giving gamers a responsive experience without  dips in resolution or framerate. 

The NVIDIA Reflex SDK offers developers:

  • Low Latency Mode – Aligns game engine work to complete just-in-time for rendering, eliminating the GPU render queue and reducing CPU back pressure in GPU-bound scenarios, thus reducing latency in GPU bound scenarios.
  • Latency Markers – Real time latency metrics broken down by game pipeline stage: Input, Simulation, Render Submission, Graphics Driver, Render Queue, and GPU Render.  Great for debugging and for real time in-game overlays. 
  • Flash Indicator – Using the marker system, the flash indicator marker draws a small white rectangle on the screen each click.  This is helpful when automating the use of a tool like the NVIDIA Reflex Latency Analyzer to measure latency. 

The NVIDIA Reflex SDK is a low latency suite of esports technologies designed to measure, analyze and reduce input latency. The SDK has been built to support custom engines as well as popular game engines such as UE4 and Unity.

Get Started with NVIDIA Reflex >

Categories
Misc

NVIDIA Announces Upcoming Events for Financial Community

SANTA CLARA, Calif., Dec. 17, 2020 (GLOBE NEWSWIRE) — NVIDIA will present at the following events for the financial community: J.P. Morgan Healthcare ConferenceMonday, Jan. 11, at 1:30 p.m. …

Categories
Misc

Updates to NVIDIA’s Unreal Engine 4 Branch, DLSS, and RTXGI Available Now

NVIDIA has released updates to DLSS, NVIDIA’s Unreal Engine 4 Branch, and RTXGl.

To help developers get the most out of Unreal Engine 4 as they head into the new year, NVIDIA RTX UE4.26 has just been released. We have also released the first DLSS plugin that can be used with both NVIDIA’s NvRTX branch and mainline UE4, along with an updated UE4 Plugin for RTX Global Illumination.

NVIDIA RTX UE4.26 

The new NVIDIA UE4.26 Branch offers all of the benefits of mainline UE4.26, while providing some additional features: 

  • Faster ray tracing
    • NVRTX includes a number of improvements to ray tracing performance. Some of these are tunable, some are automatic. 
  • New tools
    • New debugging tools like the BVH viewer and Ray Timing Visualization allows developers to get a handle on ray tracing cost in their scene and get it tuned for speed.
  • Hybrid Translucency
    • Another way to do ray traced translucency, with greater compatibility, speed and rendering options.  
  • World position offset simulation for ray traced instanced static meshes (beta)
    • Allows ambient motion of foliage like trees and grass.
    • Uses approximate technique of shared animations to reduce overhead for simulating a full forest.
    • Selectable per instance type.
  • Inexact Shadows (beta)
    • Deals with potential mesh mismatches of ray traced and raster geometry.
    • Dithers shadow testing to hide potential artifacts. 
    • Enables approximations that improve performance in the management of ray tracing data.

An updated build of NVIDIA RTX UE4.25 has also been released, which includes all of the new features listed. 

Both branches can be found here

NVIDIA DLSS Plugin for UE4

NVIDIA DLSS is a deep learning neural network that boosts frame rates and generates beautiful, sharp images for your games. It delivers the performance headroom needed to maximize ray tracing settings and increase output resolution. It is available for the first time for mainline UE4 (in beta), compatible with UE4.26 Enjoy great scaling across all RTX GPUs and resolutions, and the new ultra performance mode for 8K gaming. 

Request access to the beta for NVIDIA DLSS plugin for UE4 here.

NVIDIA RTXGI Plugin for UE4

Leveraging the power of ray tracing, NVIDIA RTX Global Illumination (RTXGI) provides scalable solutions to compute multi-bounce indirect lighting without bake times, light leaks, or expensive per-frame costs. RTXGI is supported on any DXR-enabled GPU and is an ideal starting point to bring the benefits of ray tracing to your existing tools, knowledge, and capabilities. We have updated our RTXGI UE4 plugin with bugs fixes, image quality improvements, and support for UE4.26.

Request access to the RTXGI plugin for UE4 here

Categories
Offsites

End-to-End, Transferable Deep RL for Graph Optimization

An increasing number of applications are driven by large and complex neural networks trained on diverse sets of accelerators. This process is facilitated by ML compilers that map high-level computational graphs to low-level, device-specific executables. In doing so, ML compilers need to solve many optimization problems, including graph rewriting, assignment of operations on devices, operation fusion, layout and tiling of tensors, and scheduling. For example, in a device placement problem, the compiler needs to determine the mapping between operations in the computational graph to the target physical devices so that an objective function, such as training step time, can be minimized. The placement performance is determined by a mixture of intricate factors, including inter-device network bandwidth, peak device memory, co-location constraints, etc., making it challenging for heuristics or search-based algorithms, which typically settle for fast, but sub-optimal, solutions. Furthermore, heuristics are hard to develop and maintain, especially as newer model architectures emerge.

Recent attempts at using learning-based approaches have demonstrated promising results, but they have a number of limitations that make them infeasible to be deployed in practice. Firstly, these approaches do not easily generalize to unseen graphs, especially those arising from newer model architectures, and second, they have poor sample efficiency, leading to high resource consumption during training. Finally, they are only able to solve a single optimization task, and consequently, do not capture the dependencies across the tightly coupled optimization problems in the compilation stack.

In “Transferable Graph Optimizers for ML Compilers”, recently published as an oral paper at NeurIPS 2020, we propose an end-to-end, transferable deep reinforcement learning method for computational graph optimization (GO) that overcomes all of the above limitations. We demonstrate 33%-60% speedup on three graph optimization tasks compared to TensorFlow default optimization. On a diverse set of representative graphs consisting of up to 80,000 nodes, including Inception-v3, Transformer-XL, and WaveNet, GO achieves an average 21% improvement over expert optimization and an 18% improvement over the prior state of the art with 15x faster convergence.

Graph Optimization Problems in ML Compilers
There are three coupled optimization tasks that frequently arise in ML compilers, which we formulate as decision problems that can be solved using a learned policy. The decision problems for each of the tasks can be reframed as making a decision for each node in the computational graph.

The first optimization task is device placement, where the goal is to determine how best to assign the nodes of the graph to the physical devices on which it runs such that the end-to-end run time is minimized.

The second optimization task is operation scheduling. An operation in a computational graph is ready to run when its incoming tensors are present in the device memory. A frequently used scheduling strategy is to maintain a ready queue of operations for each device and schedule operations in first-in-first-out order. However, this scheduling strategy does not take into account the downstream operations placed on other devices that might be blocked by an operation, and often leads to schedules with underutilized devices. To find schedules that can keep track of such cross-device dependencies, our approach uses a priority-based scheduling algorithm that schedules operations in the ready queue based on the priority of each. Similar to device placement, operation scheduling can then be formulated as the problem of learning a policy that assigns a priority for each node in the graph to maximize a reward based on run time.

The third optimization task is operation fusion. For brevity we omit a detailed discussion of this problem here, and instead just note that similar to priority-based scheduling, operation fusion can also use a priority-based algorithm to decide which nodes to fuse. The goal of the policy network in this case is again to assign a priority for each node in the graph.

Finally, it is important to recognize that the decisions taken in each of the three optimization problems can affect the optimal decision for the other problems. For example, placing two nodes on two different devices effectively disables fusion and introduces a communication delay that can influence scheduling.

RL Policy Network Architecture
Our research presents GO, a deep RL framework that can be adapted to solve each of the aforementioned optimization problems — both individually as well as jointly. There are three key aspects of the proposed architecture:

First, we use graph neural networks (specifically GraphSAGE) to capture the topological information encoded in the computational graph. The inductive network of GraphSAGE leverages node attribute information to generalize to previously unseen graphs, which enables decision making for unseen data without incurring significant cost on training.

Second, computational graphs for many models often contain more than 10k nodes. Solving the optimization problems effectively over such large scales requires that the network is able to capture long-range dependencies between nodes. GO’s architecture includes a scalable attention network that uses segment-level recurrence to capture such long-range node dependencies.

Third, ML compilers need to solve optimization problems over a wide variety of graphs from different application domains. A naive strategy of training a shared policy network with heterogeneous graphs is unlikely to capture the idiosyncrasies of a particular class of graphs. To overcome this, GO uses a feature modulation mechanism that allows the network to specialize for specific graph types without increasing the number of parameters.

Overview of GO: An end-to-end graph policy network that combines graph embedding and sequential attention.

To jointly solve multiple dependent optimization tasks, GO has the ability to add additional recurrent attention layers for each task with parameters shared across different tasks. The recurrent attention layers with residual connections of actions enables tracking inter-task dependencies.

Multi-task policy network that extends GO’s policy network with additional recurrent attention layers for each task and residual connections. GE: Graph Embedding, FC: Fully-Connected Layer, Nxf: fusion action dimension, Fxd: placement action dimension, Nxs: scheduling action dimension.

Results
Next, we present evaluation results on a single-task speedup on a device placement task based on real-hardware measurements, generalization to unseen graphs with different GO variants, and multi-task performance jointly optimizing operations fusion, device placement, and scheduling.

Speedup:
To evaluate the performance of this architecture, we apply GO to a device placement problem based on real-hardware evaluation, where we first train the model separately on each of our workloads. This approach, called GO-one, consistently outperforms expert manual placement (HP), TensorFlow METIS placement, and Hierarchical Device Placement (HDP) — the current state-of-the-art reinforcement learning-based device placement. Importantly, with the efficient end-to-end single-shot placement, GO-one has a 15x speedup in convergence time of the placement network over HDP.

Our empirical results show that GO-one consistently outperforms expert placement, TensorFlow METIS placement, and hierarchical device placement (HDP). Because GO is designed in a way to scale up to extremely large graphs consisting of over 80,000 nodes like an 8-layer Google Neural Machine Translation (GNMT) model, it outperforms previous approaches, including HDP, REGAL, and Placeto. GO achieves optimized graph runtimes for large graphs like GNMT that are 21.7% and 36.5% faster than HP and HDP, respectively. Overall, GO-one achieves on average 20.5% and 18.2% run time reduction across a diverse set of 14 graphs, compared to HP and HDP respectively. Importantly, with the efficient end-to-end single-shot placement, GO-one has a 15x speedup in convergence time of the placement network over HDP.

Generalization:
GO generalizes to unseen graphs using offline pre-training followed by fine-tuning on the unseen graphs. During pre-training, we train GO on heterogeneous subsets of graphs from the training set. We train GO for 1000 steps on each such batch of graphs before switching to the next. This pretrained model is then fine-tuned (GO-generalization+finetune) on hold-out graphs for fewer than 50 steps, which typically takes less than one minute. GO-generalization+finetune for hold-out graphs outperforms both expert placement and HDP consistently on all datasets, and on average matches GO-one.

We also run inference directly on just the pre-trained model without any fine-tuning for the target hold-out graphs, and name this GO-generalization-zeroshot. The performance of this untuned model is only marginally worse than GO-generalization+finetune, while being slightly better than expert placement and HDP. This indicates that both graph embedding and the learned policies transfer efficiently, allowing the model to generalize to the unseen data.

Generalization across heterogeneous workload graphs. The figure shows a comparison of two different generalization strategies for GO when trained with graphs from 5 (except the held-out one) of the 6 workloads (Inception-v3, AmoebaNet, recurrent neural network language model (RNNLM), Google Neural Machine Translation (GNMT), Transformer-XL (TRFXL), WaveNet), and evaluated on the held-out workload (x-axis).

Co-optimizing placement, scheduling, and fusion (pl+sch+fu):
Optimizing simultaneously for placement, scheduling and fusion provides 30%-73% speedup compared to the single-gpu unoptimized case and 33%-60% speedup compared to TensorFlow default placement, scheduling, and fusion. Comparing to optimizing each tasks individually, multi-task GO (pl+sch+fu) outperforms single-task GO (p | sch | fu) — optimizing all tasks, one at a time — by an average of 7.8%. Furthermore, for all workloads, co-optimizing all three tasks offers faster run time than optimizing any two of them and using the default policy for the third.

Run time for various workloads on multi-task optimizations. TF-default: TF GPU default placement, fusion, and scheduling. hp-only: human placement only with default scheduling and fusion. pl-only: GO placement only with default scheduling and fusion. pl | sch: GO optimizes placement and scheduling individually with default fusion. pl+sch: multi-task GO co-optimizes placement and scheduling with default fusion. sch+fu: multi-task GO co-optimizes scheduling and fusion with human placement. pl | sch | fu: GO optimizes placement, scheduling, and fusion separately. pl+sch+fu: multi-task GO co-optimizes placement, scheduling, and fusion.

Conclusion
The increasing complexity and diversity of hardware accelerators has made the development of robust and adaptable ML frameworks onerous and time-consuming, often requiring multiple years of effort from hundreds of engineers. In this article, we demonstrated that many of the optimization problems in such frameworks can be solved efficiently and optimally using a carefully designed learned approach.

Acknowledgements
This is joint work with Daniel Wong, Amirali Abdolrashidi, Peter Ma, Qiumin Xu, Hanxiao Liu, Mangpo Phitchaya Phothilimtha, Shen Wang, Anna Goldie, Azalia Mirhoseini, and James Laudon.

Categories
Misc

Developer Blog: Enhancing Memory Allocation with New NVIDIA CUDA 11.2 Features

CUDA 11.2 has several important features including programming model updates, new compiler features, and enhanced compatibility across CUDA releases. This post offers an overview of the key CUDA 11.2 software features and highlights:

CUDA 11.2 has several important features including programming model updates, new compiler features, and enhanced compatibility across CUDA releases. This post offers an overview of the key CUDA 11.2 software features and highlights:

Categories
Misc

Developer Blog: Monitoring High-Performance ML Models with RAPIDS and whylogs

With RAPIDS, data scientists can now train models 100X faster and more frequently. Like RAPIDS, we’ve ensured that our data logging solution at WhyLabs empowers users working with larger than memory datasets.

With RAPIDS, data scientists can now train models 100X faster and more frequently. Like RAPIDS, we’ve ensured that our data logging solution at WhyLabs empowers users working with larger than memory datasets.

Categories
Misc

Sustainable and Attainable: Zoox Unveils Autonomous Robotaxi Powered by NVIDIA

When it comes to future mobility, you may not have to pave as many paradises for personal car parking lots. This week, autonomous mobility company Zoox unveiled its much-anticipated purpose-built robotaxi. Designed for everyday urban mobility, the vehicle is powered by NVIDIA and is one of the first level 5 robotaxis featuring bi-directional capabilities, providing Read article >

The post Sustainable and Attainable: Zoox Unveils Autonomous Robotaxi Powered by NVIDIA appeared first on The Official NVIDIA Blog.

Categories
Misc

Defining loss when number of inputs is greater than number of outputs

I have trained a model which outputs multiple images (say 2) at
a time, and takes in multiple inputs (say 5) to do so. However my
loss (MSE) is supposed to apply to only 2 pairs made of 2 of the
input and output. Meaning I define my model as:

themodel = Model( [ip1,ip2,ip3,ip4,ip5], [op1,op2] ) themodel.compile( optimizer='Adam', loss=['mse','mse'] ) 

My model seems to train correctly, I just couldn’t find
confirmation in the docs (compile
method)
of how Tf takes care of which tensors to apply the loss
to. My assumption is that it applies the first loss to the first
ip-op pair, and so on till output tensors are available and nothing
is applied to remaining input tensors. Is this what is happening?
Also another silly question, does this mean for multiple output
models, loss has to be defined for all the outputs, and is not
related to the number of inputs?

submitted by /u/juggy94

[visit reddit]

[comments]

Categories
Misc

Kubeflow: Cloud Native ML Toolbox


Kubeflow: Cloud Native ML Toolbox
submitted by /u/SoulmanIqbal

[visit reddit]

[comments]
Categories
Offsites

Convolutional LSTM for spatial forecasting

This post is the first in a loose series exploring forecasting
of spatially-determined data over time. By spatially-determined I
mean that whatever the quantities we’re trying to predict – be
they univariate or multivariate time series, of spatial
dimensionality or not – the input data are given on a spatial
grid.

For example, the input could be atmospheric measurements, such
as sea surface temperature or pressure, given at some set of
latitudes and longitudes. The target to be predicted could then
span that same (or another) grid. Alternatively, it could be a
univariate time series, like a meteorological index.

But wait a second, you may be thinking. For time-series
prediction, we have that time-honored set of recurrent
architectures (e.g., LSTM, GRU), right? Right. We do; but, once we
feed spatial data to an RNN, treating different locations as
different input features, we lose an essential structural
relationship. Importantly, we need to operate in both space and
time. We want both: recurrence relations and convolutional filters.
Enter convolutional RNNs.

What to expect from this post

Today, we won’t jump into real-world applications just yet.
Instead, we’ll take our time to build a convolutional LSTM
(henceforth: convLSTM) in torch. For one, we have to – there is
no official PyTorch implementation.

Keras, on the other hand, has one. If you’re interested in
quickly playing around with a Keras convLSTM, check out this
nice example
.

What’s more, this post can serve as an introduction to
building your own modules. This is something you may be familiar
with from Keras or not – depending on whether you’ve used
custom models or rather, preferred the declarative define ->
compile -> fit style. (Yes, I’m implying there’s some
transfer going on if one comes to torch from Keras custom training.
Syntactic and semantic details may be different, but both share the
object-oriented style that allows for great flexibility and
control.)

Last but not least, we’ll also use this as a hands-on
experience with RNN architectures (the LSTM, specifically). While
the general concept of recurrence may be easy to grasp, it is not
necessarily self-evident how those architectures should, or could,
be coded. Personally, I find that independent of the framework
used, RNN-related documentation leaves me confused. What exactly is
being returned from calling an LSTM, or a GRU? (In Keras this
depends on how you’ve defined the layer in question.) I suspect
that once we’ve decided what we want to return, the actual code
won’t be that complicated. Consequently, we’ll take a detour
clarifying what it is that torch and Keras are giving us.
Implementing our convLSTM will be a lot more straightforward
thereafter.

A torch convLSTM

The code discussed here may be found on GitHub. (Depending on
when you’re reading this, the code in that repository may have
evolved though.)

My starting point was one of the PyTorch implementations found
on the net, namely,
this one
. If you search for “PyTorch convGRU” or “PyTorch
convLSTM”, you will find stunning discrepancies in how these are
realized – discrepancies not just in syntax and/or engineering
ambition, but on the semantic level, right at the center of what
the architectures may be expected to do. As they say, let the buyer
beware. (Regarding the implementation I ended up porting, I am
confident that while numerous optimizations will be possible, the
basic mechanism matches my expectations.)

What do I expect? Let’s approach this task in a top-down
way.

Input and output

The convLSTM’s input will be a time series of spatial data,
each observation being of size (time steps, channels, height,
width).

Compare this with the usual RNN input format, be it in torch or
Keras. In both frameworks, RNNs expect tensors of size (timesteps,
input_dim)1. input_dim is
(1) for univariate time series and greater than (1) for
multivariate ones. Conceptually, we may match this to convLSTM’s
channels dimension: There could be a single channel, for
temperature, say – or there could be several, such as for
pressure, temperature, and humidity. The two additional dimensions
found in convLSTM, height and width, are spatial indexes into the
data.

In sum, we want to be able to pass data that:

  • consist of one or more features,

  • evolve in time, and

  • are indexed in two spatial dimensions.

How about the output? We want to be able to return forecasts for
as many time steps as we have in the input sequence. This is
something that torch RNNs do by default, while Keras equivalents do
not. (You have to pass return_sequences = TRUE to obtain that
effect.) If we’re interested in predictions for just a single
point in time, we can always pick the last time step in the output
tensor.

However, with RNNs, it is not all about outputs. RNN
architectures also carry through hidden states.

What are hidden states? I carefully phrased that sentence to be
as general as possible – deliberately circling around the
confusion that, in my view, often arises at this point. We’ll
attempt to clear up some of that confusion in a second, but let’s
first finish our high-level requirements specification.

We want our convLSTM to be usable in different contexts and
applications. Various architectures exist that make use of hidden
states, most prominently perhaps, encoder-decoder architectures.
Thus, we want our convLSTM to return those as well. Again, this is
something a torch LSTM does by default, while in Keras it is
achieved using return_state = TRUE.

Now though, it really is time for that interlude. We’ll sort
out the ways things are called by both torch and Keras, and inspect
what you get back from their respective GRUs and LSTMs.

Interlude: Outputs, states, hidden values … what’s what?

For this to remain an interlude, I summarize findings on a high
level. The code snippets in the appendix show how to arrive at
these results. Heavily commented, they probe return values from
both Keras and torch GRUs and LSTMs. Running these will make the
upcoming summaries seem a lot less abstract.

First, let’s look at the ways you create an LSTM in both
frameworks. (I will generally use LSTM as the “prototypical RNN
example”, and just mention GRUs when there are differences
significant in the context in question.)

In Keras, to create an LSTM you may write something like
this:

lstm <- layer_lstm(units = 1)

The torch equivalent would be:

lstm <- nn_lstm( input_size = 2, # number of input features hidden_size = 1 # number of hidden (and output!) features )

Don’t focus on torch‘s input_size parameter for this
discussion. (It’s the number of features in the input tensor.)
The parallel occurs between Keras’ units and torch’s
hidden_size. If you’ve been using Keras, you’re probably
thinking of units as the thing that determines output size
(equivalently, the number of features in the output). So when torch
lets us arrive at the same result using hidden_size, what does that
mean? It means that somehow we’re specifying the same thing,
using different terminology. And it does make sense, since at every
time step current input and previous hidden state are added2:

[ mathbf{h}_t = mathbf{W}_{x}mathbf{x}_t +
mathbf{W}_{h}mathbf{h}_{t-1} ]

Now, about those hidden states.

When a Keras LSTM is defined with return_state = TRUE, its
return value is a structure of three entities called output, memory
state, and carry state. In torch, the same entities are referred to
as output, hidden state, and cell state. (In torch, we always get
all of them.)

So are we dealing with three different types of entities? We are
not.

The cell, or carry state is that special thing that sets apart
LSTMs from GRUs deemed responsible for the “long” in “long
short-term memory”. Technically, it could be reported to the user
at all points in time; as we’ll see shortly though, it is
not.

What about outputs and hidden, or memory states? Confusingly,
these really are the same thing. Recall that for each input item in
the input sequence, we’re combining it with the previous state,
resulting in a new state, to be made used of in the next
step3:

[ mathbf{h}_t = mathbf{W}_{x}mathbf{x}_t +
mathbf{W}_{h}mathbf{h}_{t-1} ]

Now, say that we’re interested in looking at just the final
time step – that is, the default output of a Keras LSTM. From
that point of view, we can consider those intermediate computations
as “hidden”. Seen like that, output and hidden states feel
different.

However, we can also request to see the outputs for every time
step. If we do so, there is no difference – the
outputs (plural) equal the hidden states. This can
be verified using the code in the appendix.

Thus, of the three things returned by an LSTM, two are really
the same. How about the GRU, then? As there is no “cell state”,
we really have just one type of thing left over – call it outputs
or hidden states.

Let’s summarize this in a table.

Table 1: RNN terminology. Comparing torch-speak and
Keras-speak. In row 1, the terms are parameter names. In rows 2 and
3, they are pulled from current documentation.

Referring to this entity: torch says: Keras says:

Number of features in the output

This determines both how many output features there are and the
dimensionality of the hidden states.

hidden_size units

Per-time-step output; latent state; intermediate state …

This could be named “public state” in the sense that we, the
users, are able to obtain all values.

hidden state memory state

Cell state; inner state … (LSTM only)

This could be named “private state” in that we are able to
obtain a value only for the last time step. More on that in a
second.

cell state carry state

Now, about that public vs.private distinction. In both
frameworks, we can obtain outputs (hidden states) for every time
step. The cell state, however, we can access only for the very last
time step. This is purely an implementation decision. As we’ll
see when building our own recurrent module, there are no obstacles
inherent in keeping track of cell states and passing them back to
the user.

If you dislike the pragmatism of this distinction, you can
always go with the math. When a new cell state has been computed
(based on prior cell state, input, forget, and cell gates – the
specifics of which we are not going to get into here), it is
transformed to the hidden (a.k.a. output) state making use of yet
another, namely, the output gate:

[ h_t = o_t odot tanh(c_t) ]

Definitely, then, hidden state (output, resp.) builds on cell
state, adding additional modeling power.

Now it is time to get back to our original goal and build that
convLSTM. First though, let’s summarize the return values
obtainable from torch and Keras.

Table 2: Contrasting ways of obtaining various return values
in torch vs. Keras. Cf. the appendix for complete examples.

To achieve this goal: in torch do: in Keras do:
access all intermediate outputs ( = per-time-step outputs) ret[[1]] return_sequences = TRUE
access both “hidden state” (output) and “cell state”
from final time step (only!)
ret[[2]] return_state = TRUE
access all intermediate outputs and the final “cell
state”
both of the above return_sequences = TRUE, return_state = TRUE
access all intermediate outputs and “cell states” from all
time steps
no way no way

convLSTM, the plan

In both torch and Keras RNN architectures, single time steps are
processed by corresponding Cell classes: There is an LSTM Cell
matching the LSTM, a GRU Cell matching the GRU, and so on. We do
the same for ConvLSTM. In convlstm_cell(), we first define what
should happen to a single observation; then in convlstm(), we build
up the recurrence logic.

Once we’re done, we create a dummy dataset, as
reduced-to-the-essentials as can be. With more complex datasets,
even artificial ones, chances are that if we don’t see any
training progress, there are hundreds of possible explanations. We
want a sanity check that, if failed, leaves no excuses. Realistic
applications are left to future posts.

A single step: convlstm_cell

Our convlstm_cell’s constructor takes arguments input_dim ,
hidden_dim, and bias, just like a torch LSTM Cell.

But we’re processing two-dimensional input data. Instead of
the usual affine combination of new input and previous state, we
use a convolution of kernel size kernel_size. Inside convlstm_cell,
it is self$conv that takes care of this.

Note how the channels dimension, which in the original input
data would correspond to different variables, is creatively used to
consolidate four convolutions into one: Each channel output will be
passed to just one of the four cell gates. Once in possession of
the convolution output, forward() applies the gate logic, resulting
in the two types of states it needs to send back to the caller.

library(torch) library(zeallot) convlstm_cell <- nn_module( initialize = function(input_dim, hidden_dim, kernel_size, bias) { self$hidden_dim <- hidden_dim padding <- kernel_size %/% 2 self$conv <- nn_conv2d( in_channels = input_dim + self$hidden_dim, # for each of input, forget, output, and cell gates out_channels = 4 * self$hidden_dim, kernel_size = kernel_size, padding = padding, bias = bias ) }, forward = function(x, prev_states) { c(h_prev, c_prev) %<-% prev_states combined <- torch_cat(list(x, h_prev), dim = 2) # concatenate along channel axis combined_conv <- self$conv(combined) c(cc_i, cc_f, cc_o, cc_g) %<-% torch_split(combined_conv, self$hidden_dim, dim = 2) # input, forget, output, and cell gates (corresponding to torch's LSTM) i <- torch_sigmoid(cc_i) f <- torch_sigmoid(cc_f) o <- torch_sigmoid(cc_o) g <- torch_tanh(cc_g) # cell state c_next <- f * c_prev + i * g # hidden state h_next <- o * torch_tanh(c_next) list(h_next, c_next) }, init_hidden = function(batch_size, height, width) { list( torch_zeros(batch_size, self$hidden_dim, height, width, device = self$conv$weight$device), torch_zeros(batch_size, self$hidden_dim, height, width, device = self$conv$weight$device)) } )

Now convlstm_cell has to be called for every time step. This is
done by convlstm.

Iteration over time steps: convlstm

A convlstm may consist of several layers, just like a torch
LSTM. For each layer, we are able to specify hidden and kernel
sizes individually.

During initialization, each layer gets its own convlstm_cell. On
call, convlstm executes two loops. The outer one iterates over
layers. At the end of each iteration, we store the final pair
(hidden state, cell state) for later reporting. The inner loop runs
over input sequences, calling convlstm_cell at each time step.

We also keep track of intermediate outputs, so we’ll be able
to return the complete list of hidden_states seen during the
process. Unlike a torch LSTM, we do this for every layer.

convlstm <- nn_module( # hidden_dims and kernel_sizes are vectors, with one element for each layer in n_layers initialize = function(input_dim, hidden_dims, kernel_sizes, n_layers, bias = TRUE) { self$n_layers <- n_layers self$cell_list <- nn_module_list() for (i in 1:n_layers) { cur_input_dim <- if (i == 1) input_dim else hidden_dims[i - 1] self$cell_list$append(convlstm_cell(cur_input_dim, hidden_dims[i], kernel_sizes[i], bias)) } }, # we always assume batch-first forward = function(x) { c(batch_size, seq_len, num_channels, height, width) %<-% x$size() # initialize hidden states init_hidden <- vector(mode = "list", length = self$n_layers) for (i in 1:self$n_layers) { init_hidden[[i]] <- self$cell_list[[i]]$init_hidden(batch_size, height, width) } # list containing the outputs, of length seq_len, for each layer # this is the same as h, at each step in the sequence layer_output_list <- vector(mode = "list", length = self$n_layers) # list containing the last states (h, c) for each layer layer_state_list <- vector(mode = "list", length = self$n_layers) cur_layer_input <- x hidden_states <- init_hidden # loop over layers for (i in 1:self$n_layers) { # every layer's hidden state starts from 0 (non-stateful) c(h, c) %<-% hidden_states[[i]] # outputs, of length seq_len, for this layer # equivalently, list of h states for each time step output_sequence <- vector(mode = "list", length = seq_len) # loop over time steps for (t in 1:seq_len) { c(h, c) %<-% self$cell_list[[i]](cur_layer_input[ , t, , , ], list(h, c)) # keep track of output (h) for every time step # h has dim (batch_size, hidden_size, height, width) output_sequence[[t]] <- h } # stack hs for all time steps over seq_len dimension # stacked_outputs has dim (batch_size, seq_len, hidden_size, height, width) # same as input to forward (x) stacked_outputs <- torch_stack(output_sequence, dim = 2) # pass the list of outputs (hs) to next layer cur_layer_input <- stacked_outputs # keep track of list of outputs or this layer layer_output_list[[i]] <- stacked_outputs # keep track of last state for this layer layer_state_list[[i]] <- list(h, c) } list(layer_output_list, layer_state_list) } )

Calling the convlstm

Let’s see the input format expected by convlstm, and how to
access its different outputs.

Here is a suitable input tensor.

# batch_size, seq_len, channels, height, width x <- torch_rand(c(2, 4, 3, 16, 16))

First we make use of a single layer.

model <- convlstm(input_dim = 3, hidden_dims = 5, kernel_sizes = 3, n_layers = 1) c(layer_outputs, layer_last_states) %<-% model(x)

We get back a list of length two, which we immediately split up
into the two types of output returned: intermediate outputs from
all layers, and final states (of both types) for the last
layer.

With just a single layer, layer_outputs[[1]]holds all of the
layer’s intermediate outputs, stacked on dimension two.

dim(layer_outputs[[1]]) # [1] 2 4 5 16 16

layer_last_states[[1]]is a list of tensors, the first of which
holds the single layer’s final hidden state, and the second, its
final cell state.

dim(layer_last_states[[1]][[1]]) # [1] 2 5 16 16 dim(layer_last_states[[1]][[2]]) # [1] 2 5 16 16

For comparison, this is how return values look for a multi-layer
architecture.

model <- convlstm(input_dim = 3, hidden_dims = c(5, 5, 1), kernel_sizes = rep(3, 3), n_layers = 3) c(layer_outputs, layer_last_states) %<-% model(x) # for each layer, tensor of size (batch_size, seq_len, hidden_size, height, width) dim(layer_outputs[[1]]) # 2 4 5 16 16 dim(layer_outputs[[3]]) # 2 4 1 16 16 # list of 2 tensors for each layer str(layer_last_states) # List of 3 # $ :List of 2 # ..$ :Float [1:2, 1:5, 1:16, 1:16] # ..$ :Float [1:2, 1:5, 1:16, 1:16] # $ :List of 2 # ..$ :Float [1:2, 1:5, 1:16, 1:16] # ..$ :Float [1:2, 1:5, 1:16, 1:16] # $ :List of 2 # ..$ :Float [1:2, 1:1, 1:16, 1:16] # ..$ :Float [1:2, 1:1, 1:16, 1:16] # h, of size (batch_size, hidden_size, height, width) dim(layer_last_states[[3]][[1]]) # 2 1 16 16 # c, of size (batch_size, hidden_size, height, width) dim(layer_last_states[[3]][[2]]) # 2 1 16 16

Now we want to sanity-check this module with the
simplest-possible dummy data.

Sanity-checking the convlstm

We generate black-and-white “movies” of diagonal beams
successively translated in space.

Each sequence consists of six time steps, and each beam of six
pixels. Just a single sequence is created manually. To create that
one sequence, we start from a single beam:

library(torchvision) beams <- vector(mode = "list", length = 6) beam <- torch_eye(6) %>% nnf_pad(c(6, 12, 12, 6)) # left, right, top, bottom beams[[1]] <- beam

Using torch_roll() , we create a pattern where this beam moves
up diagonally, and stack the individual tensors along the timesteps
dimension.

for (i in 2:6) { beams[[i]] <- torch_roll(beam, c(-(i-1),i-1), c(1, 2)) } init_sequence <- torch_stack(beams, dim = 1)

That’s a single sequence. Thanks to
torchvision::transform_random_affine(), we almost effortlessly
produce a dataset of a hundred sequences. Moving beams start at
random points in the spatial frame, but they all share that
upward-diagonal motion.

sequences <- vector(mode = "list", length = 100) sequences[[1]] <- init_sequence for (i in 2:100) { sequences[[i]] <- transform_random_affine(init_sequence, degrees = 0, translate = c(0.5, 0.5)) } input <- torch_stack(sequences, dim = 1) # add channels dimension input <- input$unsqueeze(3) dim(input) # [1] 100 6 1 24 24

That’s it for the raw data. Now we still need a dataset and a
dataloader. Of the six time steps, we use the first five as input
and try to predict the last one.

dummy_ds <- dataset( initialize = function(data) { self$data <- data }, .getitem = function(i) { list(x = self$data[i, 1:5, ..], y = self$data[i, 6, ..]) }, .length = function() { nrow(self$data) } ) ds <- dummy_ds(input) dl <- dataloader(ds, batch_size = 100)

Here is a tiny-ish convLSTM, trained for motion prediction:

model <- convlstm(input_dim = 1, hidden_dims = c(64, 1), kernel_sizes = c(3, 3), n_layers = 2) optimizer <- optim_adam(model$parameters) num_epochs <- 100 for (epoch in 1:num_epochs) { model$train() batch_losses <- c() for (b in enumerate(dl)) { optimizer$zero_grad() # last-time-step output from last layer preds <- model(b$x)[[2]][[2]][[1]] loss <- nnf_mse_loss(preds, b$y) batch_losses <- c(batch_losses, loss$item()) loss$backward() optimizer$step() } if (epoch %% 10 == 0) cat(sprintf("nEpoch %d, training loss:%3fn", epoch, mean(batch_losses))) }
Epoch 10, training loss:0.008522 Epoch 20, training loss:0.008079 Epoch 30, training loss:0.006187 Epoch 40, training loss:0.003828 Epoch 50, training loss:0.002322 Epoch 60, training loss:0.001594 Epoch 70, training loss:0.001376 Epoch 80, training loss:0.001258 Epoch 90, training loss:0.001218 Epoch 100, training loss:0.001171

Loss decreases, but that in itself is not a guarantee the model
has learned anything. Has it? Let’s inspect its forecast for the
very first sequence and see.

For printing, I’m zooming in on the relevant region in the
24×24-pixel frame. Here is the ground truth for time step six:

b$y[1, 1, 6:15, 10:19]
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

And here is the forecast. This does not look bad at all, given
there was neither experimentation nor tuning involved.

round(as.matrix(preds[1, 1, 6:15, 10:19]), 2)
 [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [1,] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0 [2,] -0.02 0.36 0.01 0.06 0.00 0.00 0.00 0.00 0.00 0 [3,] 0.00 -0.01 0.71 0.01 0.06 0.00 0.00 0.00 0.00 0 [4,] -0.01 0.04 0.00 0.75 0.01 0.06 0.00 0.00 0.00 0 [5,] 0.00 -0.01 -0.01 -0.01 0.75 0.01 0.06 0.00 0.00 0 [6,] 0.00 0.01 0.00 -0.07 -0.01 0.75 0.01 0.06 0.00 0 [7,] 0.00 0.01 -0.01 -0.01 -0.07 -0.01 0.75 0.01 0.06 0 [8,] 0.00 0.00 0.01 0.00 0.00 -0.01 0.00 0.71 0.00 0 [9,] 0.00 0.00 0.00 0.01 0.01 0.00 0.03 -0.01 0.37 0 [10,] 0.00 0.00 0.00 0.00 0.00 0.00 -0.01 -0.01 -0.01 0

This should suffice for a sanity check. If you made it till the
end, thanks for your patience! In the best case, you’ll be able
to apply this architecture (or a..