DataBloom - Part 451

Misc

Achieving FP32 Accuracy for INT8 Inference Using Quantization Aware Training with NVIDIA TensorRT

Post author By
Post date July 20, 2021
No Comments on Achieving FP32 Accuracy for INT8 Inference Using Quantization Aware Training with NVIDIA TensorRT

○ TensorRT is an SDK for high-performance deep learning inference and with TensorRT 8.0, you can import models trained using Quantization Aware Training (QAT) to run inference in INT8 precision without losing FP32 accuracy. QAT significantly reduces compute required and storage overhead for efficient inference.

Deep learning is revolutionizing the way that industries are delivering products and services. These services include object detection, classification, and segmentation for computer vision, and text extraction, classification, and summarization for language-based applications. These applications must run in real time.

Most of the models are trained in floating-point 32-bit arithmetic to take advantage of a wider dynamic range. However, at inference, these models may take a longer time to predict results compared to reduced precision inference, causing some delay in the real-time responses, and affecting the user experience.

It’s better in many cases to use reduced precision or 8-bit integer numbers. The challenge is that simply rounding the weights after training may result in a lower accuracy model, especially if the weights have a wide dynamic range. This post provides a simple introduction to quantization-aware training (QAT), and how to implement fake-quantization during training, and perform inference with NVIDIA TensorRT 8.0.

Overview

Model quantization is a popular deep learning optimization method in which model data—both network parameters and activations—are converted from a floating-point representation to a lower-precision representation, typically using 8-bit integers. This has several benefits:

When processing 8-bit integer data, NVIDIA GPUs employ the faster and cheaper 8-bit Tensor Cores to compute convolution and matrix-multiplication operations. This yields more compute throughput, which is particularly effective on compute-limited layers.
Moving data from memory to computing elements (streaming multiprocessors in NVIDIA GPUs) takes time and energy, and also produces heat. Reducing the precision of activation and parameter data from 32-bit floats to 8-bit integers results in 4x data reduction, which saves power and reduces the produced heat.
Some layers are bandwidth-bound (memory-limited). That means that their implementation spends most of its time reading and writing data, and therefore reducing their computation time does not reduce their overall runtime. Bandwidth-bound layers benefit most from reduced bandwidth requirements.
A reduced memory footprint means that the model requires less storage space, parameter updates are smaller, cache utilization is higher, and so on.

Quantization methods

Quantization has many benefits but the reduction in the precision of the parameters and data can easily hurt a model’s task accuracy. Consider that 32-bit floating-point can represent roughly 4 billion numbers in the interval [-3.4e38, 3.40e38]. This interval of representable numbers is also known as the dynamic-range. The distance between two neighboring representable numbers is the precision of the representation.

Floating-point numbers are distributed nonuniformly in the dynamic range and about half of the representable floating-point numbers are in the interval [-1,1]. In other words, representable numbers in the [-1, 1] interval would have higher precision than numbers in [1, 2]. The high density of representable 32-bit floating-point numbers in [-1, 1] is helpful in deep learning models where parameters and data have most of their distribution mass around zero.

Using an 8-bit integer representation, however, you can represent only 2⁸ different values. These 256 values can be distributed uniformly or nonuniformly, for example, for higher precision around zero. All mainstream, deep-learning hardware and software chooses to use a uniform representation because it enables computing using high-throughput parallel or vectorized integer math pipelines.

To convert the representation of a floating-point tensor ( $x_{f}$ ) to an 8-bit representation ( $x_{q}$ ), a scale-factor is used to map the floating-point tensor’s dynamic-range to [-128, 127]:

$x_{q} = Clip(Round(x_{f}/scale))$

This is symmetric quantization because the dynamic-range is symmetric about the origin. $Round$ is a function that applies some rounding-policy to round rational numbers to integers; and $Clip$ is a function that clips outliers that fall outside the [-128, 127] interval. TensorRT uses symmetric quantization to represent both activation data and model weights.

At the top of Figure 1 is a diagram of an arbitrary floating-point tensor $x_{f}$ , depicted as a histogram of the distribution of its elements. We chose a symmetric range of coefficients to represent in the quantized tensor: [ $-amax$ , $amax$ ]. Here, $amax$ is the element with the largest absolute value to represent. To compute the quantization scale, divide the float-point dynamic-range into 256 equal parts:

$amax = max(abs(x_{f}))$

$scale = (2 * amax) / 256$

The method shown here to compute the scale uses the full range that you can represent with signed 8-bit integers: [-128, 127]. TensorRT Explicit Precision (Q/DQ) networks use this range when quantizing weights and activations.

There is tension between the dynamic range chosen to represent using 8-bit integers and the error introduced by the rounding operation. A larger dynamic range means that more values from the original floating-point tensor get represented in the quantized tensor, but it also means using a lower precision and introducing a larger rounding error.

Choosing a smaller dynamic range reduces the rounding error but introduce a clipping error. Floating-point values that are outside the dynamic range are clipped to the min/max value of the dynamic range.

The diagram shows the range of the floating point integer on the top and its mapping to 8 bit signed integer between -128 to 127. — *Figure 1. 8-bit signed integer quantization of a floating-point tensor $x_{f}$* . *The symmetric dynamic range of* $x_{f} [-amax, amax]$ *is mapped through quantization to* [-128, 127].

*Figure 1. 8-bit signed integer quantization of a floating-point tensor $x_{f}$* . *The symmetric dynamic range of* $x_{f} [-amax, amax]$ *is mapped through quantization to* [-128, 127].

To address the effects of the loss of precision on the task accuracy, various quantization techniques have been developed. These techniques can be classified as belonging to one of two categories: post-training quantization (PTQ) or quantization-aware training (QAT).

As the name suggests, PTQ is performed after a high-precision model has been trained. With PTQ, quantizing the weights is easy. You have access to the weight tensors and can measure their distributions. Quantizing the activations is more challenging because the activation distributions must be measured using real input data.

To do this, the trained floating-point model is evaluated using a small dataset representative of the task’s real input data, and statistics about the interlayer activation distributions are collected. As a final step, the quantization scales of the model’s activation tensors are determined using one of several optimization objectives. This process is calibration and the representative dataset used is the calibration-dataset.

Sometimes PTQ is not able to achieve acceptable task accuracy. This is when you might consider using QAT. The idea behind QAT is simple: you can improve the accuracy of quantized models if you include the quantization error in the training phase. It enables the network to adapt to the quantized weights and activations.

There are various recipes to perform QAT, from starting with an untrained model to starting with a pretrained model. All recipes change the training regimen to include the quantization error in the training loss by inserting fake-quantization operations into the training graph to simulate the quantization of data and parameters. These operations are called ‘fake’ because they quantize the data, but then immediately dequantize the data so the operation’s compute remains in float-point precision. This trick adds quantization noise without changing much in the deep-learning framework.

In the forward-pass, you fake-quantize the floating-point weights and activations and use these fake-quantized weights and activations to perform the layer’s operation. In the backward pass, you use the weights’ gradients to update the floating-point weights. To deal with the quantization gradient, which is zero almost everywhere except for points where it is undefined, you use the (straight-through estimator (STE), which passes the gradient as-is through the fake-quantization operator. When the QAT process is done, the fake-quantization layers hold the quantization scales that you use to quantize the weights and activations that the model is used for inference.

Fake Quantization operators are added in the forward to introduce training loss which get optimized and reduced in the backward pass. — *Figure 2. QAT fake-quantization operators in the training forward-pass* (left) *and backward-pass* (right)

PTQ is the more popular method of the two because it is simple and doesn’t involve the training pipeline, which also makes it the faster method. However, QAT almost always produces better accuracy, and sometimes this is the only acceptable method.

Quantization in TensorRT

TensorRT 8.0 supports INT8 models using two different processing modes. The first processing mode uses the TensorRT tensor dynamic-range API and also uses INT8 precision (8-bit signed integer) compute and data opportunistically to optimize inference latency.

The workflow on the left shows the steps that Post Training Quantization requires, and on the right the workflow for Quantization Aware Training is shown which is much simpler. — *Figure 3. TensorRT PTQ workflow* (left) *vs. TensorRT INT8 quantization using quantization scales derived from the configured tensors dynamic-range* (right)

This mode is used when TensorRT performs the full PTQ calibration recipe and when TensorRT uses preconfigured tensor dynamic-ranges (Figure 3). The other TensorRT INT8 processing mode is used when processing floating-point ONNX networks with QuantizeLayer/DequantizeLayer layers and follows explicit quantization rules. For more information about the differences, see Explicit-Quantization vs. PTQ-Processing in the TensorRT Developer Guide.

TensorRT Quantization Toolkit

The TensorRT Quantization Toolkit for PyTorch compliments TensorRT by providing a convenient PyTorch library that helps produce optimizable QAT models. The toolkit provides an API to automatically or manually prepare a model for QAT or PTQ.

At the core of the API is the TensorQuantizer module, which can quantize, fake-quantize, or collect statistics on a tensor. It is used together with QuantDescriptor, which describes how a tensor should be quantized. Layered on top of TensorQuantizer are quantized modules that are designed as drop-in replacements of PyTorch’s full-precision modules. These are convenience modules that use TensorQuantizer to fake-quantize or collect statistics on a module’s weights and inputs.

The API supports the automatic conversion of PyTorch modules to their quantized versions. Conversion can also be done manually using the API, which allows for partial quantization in cases where you don’t want to quantize all modules. For example, some layers may be more sensitive to quantization and leaving them unquantized improves task accuracy.

This includes tools for both PTQ as well as QAT such as Calibrator, TensorQuantizer, QuantDescriptor. — *Figure 4. TensorRT Quantization Toolkit components*

The TensorRT-specific recipe for QAT is described in detail in NVIDIA Quantization whitepaper, which includes a more rigorous discussion of the quantization methods and results from experiments comparing QAT and PTQ on various learning tasks.

Code example walkthrough

This section describes the classification-task quantization example included with the toolkit.

The recommended toolkit recipe for QAT calls for starting with a pretrained model, as it’s been shown that starting from a pretrained model and fine-tuning leads to better accuracy and requires significantly fewer iterations. In this case, you load a pretrained ResNet50 model. The command-line arguments for running the example from the bash shell:

python3 classification_flow.py --data-dir [path to ImageNet DS] --out-dir . --num-finetune-epochs 1 --evaluate-onnx --pretrained --calibrator=histogram --model resnet50_res

The --data-dir argument points to the ImageNet (ILSVRC2012) dataset, which you must download separately. The --calibrator=histogram argument specifies that the model should be calibrated, using the histogram calibrator, before fine-tuning the model. The rest of the arguments, and many more, are documented in the example.

The ResNet50 model is originally from Facebook’s Torchvision package, but because it includes some important changes (quantization of skip-connections), the network definition is included with the toolkit (resnet50_res). For more information, see Q/DQ Layer-Placement Recommendations.

Here’s a brief overview of the code. For more information, see Quantizing ResNet50.

 # Prepare the pretrained model and data loaders
 model, data_loader_train, data_loader_test, data_loader_onnx = prepare_model(
         args.model_name,
         args.data_dir,
         not args.disable_pcq,
         args.batch_size_train,
         args.batch_size_test,
         args.batch_size_onnx,
         args.calibrator,
         args.pretrained,
         args.ckpt_path,
         args.ckpt_url)

The function prepare_model instantiates the data loaders and model as usual, but it also configures the quantization descriptors. Here’s an example:

 # Initialize quantization
 if per_channel_quantization:
         quant_desc_input = QuantDescriptor(calib_method=calibrator)
 else:
         quant_desc_input = QuantDescriptor(calib_method=calibrator, axis=None)
 quant_nn.QuantConv2d.set_default_quant_desc_input(quant_desc_input)
 quant_nn.QuantConvTranspose2d.set_default_quant_desc_input(quant_desc_input)
 quant_nn.QuantLinear.set_default_quant_desc_input(quant_desc_input)
 quant_desc_weight = QuantDescriptor(calib_method=calibrator, axis=None)
 quant_nn.QuantConv2d.set_default_quant_desc_weight(quant_desc_weight)
 quant_nn.QuantConvTranspose2d.set_default_quant_desc_weight(quant_desc_weight)
 quant_nn.QuantLinear.set_default_quant_desc_weight(quant_desc_weight)

Instances of QuantDescriptor describe how to calibrate and quantize tensors by configuring the calibration method and axis of quantization. For each quantized operation (such as quant_nn.QuantConv2d), you configure the activations and weights in QuantDescriptor separately because they use different fake-quantization nodes.

You then add fake-quantization nodes in the training graph. The following code (quant_modules.initialize) dynamically patches PyTorch code behind the scenes so that some of the torch.nn.module subclasses are replaced by their quantized counterparts, instantiates the model’s modules, and then reverts the dynamic patch (quant_modules.deactivate). For example, torch.nn.conv2d is replaced by pytorch_quantization.nn.QuantConv2d, which performs fake-quantization before performing the 2D convolution. The method quant_modules.initialize should be invoked before model instantiation.

quant_modules.initialize()
model = torchvision.models.__dict__[model_name](pretrained=pretrained)
quant_modules.deactivate()

Next, you collect statistics (collect_stats) on the calibration data: feed calibration data to the model and collect activation distribution statistics in the form of a histogram for each layer to quantize. After you’ve collected the histogram data, calibrate the scales (calibrate_model) using one or more calibration algorithms (compute_amax).

During calibration, try to determine the quantization scale of each layer, so that it optimizes some objective, such as the model accuracy. There are currently two calibrator classes:

pytorch_quantization.calib.histogram—Uses entropy minimization (KLD), mean-square error minimization (MSE), or a percentile metric method (choose the dynamic-range such that a specified percentage of the distribution is represented).
pytorch_quantization.calib.max—Calibrates using the maximum activation value (represents the entire dynamic range of the floating point data).

To determine the quality of the calibration method afterward, evaluate the model accuracy on your dataset. The toolkit makes it easy to compare the results of the four different calibration methods to discover the best method for a specific model. The toolkit can be extended with proprietary calibration algorithms. For more information, see the ResNet50 example notebook.

If the model’s accuracy is satisfactory, you don’t have to proceed with QAT. You can export to ONNX and be done. That would be the PTQ recipe. TensorRT is given the ONNX model that has Q/DQ operators with quantization scales, and it optimizes the model for inference. So, this is a PTQ workflow that results in a Q/DQ ONNX model.

To continue to the QAT phase, choose the best calibrated, quantized model. Use QAT to fine-tune for around 10% of the original training schedule with an annealing learning-rate schedule, and finally export to ONNX. For more information, see the Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation whitepaper.

There are a couple of things to keep in mind when exporting to ONNX:

Per-channel quantization (PCQ) was introduced in ONNX opset 13, so if you are using PCQ as recommended, mind the opset version that you are using.
The argument do_constant_folding should be set to True to produce smaller models that are more readable.

torch.onnx.export(model, dummy_input, onnx_filename, verbose=False, opset_version=opset_version, do_constant_folding=True)

When the model is finally exported to ONNX, the fake-quantization nodes are exported to ONNX as two separate ONNX operators: QuantizeLinear and DequantizeLinear (shown in Figure 5 as Q and DQ).

The fake quantization nodes added in the training framework which is PyTorch in this case gets converted and exported to ONNX as two separate ONNX Operators as QuantizeLinear and DequantizeLinear. — *Figure 5. Fake-quantization operators are converted to* Q/DQ ONNX operators when the PyTorch model is exported to ONNX

QAT inference phase

At a high level, TensorRT processes ONNX models with Q/DQ operators similarly to how TensorRT processes any other ONNX model:

TensorRT imports an ONNX model containing Q/DQ operations.
It performs a set of optimizations that are dedicated to Q/DQ processing.
It continues to perform the general optimization passes.
It builds a platform-specific, execution-plan file for inference execution. This plan file contains quantized operations and weights.

Building Q/DQ networks in TensorRT does not require any special builder configuration, aside from enabling INT8, because it is automatically enabled when Q/DQ layers are detected in the network. The minimal command to build a Q/DQ network using the TensorRT sample application trtexec is as follows:

$ trtexec -int8

TensorRT optimizes Q/DQ networks using a special mode referred to as explicit quantization, which is motivated by the requirements for network processing-predictability and control over the arithmetic precision used for network operation. Processing-predictability is the promise to maintain the arithmetic precision of the original model. The idea is that Q/DQ layers specify where precision transitions must happen and that all optimizations must preserve the arithmetic semantics of the original ONNX model.

Contrasting TensorRT Q/DQ processing and plain TensorRT INT8 processing helps explain this better. In plain TensorRT, INT8 network tensors are assigned quantization scales, using the dynamic range API or through a calibration process. TensorRT treats the model as a floating-point model when applying the backend optimizations and uses INT8 as another tool to optimize layer execution time. If a layer runs faster in INT8, then it is configured to use INT8. Otherwise, FP32 or FP16 is used, whichever is faster. In this mode, TensorRT is optimizing for latency only, and you have little control over which operations are quantized.

In contrast, in explicit quantization, Q/DQ layers specify where precision transitions must happen. The optimizer is not allowed to perform precision-conversions not dictated by the network. This is true even if such conversions increase layer precision (for example, choosing an FP16 implementation over an INT8 implementation) and even if such conversion results in a plan file that executes faster (for example, preferring INT8 over FP16 on V100 where INT8 is not accelerated by Tensor Cores).

In explicit quantization, you have full control over precision transitions and the quantization is predictable. TensorRT still optimizes for performance but under the constraint of maintaining the original model’s arithmetic precision. Using the dynamic-range API on Q/DQ networks is not supported.

The explicit quantization optimization passes operate in three phases:

First, the optimizer tries to maximize the model’s INT8 data and compute using Q/DQ layer propagation. Q/DQ propagation is a set of rules specifying how Q/DQ layers can migrate in the network. For example, QuantizeLayer can migrate toward the beginning of the network by swapping places with a ReLU activation layer. By doing so, the input and output activations of the ReLU layer are reduced to INT8 precision and the bandwidth requirement is reduced by 4x.
Then, the optimizer fuses layers to create quantized operations that operate on INT8 inputs and use INT8 math pipelines. For example, QuantizeLayer can fuse with ConvolutionLayer.
Finally, the TensorRT auto-tuner optimizer searches for the fastest implementation of each layer that also respects the layer’s specified input and output precision.

For more information about the main explicit quantization optimizations that TensorRT performs, see the TensorRT Developer Guide.

The plan file created from building a TensorRT Q/DQ network contains quantized weights and operations and is ready to deploy. EfficientNet is one of the networks that requires QAT to maintain accuracy. The following chart compares PTQ to QAT.

The graphs shows the comparison for Quantized EfficientNet-B0 INT8 model, for which the PTQ accuracy is almost 30%, whereas QAT accuracy is similar to FP32 model. — *Figure 6. Accuracy comparison for EfficientNet-B0 on PTQ and QAT*

For more information, see the EfficientNet Quantization example on NVIDIA DeepLearningExamples.

Conclusion

In this post, we briefly introduced basic quantization concepts and TensorRT’s quantization toolkit and then reviewed how TensorRT 8.0 processes Q/DQ networks. We did a quick walkthrough of the ResNet50 QAT example provided with the Quantization Toolkit.

ResNet50 can be quantized using PTQ and doesn’t require QAT. EfficientNet, however, requires QAT to maintain accuracy. The EfficientNet B0 baseline floating-point Top1 accuracy is 77.4, while its PTQ Top1 accuracy is 33.9 and its QAT Top1 accuracy is 76.8.

For more information, see the GTC 2021 session, Quantization Aware Training in PyTorch with TensorRT 8.0.

Misc

Using Neural Networks for Your Recommender System

Post author By
Post date July 20, 2021
No Comments on Using Neural Networks for Your Recommender System

This post is an introduction to deep learning-based recommender systems. It highlights the benefits of using neural networks and explains the different components. Neural network architectures are covered from the basic matrix factorization with deep learning up to session-based models.

Deep learning (DL) is the state-of-the-art solution for many machine learning problems, such as computer vision or natural language problems and it outperforms alternative methods. Recent trends include applying DL techniques to recommendation engines. Many large companies—such as AirBnB, Facebook, Google, Home Depot, LinkedIn, and Pinterest—share their experience in using DL for recommender systems.

Recently, NVIDIA and the RAPIDS.AI team won three competitions with DL: the ACM RecSys2021 Challenge, SIGIR eCom Data Challenge, and ACM WSDM2021 Booking.com Challenge.

The field of recommender systems is complex. In this post, I focus on the neural network architecture and its components, such as embedding and fully connected layers, recurrent neural network cells (LSTM or GRU), and transformer blocks. I discuss popular network architectures, such as Google’s Wide & Deep and Facebook’s Deep Learning Recommender Model (DLRM).

Benefits of DL recommender systems

There are many different techniques to design a recommender system, such as association rules, content-based or collaborative filtering, matrix factorization, or training a linear or tree-based model to predict the interaction likelihood.

What are the advantages of using neural networks? In general, DL models achieve higher accuracy. First, DL can leverage additional data. Many traditional machine-learning techniques plateau with more data. However, when you increase the capacity of neural networks, the model can improve performance with more data.

Second, neural networks are flexible in their design. For example, you can train a DL model on multiple objectives (multitask learning), such as “will the user add the item to the cart?”, “start the checkout with the item?”, or “purchase the item?” Each goal helps the model to extract information from the data and the goals can support each other.

Other design approaches include adding multimodal data to the recommender model. You can do this by processing product images with a convolutional neural network or product description with an NLP model. Neural networks are used in many domains. You can transfer new developments, such as optimizers or new layers, to recommender systems.

Finally, DL frameworks are highly optimized to process terabytes to petabytes of data for all kinds of domains. Here’s how you can design neural networks for recommender systems.

Basic building block: Embedding layers

Embedding layers represent categories with dense vectors. This technique is popular in NLP to embed words with dense representation. Words with similar meaning have a similar embedding vector.

You can apply the same technique to recommender systems. The most trivial recommender system is based on users and items: Which items should you recommend to a user? You have user IDs and item IDs. The words are the users and items, so you use two embedding tables (Figure 1).

Embedding tables are a dense representation of sparse categories. Each category is represented by a vector. — *Figure 1. Embedding tables with dimensionality 4*

Calculate the dot-product between the user embedding and the item embedding to get a final score, the likelihood that a user interacts with an item. You may apply the sigmoid activation function as a last step to transform the output to a probability between 0 and 1.

$dot product: u cdot v = Sigma a_i cdot b_i$

User ID and item ID are embedded and the dot product is applied to the embedding vectors. — *Figure 2. Neural network with two embedding tables and dot product output*

This method is equivalent to matrix factorization or alternating least squares (ALS).

Deeper models with fully connected layers

The performance of neural networks is based on deep architectures with multiple, nonlinear layers. You can extend the previous model by feeding the output of your embedding layers through multiple, fully connected layers with ReLU activations.

A design choice is how to combine the two embedding vectors. Either you can concatenate only the embedding vectors or you can multiply the vectors element-wise with each other, similar to a dot product. The output is followed by multiple hidden layers.

Combine the user and item embeddings with concatenating or element-wise multiplication. Multiple hidden layers can process the resulting vector. — *Figure 3. Neural network with two embedding tables and multiple, fully connected layers*

Adding metadata information to the neural network

So far, you’ve used only the user ID and product ID as an input, but you often have more information available. Additional user information could be gender, age, city (address), time since last visit, or credit card used for payment. An item usually has a brand, price, categories, or quantity sold in the last 7 days. This side information can help the model to generalize better. Modify the neural network to use the additional features as input.

Add more information to the neural network architecture. You can add side information, such as city, age, branch, category, and price. — *Figure 4. Neural network with meta information and multiple fully connected layers*

Popular architectures

Embedding layers and fully connected layers are the main components to understand some of the latest published neural network architectures. In this post, I cover Google’s Wide and Deep from 2016 and Facebook’s DLRM from 2019.

Google’s Wide and Deep

Google’s Wide and Deep contains two components:

A wide tower to memorize common feature combinations
A deep tower to generalize rare or unobserved feature combinations

The innovation is that both components are trained simultaneously, which is possible as neural networks are flexible. The deep tower feeds categorical features through embedding layers and concatenates the output with numerical input features. The concatenated vector is fed through multiple fully connected layers.

Does that sound familiar to you? Yes, that is your previous neural network design. The new component is the wide tower, which is just a linear combination of the input features, with a similar linear/logistic regression. The output of each tower is summed for the final prediction value.

Facebook’s DLRM

Facebook’s DLRM has a similar structure to the neural network architecture with metadata but has some specific differences. The dataset can contain multiple categorical features. DLRM requires that all categorical inputs are fed through an embedding layer with the same dimensionality. Later, I discuss why this is important.

Next, the continuous inputs are concatenated and fed through multiple, fully connected layers, called bottom multilayer perceptron (MLP). The final layer of the bottom MLP has the same dimensionality as the embedding layer vectors.

DLRM uses a new combination layer. It applies element-wise multiplication between all pairs of embedding vectors and bottom MLP output. That is the reason each vector has the same dimensionality. The resulting vectors are concatenated and fed through another set of fully connected layers (top MLP).

Wide and Deep architecture is visualized on the left and DLRM architecture is on the right. — *Figure 5. Wide and Deep architecture is visualized left and DLRM architecture is on the right.*

Session-based recommender systems

When I analyzed different DL-based architectures for recommender systems, I assumed that the input has a tabular data structure and ignored the nature of user interactions. However, a user has multiple interactions in one session when they visit the website. For example, they visit a shop and view multiple product pages. Can you use the sequence of user interactions as an input to extract patterns?

In one session, the user views multiple pairs of jeans in a row and you should recommend another pair of jeans. In another session, the same user views multiple pairs of shoes in a row and you should recommend another pair of shoes. That is the intuition behind session-based recommender systems.

Thankfully, you can apply some techniques from NLP to the recommender system domain. The user’s interactions have a sequential structure.

User behavior is often a sequence of actions, which you can process with a neural network, such as an RNN or Transformer layer. Add the hidden representation of the sequence to the neural network model. — *Figure 6. Session-based neural network architecture*

The sequence can be processed by using either a recurrent neural network (RNN) or transformer-based architecture as the Sequence Layer. You represent the item IDs with embedding vectors and feed the output through your sequence layer. The hidden representation of the sequence layer can be added as an input to your deep learning architecture.

Other options

As I focused this post on the theory of applying DL to recommender systems, I didn’t cover many other challenges. I briefly describe them here to provide a starting point:

Embedding tables can exceed CPU or GPU memory. As an online service can have millions of users, the embedding table can reach up to multiple terabytes. NVIDIA provides the HugeCTR framework to scale embedding tables beyond CPU or GPU memory.
Maximize GPU utilization during training. DL-based recommender systems have a shallow network architecture with only a few, fully connected layers. The data loader is sometimes the bottleneck in training pipelines. To counteract this, NVIDIA developed a highly optimized GPU data loader for TensorFlow and PyTorch.
Generating recommendations requires you to score user-item pairs. The worst-case scenario is to predict, for all available products, the probability and select the top ones. In practice, that isn’t feasible and candidates are generated with a low-overhead model, such as approximate nearest neighbors.

Summary

This post introduced you to DL-based recommender systems. I started with basic matrix factorization based on two inputs and went over the latest session-based architecture using transformer layers.

You can process the sequence by using either a recurrent neural network (RNN) or transformer-based architecture as the sequence layer. Represent the item IDs with embedding vectors and feed the output through the sequence layer. Add the hidden representation of the sequence layer as an input to your DL architecture.

Interested in learning more about recommender systems? NVIDIA Merlin is an open-source framework to accelerate recommender systems end-to-end on the GPU. NVIDIA continuously develop more resources to train and deploy DL-based recommender systems easily. Here are some resources to help:

Examples in the NVIDIA/NVTabular GitHub repo
Deep Learning Recommender Summit on July 29, where NVIDIA and guest speakers will talk about their experience in deploying recommender systems and overcoming challenges

Misc

Accelerating Inference with Sparsity Using the NVIDIA Ampere Architecture and NVIDIA TensorRT

Post author By
Post date July 20, 2021
No Comments on Accelerating Inference with Sparsity Using the NVIDIA Ampere Architecture and NVIDIA TensorRT

○ TensorRT is an SDK for high-performance deep learning inference, and TensorRT 8.0 introduces support for sparsity that uses sparse tensor cores on NVIDIA Ampere GPUs. It can accelerate networks by reducing the computation of zeros present in GEMM operations in neural networks. You get a performance gain compared to dense networks by just following the steps in this post.

This post was updated July 20, 2021 to reflect NVIDIA TensorRT 8.0 updates.

When deploying a neural network, it’s useful to think about how the network could be made to run faster or take less space. A more efficient network can make better predictions in a limited time budget, react more quickly to unexpected input, or fit into constrained deployment environments.

Sparsity is one optimization technique that holds the promise of meeting these goals. If there are zeros in the network, then you don’t need to store or operate on them. The benefits of sparsity only seem straightforward. There have long been three challenges to realizing the promised gains.

Acceleration—Fine-grained, unstructured, weight sparsity lacks structure and cannot use the vector and matrix instructions available in efficient hardware to accelerate common network operations. Standard sparse formats are inefficient for all but high sparsities.
Accuracy—To achieve a useful speedup with fine-grained, unstructured sparsity, the network must be made sparse, which often causes accuracy loss. Alternate pruning methods that attempt to make acceleration easier, such as coarse-grained pruning that removes blocks of weights, channels, or entire layers, can run into accuracy trouble even sooner. This limits the potential performance benefit.
Workflow—Much of the current research in network pruning serves as useful existence proofs. It has been shown that network A can achieve Sparsity X. The trouble comes when you try to apply Sparsity X to network B. It may not work due to differences in the network, task, optimizer, or any hyperparameter.

In this post, we discuss how the NVIDIA Ampere Architecture addresses these challenges. Today, NVIDIA is releasing TensorRT version 8.0, which introduces support for the Sparse Tensor Cores available on the NVIDIA Ampere Architecture GPUs.

TensorRT is an SDK for high-performance deep learning inference, which includes an optimizer and runtime that minimizes latency and maximizes throughput in production. Using a simple training workflow and deploying with TensorRT 8.0, Sparse Tensor Cores can eliminate unnecessary calculations in neural networks, resulting in over 30% performance/watt gain compared to dense networks.

Sparse Tensor Cores accelerate 2:4 fine-grained structured sparsity

The NVIDIA A100 GPU adds support for fine-grained structured sparsity to its Tensor Cores. Sparse Tensor Cores accelerate a 2:4 sparsity pattern. In each contiguous block of four values, two values must be zero. This naturally leads to a sparsity of 50%, which is fine-grained. There are no vector or block structures pruned together. Such a regular pattern is easy to compress and has a low metadata overhead (Figure 1).

A matrix with 50% empty (zero-valued) locations, on the left, is compressed to half its original size with some metadata to indicate the positions of nonzero elements, on the right. — *Figure 1. A 2:4 structured sparse matrix W, and its compressed representation*

Sparse Tensor Cores accelerate this format by operating only on the nonzero values in the compressed matrix. They use the metadata that is stored with the nonzeros to pull only the necessary values from the other, uncompressed operand. So, for a sparsity of 2x, they can complete the same effective calculation in half the time. Table 1 shows details on the wide variety of data types supported by Sparse Tensor Cores.

Input Operands	Accumulator	Dense TOPS	vs. FFMA	Sparse TOPS	vs. FFMA
FP32	FP32	19.5	–	–	–
TF32	FP32	156	8X	312	16X
FP16	FP32	312	16X	624	32X
BF16	FP32	312	16X	624	32X
FP16	FP16	312	16X	624	32X
INT8	INT32	624	32X	1248	64X

Table 1. Performance of Sparse Tensor Cores in the NVIDIA Ampere Architecture.

2:4 structured sparse networks maintain accuracy

Of course, performance is pointless without good accuracy. We’ve developed a simple training workflow that can easily generate a 2:4 structured sparse network matching the accuracy of the dense network:

Start with a dense network. The goal is to start with a known-good model whose weights have converged to give useful results.
On the dense network, prune the weights to satisfy the 2:4 structured sparsity criteria. Out of every four elements, remove just two.
Repeat the original training procedure.

This workflow uses one-shot pruning in Step 2. After the pruning stage, the sparsity pattern is fixed. There are many ways to make pruning decisions. Which weights should stay, and which should be forced to zero? We’ve found that a simple answer works well: weight magnitude. We prefer to prune values that are already close to zero.

As you might expect, suddenly turning half of the weights in a network to zero can affect the network’s accuracy. Step 3 recovers that accuracy with enough weight update steps to let the weights converge and a high enough learning rate to let the weights move around sufficiently. This recipe works incredibly well. Across a wide range of networks, it generates a sparse model that maintains the accuracy of the dense network from Step 1.

Table 2 has a sample of FP16 accuracy results that we obtained using this workflow implemented in the PyTorch Library Automatic SParsity (ASP). For more information about the full results for both FP16 and INT8, see the Accelerating Sparse Deep Neural Networks whitepaper.

Network	Data Set	Metric	Dense FP16	Sparse FP16
ResNet-50	ImageNet	Top-1	76.1	76.2
ResNeXt-101_32x8d	ImageNet	Top-1	79.3	79.3
Xception	ImageNet	Top-1	79.2	79.2
SSD-RN50	COCO2017	bbAP	24.8	24.8
MaskRCNN-RN50	COCO2017	bbAP	37.9	37.9
FairSeq Transformer	EN-DE WMT’14	BLEU	28.2	28.5
BERT-Large	SQuAD v1.1	F1	91.9	91.9

Table 2. Sample accuracy of 2:4 structured sparse networks trained with our recipe.

Case study: ResNeXt-101_32x8d

Here’s how easy the workflow is to use with ResNeXt-101_32x8d as a target.

Generating the sparse model

You use the torchvision pretrained model, so step 1 is done already. Because you’re using ASP, the first code change is to import the library:

try:
    from apex.contrib.sparsity import ASP
except ImportError:
    raise RuntimeError("Failed to import ASP. Please install Apex from https:// github.com/nvidia/apex .")

Load the pretrained model for this training run. Instead of training the dense weights, though, prune the model and prepare the optimizer before the training loop (step 2 of the workflow):

ASP.prune_trained_model(model, optimizer)
print("Start training")
    start_time = time.time()
    for epoch in range(args.start_epoch, args.epochs):
    ...

That’s it. The training loop proceeds as normal with the default command augmented to begin with the pretrained model, which reuses the original hyperparameters and optimizer settings for the retraining:

python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py
    --model resnext101_32x8d --epochs 100 --pretrained True

When training completes (Step 3), the network accuracy should have recovered to match that of the pretrained model, as shown in Table 2. As usual, the best-performing checkpoint may not be from the final epoch.

Preparing for inference

For inference, use TensorRT 8.0 to import the trained model’s sparse checkpoint. The model needs to be converted from the native framework format into the ONNX format before importing into TensorRT. Conversion can be done by following the notebooks in the quickstart/IntroNotebooks GitHub repo.

We have already converted the sparse ResNeXt-101_32x8d to ONNX format. You can download this model from NGC. If you don’t have NGC installed, use the following command to install NGC:

cd /usr/local/bin && wget https://ngc.nvidia.com/downloads/ngccli_cat_linux.zip && unzip ngccli_cat_linux.zip && chmod u+x ngc && rm ngccli_cat_linux.zip ngc.md5 && echo "no-apikeynasciin" | ngc config set

After NGC is installed, download sparse ResNeXt-101_32x8d in ONNX format by running the following command:

ngc registry model download-version nvidia/resnext101_32x8d_dense_onnx:1"

To import the ONNX model into TensorRT, clone the TensorRT repo and set up the Docker environment, as mentioned in the NVIDIA/TensorRT readme.

After you are in the TensorRT root directory, convert the sparse ONNX model to TensorRT engine using trtexec. Make a directory to store the model and engine:

cd /workspace/TensorRT/
mkdir model

Copy the downloaded ResNext ONNX model to the /workspace/TensorRT/model directory and then execute the trtexec command as follows:

./workspace/TensorRT/build/out/trtexec  --onnx=/workspace/TensorRT/model/resnext101_32x8d_sparse_fp32.onnx  --saveEngine=/workspace/TensorRT/model/resnext101_engine.trt  
--explicitBatch 
--sparsity=enable 
--fp16

A new file named resnext101_engine.trt is created at /workspace/TensorRT/model/. The resnext101_engine.trt file can now be serialized to perform inference by one of the following methods:

TensorRT runtime in C++ or Python, as shown in this example notebook
NVIDIA Triton Inference Server

Performance in TensorRT 8.0

Benchmarking this sparse model in TensorRT 8.0 on an A100 GPU at various batch sizes shows two important trends:

Performance benefits increase with the amount of work that the A100 is doing. Larger batch sizes generally lead to larger improvements, approaching 20% at the high end.
At smaller batch sizes, where the A100 clock speeds can stay low, using sparsity allows them to be pushed even lower for the same performance, which results in power efficiency improvements greater than the performance itself, leading to up to a 36% performance/watt gain.

Don’t forget, this network has the exact same accuracy as the dense baseline. This extra efficiency and performance doesn’t require penalizing accuracy.

A column chart showing inference performance and performance-per-watt improvements of the sparse network compared to a dense network over a number of batch sizes on an A100 GPU running in TensorRT 8.0 in fp16 precision. — *Figure 2. Sparsity improvements in performance and power efficiency (with dense as a baseline)*

Summary

Sparsity is popular in neural network compression and simplification research. Until now, though, fine-grained sparsity has not delivered on its promise of performance andaccuracy. We developed 2:4 fine-grained structured sparsity and built support directly into NVIDIA Ampere Architecture Sparse Tensor Cores. With this simple, three-step sparse retraining workflow, you can generate sparse neural networks that match the baseline accuracy, and TensorRT 8.0 accelerates them by default.

For more information, see the Making the Most of Structured Sparsity in the NVIDIA Ampere Architecture GTC2021 session, all about accelerating sparsity in the NVIDIA Ampere Architecture, or read the Accelerating Sparse Deep Neural Networks whitepaper.

Ready to jump in and try 2:4 sparsity on your own networks? The Automatic SParsity (ASP) PyTorch library makes it easy to generate a sparse network, and TensorRT 8.0 can deploy them efficiently.

To learn more about TensorRT 8.0 and it’s new features, see the Accelerate Deep Learning Inference with TensorRT 8.0 GTC’21 session or the TensorRT page.

Misc

TensorFlow IO Kit From Google

Post author By
Post date July 20, 2021
No Comments on TensorFlow IO Kit From Google

submitted by /u/Prabeen1
[visit reddit] [comments]

Misc

Model was constructed with shape (None, 100) for input but it was called on an input with incompatible shape (None, 1).

Post author By
Post date July 20, 2021
No Comments on Model was constructed with shape (None, 100) for input but it was called on an input with incompatible shape (None, 1).

Hello,

I’m working on a project where I aim to give an image on the night sky to my neural network, and it should tell me how many satellites are in the frame.

I’ve trained my network to learn what a satellite looks like by giving it 10×10 pxl cropped images, and so my model takes 100 input. I’m doing it as a classification problem, and so it gives me the probability of the 10×10 pxl image to contain a satellite.

I’ve trained the network with no issue, I’ve done the testing with no issue, but when it comes to actually giving it a single 10x10pxl image to give me an answer, I get this error message:

ValueError: Input 0 of layer sequential is incompatible with the layer: expected axis -1 of input shape to have value 100 but received input with shape (None, 1)

This is what my code looks like for the model training and testing:

visible = Input(shape=100) Hidden1 = Dense(32, activation = 'relu')(visible) Hidden2 = Dense(64, activation = 'relu')(Hidden1) Hidden3 = Dense(128, activation = 'relu')(Hidden2) Output = Dense(1, activation = 'softplus')(Hidden3) model = Model(inputs=visible, outputs=Output) model.compile(loss='huber',optimizer='adam', metrics=['accuracy']) batch_size = 112 epochs = 200 model.fit(x=x_train,y=y_train, batch_size=batch_size,epochs=epochs) test_loss, test_acc = model.evaluate(x_test, y_test) >Test Loss: 0.06204257160425186, Test Accuracy: 0.8361972570419312

I have written a little code that crops full size frames (480×640) into a multitude of 10×10 crops, and I want to feed each crop to the network to then tell me which crops contain a satellite.

# image size is 480*640 height = 480 length = 640 X_Try = [] for a in range(0, height, 10): for b in range(0, length, 10): img = image.crop((a, b, a+10, b+10)) img = np.array(img)/255.0 #to normalize my values img = img.flatten() #to make the matrix into a single array X_Try.append(img) print(len(X_Try)) >3072 print(len(X_Try[0])) >100

My X_Try is populated with arrays of shape (100,1), and my logic is that I should just give each of those arrays to my model to predict, and that should output a y_pred of shape (3072,1) containing the probability of each crop to contain a satellite.

However, my problem appears here:

y_pred=[] for f in range(0,len(X_Try)): y_p = model.predict(X_Try[f]) y_pred.append(y_p)

This is then where the error message appears: it seems to be of the opinion that I’m giving something of shape (None, 1), and it only takes input of shape (None,100), however I’ve checked before and all of my arrays in X_Try are indeed of size 100.

I’ve also tried with a single crop, in case it is the for loop that’s causing issue:

y_pred = model.predict(X_Try[0])

but that gives me the same error.

I’ve looked online for similar issue but most people seem to have that problem during the training or the testing, not at the prediction.

Could someone guide me in the right direction?

EDIT: I forgot to add that, but I find that doing

y_pred = model.predict(x_test)

does work, where x_test has a shape of (4481,100,1) but for some reason doing

y_pred = model.predict(X_Try)

doesn’t work, where X_Try has a shape of (3072,100,1) doesn’t work and gives me the error message:

ValueError: Layer model_2 expects 1 input(s), but it received 3072 input tensors.

submitted by /u/flaflou
[visit reddit] [comments]

Misc

TensorFlow introduction that works with Java

Post author By
Post date July 19, 2021
No Comments on TensorFlow introduction that works with Java

I want to learn TensorFlow, for use in my Java projects. I have not found any official TensorFlow tutorials and introductions that use Java as the language. I can use one of the Python tutorials, and translate into Java as I go, but I would appreciate if anyone can point me in the direction of an official Java tutorial.

submitted by /u/OliverHPerry
[visit reddit] [comments]

Misc

TensorFlow giving error: Input is empty

Post author By
Post date July 19, 2021
No Comments on TensorFlow giving error: Input is empty

Hello, I am trying to run the resnet model on custom images (transfer learning).

My directory tree looks like this:

train

class1

image1

image2

….

class2

image1

image2

…

val

class1

image1

image2

….

class2

image1

image2

…

And I created the tensorflow datasets like this:

train_ds = tf.keras.preprocessing.image_dataset_from_directory( “train”, labels=’inferred’, label_mode=’int’, image_size=(img_height, img_width), batch_size=batch_size)

val_ds = tf.keras.preprocessing.image_dataset_from_directory( “val”, labels=’inferred’, label_mode=’int’, image_size=(img_height, img_width), batch_size=batch_size)

Then I try to display the images for reference, running:

import matplotlib.pyplot as plt

class_names = val_ds.class_names

plt.figure(figsize=(10, 10)) for images, labels in val_ds.take(1): for i in range(30): ax = plt.subplot(5, 7, i + 1) plt.imshow(images[i].numpy().astype(“uint8”)) plt.title(class_names[labels[i]]) plt.axis(“off”)

But with val_ds, I get the error “Input is empty”, while with train_ds it actually shows the images. Anyone knows a possible reason why behind it? The dataset I am using is here – https://github.com/xuequanlu/I-Nema – and I have converted all the .tif images to .jpg.

Thanks in advance!

EDIT: here is the error log : https://pastecode.io/s/82hk68ar

submitted by /u/quaranprove
[visit reddit] [comments]

Misc

MONAI Expands its Horizons with Healthcare Imaging Annotation

Post author By
Post date July 19, 2021
No Comments on MONAI Expands its Horizons with Healthcare Imaging Annotation

Deep learning models have been successfully used in medical image analysis problems but they require a large, curated amount of labeled images to obtain good performance. Creating such annotations are tedious, time-consuming and typically require clinical expertise. To address this gap, Project MONAI has released MONAI Label v0.1 – an intelligent open source image labeling … Continued

To address this gap, Project MONAI has released MONAI Label v0.1 – an intelligent open source image labeling and learning tool that helps researchers and clinicians collaborate, create annotated datasets easily and quickly, and build AI models in a standardized MONAI paradigm.

MONAI Label enables the adaptation of AI models to the clinical task at hand by continuously learning from the user’s interactions and new labels. It powers an AI Assisted annotation experience, allowing researchers and developers to make continuous improvements to their applications with iterative feedback from clinicians who are typically the end users of the medical imaging AI models.

At the Children’s Hospital of Philadelphia (CHOP), Dr. Matthew Jolley explains how they are innovating and driving clinical impact with machine learning algorithms.

“Children with congenital heart disease demonstrate a broad range of anatomy, and there are few readily available tools to facilitate image-based structural phenotyping and patient specific planning of complex cardiac interventions. However, currently 3D image-based heart model creation is slow, even in the hands of experienced modelers. As such, we have been working to develop machine learning algorithms to create models of heart valves in children with congenital heart disease, such as the tricuspid valve in hypoplastic left heart syndrome. Ongoing development of automation based on machine learning will allow rapid modeling and precise quantification of how a dysfunctional valve differs from normal valves across multiple parameters. That “structural valve profile” for an individual can then be contextualized within the spectrum of anatomy and function we see in the population, which may eventually inform improved medical decision making and interventions for children.”

With MONAI Label we envision creating a community of researchers and clinicians like Dr. Jolley and his team who can build upon a well maintained software foundation that will accelerate collaboration through continuous learning. The MONAI Label team and CHOP collaborated through a Slicer week project, and successfully developed a MONAI Label application for leaflet segmentation of heart valves in 3D echocardiographic (3DE) images. The team is now working to deploy this model as a MONAI Label application on a public facing server at CHOP where clinicians can directly interface with the model and trigger a training loop for adaptation – learn more.

It is incredibly important for an open source initiative like Project MONAI to have clinicians in the loop as we converge to develop a common set of best practices for AI lifecycle management in healthcare imaging. To quote Dr. Jolley:

“Open-source frameworks like Project MONAI provide a standardized, transparent, and reproducible template for the creation of, and deployment of medical imaged-focused machine learning models, potentiating efforts such as ours. They allow us to focus on investigating novel algorithms and their application, rather than developing and maintaining software infrastructure. This in turn has accelerated research progress which we are actively translating into tools of practical relevance to the pediatric community we serve.”

WHAT IS INCLUDED IN MONAI LABEL V0.1

MONAI Label is an open-source server-client system that is easy to set up and can run locally on a machine with one or two GPUs. The initial release does not yet support multiple user sessions, therefore both server and client operate on the same machine.

MONAI Label delivers on MONAI’s core promise of being modular, Pythonic, extensible, easy to debug, user friendly, and portable.

Figure 1. MONAI Label provides interfaces which can be implemented by the label app developer for custom functionality as well as utilities which are readily usable in the labeling app

MONAI v0.1 includes:

MONAI Label server: REST API server that facilitates communication with the viewer clients i.e. (Slicer, OHIF, etc).
MONAI Label sample applications that can be adapted to a given clinical task for MR/CT modality including DeepGrow and DeepEdit which are developed by MONAI researchers.
With v0.1, we have released a 3DSlicer plugin for MONAI Label to kick start users in the MONAI Label experience

Future releases of NVIDIA Clara AIAA will also leverage the MONAI Label framework. We continue to bring together development efforts for NVIDIA Clara medical imaging tools and MONAI to deliver domain-optimized, robust software tools for researchers and developers in healthcare imaging.

With contributions from an engaged community, MONAI Label aims to reduce the cost of labeling and maximize the collaboration between researchers & clinicians. Get started today with sample applications available on the MONAI Label GitHub and follow along with our step-by-step getting started guide available in the MONAI Label Documentation.

Misc

Transforming Brain Waves into Words with AI

Post author By
Post date July 19, 2021
No Comments on Transforming Brain Waves into Words with AI

Diagram of neuroprosthesis device New research out of the University of California, San Francisco has given a paralyzed man the ability to communicate by translating his brain signals into computer generated writing. The study, published in The New England Journal of Medicine, marks a significant milestone toward restoring communication for people who have lost the ability to speak. “To … Continued Diagram of neuroprosthesis device

New research out of the University of California, San Francisco has given a paralyzed man the ability to communicate by translating his brain signals into computer generated writing. The study, published in The New England Journal of Medicine, marks a significant milestone toward restoring communication for people who have lost the ability to speak.

“To our knowledge, this is the first successful demonstration of direct decoding of full words from the brain activity of someone who is paralyzed and cannot speak,” senior author and the Joan and Sanford Weill Chair of Neurological Surgery at UCSF, Edward Chang said in a press release. “It shows strong promise to restore communication by tapping into the brain’s natural speech machinery.”

Some with speech limitations use assistive devices–such as touchscreens, keyboards, or speech-generating computers to communicate. However, every year thousands lose their speech ability from paralysis or brain damage, leaving them unable to use assistive technologies.

The participant lost his ability to speak in 2003, paralyzed by a brain stroke following a car accident. The researchers were not sure if his brain retained neural activity linked to speech. To track his brain signals, a neuroprosthetic device consisting of electrodes was positioned on the left side of the brain, across several regions known for speech processing.

Over about four months the team embarked on 50 training sessions, where the participant was prompted to say individual words, form sentences, or respond to questions on a display screen. While responding to the prompts, the electrode device captured neural activity and transmitted the information to a computer with custom software.

“Our models needed to learn the mapping between complex brain activity patterns and intended speech. That poses a major challenge when the participant can’t speak,” David Moses, a postdoctoral engineer in the Chang lab and one of the lead authors of the study, said in a press release.

To decode the responses from his brain activity, the team created speech-detection and word classification models. Using the cuDNN-accelerated TensorFlow framework and 32 NVIDIA V100 Tensor Core GPUs the researchers trained, fine-tuned, and evaluated the models.

“Utilizing neural networks was essential to getting the classification and detection performance we did, and our final product was the result of lots of experimentation,’ said study co-lead Sean Metzger. “Because our dataset was constantly evolving and growing, being able to adapt the models we were using was critical. The GPUs helped us make changes, monitor progress, and understand our dataset.”

With up to 93% accuracy, and a median rate of 75%, the model decoded the participants word’s at a rate of up to 18 per minute.

“We want to get to 1,000 words, and eventually all words. This is just the starting point,” Chang said.

The study builds off previous work by Chang and his colleagues, which developed a deep learning method for decoding and converting brain signals. Unlike the current work, participants in the previous study were able to speak.

Read more >>>
Read the full article in The New England Journal of Medicine >>>

Offsites

Google at ICML 2021

Posted by Cat Armato and Jaqui Herman, Program Managers

Groups across Google are actively pursuing research across the field of machine learning, ranging from theory to application. With scalable tools and architectures, we build machine learning systems to solve deep scientific and engineering challenges in areas of language, music, visual processing, and more.

Google is proud to be a Platinum Sponsor of the thirty-eighth International Conference on Machine Learning (ICML 2021), a premier annual event happening this week. As a leader in machine learning research — with over 100 accepted publications and Googlers participating in workshops — we look forward to our continued partnership with the broader machine learning research community.

Registered for ICML 2021? We hope you’ll visit the Google virtual booth to learn more about the exciting work, creativity, and fun that goes into solving a portion of the field’s most interesting challenges. Take a look below to learn more about the Google research being presented at ICML 2021 (Google affiliations in bold).

Organizing Committee
ICML Board Members include: Corinna Cortes, Hugo Larochelle, Shakir Mohamed
ICML Emeritus Board includes: William Cohen, Andrew McCallum
Tutorial Co-Chair member: Quoc Lee

Publications
Attention Is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth
Yihe Dong, Jean-Baptiste Cordonnier, Andreas Loukas

Scalable Evaluation of Multi-Agent Reinforcement Learning with Melting Pot
Joel Z. Leibo, Edgar Duéñez-Guzmán, Alexander Sasha Vezhnevets, John P. Agapiou, Peter Sunehag, Raphael Koster, Jayd Matyas, Charles Beattie, Igor Mordatch, Thore Graepel

On the Optimality of Batch Policy Optimization Algorithms
Chenjun Xiao, Yifan Wu, Tor Lattimore, Bo Dai, Jincheng Mei, Lihong Li*, Csaba Szepesvari, Dale Schuurmans

Low-Rank Sinkhorn Factorization
Meyer Scetbon, Marco Cuturi, Gabriel Peyré

Oops I Took A Gradient: Scalable Sampling for Discrete Distributions
Will Grathwohl, Kevin Swersky, Milad Hashemi, David Duvenaud, Chris J. Maddison

PID Accelerated Value Iteration Algorithm
Amir-Massoud Farahmand, Mohammad Ghavamzadeh

Dueling Convex Optimization
Aadirupa Saha, Tomer Koren, Yishay Mansour

What Are Bayesian Neural Network Posteriors Really Like?
Pavel Izmailov, Sharad Vikram, Matthew D. Hoffman, Andrew Gordon Wilson

Offline Reinforcement Learning with Pseudometric Learning
Robert Dadashi, Shideh Rezaeifar, Nino Vieillard, Léonard Hussenot, Olivier Pietquin, Matthieu Geist

Revisiting Rainbow: Promoting More Insightful and Inclusive Deep Reinforcement Learning Research (see blog post)
Johan S. Obando-Ceron, Pablo Samuel Castro

EMaQ: Expected-Max Q-Learning Operator for Simple Yet Effective Offline and Online RL
Seyed Kamyar Seyed Ghasemipour*, Dale Schuurmans, Shixiang Shane Gu

Variational Data Assimilation with a Learned Inverse Observation Operator
Thomas Frerix, Dmitrii Kochkov, Jamie A. Smith, Daniel Cremers, Michael P. Brenner, Stephan Hoyer

Tilting the Playing Field: Dynamical Loss Functions for Machine Learning
Michael E. Sander, Pierre Ablin, Mathieu Blondel, Gabriel Peyré

Model-Based Reinforcement Learning via Latent-Space Collocation
Oleh Rybkin, Chuning Zhu, Anusha Nagabandi, Kostas Daniilidis, Igor Mordatch, Sergey Levine

Momentum Residual Neural Networks
Michael E. Sander, Pierre Ablin, Mathieu Blondel, Gabriel Peyré

OmniNet: Omnidirectional Representations from Transformers
Yi Tay, Mostafa Dehghani, Vamsi Aribandi, Jai Gupta, Philip Pham, Zhen Qin, Dara Bahri, Da-Cheng Juan, Donald Metzler

Synthesizer: Rethinking Self-Attention for Transformer Models
Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, Che Zheng

Towards Domain-Agnostic Contrastive Learning
Vikas Verma, Minh-Thang Luong, Kenji Kawaguchi, Hieu Pham, Quoc V. Le

Randomized Entity-wise Factorization for Multi-Agent Reinforcement Learning
Shariq Iqbal, Christian A. Schroeder de Witt, Bei Peng, Wendelin Böhmer, Shimon Whiteson, Fei Sha

LIME: Learning Inductive Bias for Primitives of Mathematical Reasoning
Yuhuai Wu, Markus Rabe, Wenda Li, Jimmy Ba, Roger Grosse, Christian Szegedy

Emergent Social Learning via Multi-agent Reinforcement Learning
Kamal Ndousse, Douglas Eck, Sergey Levine, Natasha Jaques

Improved Contrastive Divergence Training of Energy-Based Models
Yilun Du, Shuang Li, Joshua Tenenbaum, Igor Mordatch

Characterizing Structural Regularities of Labeled Data in Overparameterized Models
Ziheng Jiang*, Chiyuan Zhang, Kunal Talwar, Michael Mozer

Actionable Models: Unsupervised Offline Reinforcement Learning of Robotic Skills
Yevgen Chebotar, Karol Hausman, Yao Lu, Ted Xiao, Dmitry Kalashnikov, Jake Varley, Alex Irpan, Benjamin Eysenbach, Ryan Julian, Chelsea Finn, Sergey Levine

PsiPhi-Learning: Reinforcement Learning with Demonstrations using Successor Features and Inverse Temporal Difference Learning
Angelos Filos, Clare Lyle, Yarin Gal, Sergey Levine, Natasha Jaques, Gregory Farquhar

EfficientNetV2: Smaller Models and Faster Training
Mingxing Tan, Quoc V. Le

Unbiased Gradient Estimation in Unrolled Computation Graphs with Persistent Evolution Strategies
Paul Vicol, Luke Metz, Jascha Sohl-Dickstein

Federated Composite Optimization
Honglin Yuan*, Manzil Zaheer, Sashank Reddi

Light RUMs
Flavio Chierichetti, Ravi Kumar, Andrew Tomkins

Catformer: Designing Stable Transformers via Sensitivity Analysis
Jared Quincy Davis, Albert Gu, Krzysztof Choromanski, Tri Dao, Christopher Re, Chelsea Finn, Percy Liang

Representation Matters: Offline Pretraining for Sequential Decision Making
Mengjiao Yang, Ofir Nachum

Variational Empowerment as Representation Learning for Goal-Conditioned Reinforcement Learning
Jongwook Choi*, Archit Sharma*, Honglak Lee, Sergey Levine, Shixiang Shane Gu

Beyond Variance Reduction: Understanding the True Impact of Baselines on Policy Optimization
Wesley Chung, Valentin Thomas, Marlos C. Machado, Nicolas Le Roux

Whitening and Second Order Optimization Both Make Information in the Dataset Unusable During Training, and Can Reduce or Prevent Generalization
Neha S. Wadia, Daniel Duckworth, Samuel S. Schoenholz, Ethan Dyer, Jascha Sohl-Dickstein

Understanding Invariance via Feedforward Inversion of Discriminatively Trained Classifiers
Piotr Teterwak*, Chiyuan Zhang, Dilip Krishnan, Michael C. Mozer

Policy Information Capacity: Information-Theoretic Measure for Task Complexity in Deep Reinforcement Learning
Hiroki Furuta, Tatsuya Matsushima, Tadashi Kozuno, Yutaka Matsuo, Sergey Levine, Ofir Nachum, Shixiang Shane Gu

Hyperparameter Selection for Imitation Learning
Leonard Hussenot, Marcin Andrychowicz, Damien Vincent, Robert Dadashi, Anton Raichuk, Lukasz Stafiniak, Sertan Girgin, Raphael Marinier, Nikola Momchev, Sabela Ramos, Manu Orsini, Olivier Bachem, Matthieu Geist, Olivier Pietquin

Disentangling Sampling and Labeling Bias for Learning in Large-Output Spaces
Ankit Singh Rawat, Aditya Krishna Menon, Wittawat Jitkrittum, Sadeep Jayasumana, Felix X. Yu, Sashank J. Reddi, Sanjiv Kumar

Revenue-Incentive Tradeoffs in Dynamic Reserve Pricing
Yuan Deng, Sebastien Lahaie, Vahab Mirrokni, Song Zuo

Debiasing a First-Order Heuristic for Approximate Bi-Level Optimization
Valerii Likhosherstov, Xingyou Song, Krzysztof Choromanski, Jared Davis, Adrian Weller

Characterizing the Gap Between Actor-Critic and Policy Gradient
Junfeng Wen, Saurabh Kumar, Ramki Gummadi, Dale Schuurmans

Composing Normalizing Flows for Inverse Problems
Jay Whang, Erik Lindgren, Alexandros Dimakis

Online Policy Gradient for Model Free Learning of Linear Quadratic Regulators with √T Regret
Asaf Cassel, Tomer Koren

Learning to Price Against a Moving Target
Renato Paes Leme, Balasubramanian Sivan, Yifeng Teng, Pratik Worah

Fairness and Bias in Online Selection
Jose Correa, Andres Cristi, Paul Duetting, Ashkan Norouzi-Fard

The Impact of Record Linkage on Learning from Feature Partitioned Data
Richard Nock, Stephen Hardy, Wilko Henecka, Hamish Ivey-Law, Jakub Nabaglo, Giorgio Patrini, Guillaume Smith, Brian Thorne

Reserve Price Optimization for First Price Auctions in Display Advertising
Zhe Feng*, Sébastien Lahaie, Jon Schneider, Jinchao Ye

A Regret Minimization Approach to Iterative Learning Control
Naman Agarwal, Elad Hazan, Anirudha Majumdar, Karan Singh

A Statistical Perspective on Distillation
Aditya Krishna Menon, Ankit Singh Rawat, Sashank J. Reddi, Seungyeon Kim, Sanjiv Kumar

Best Model Identification: A Rested Bandit Formulation
Leonardo Cella, Massimiliano Pontil, Claudio Gentile

Generalised Lipschitz Regularisation Equals Distributional Robustness
Zac Cranko, Zhan Shi, Xinhua Zhang, Richard Nock, Simon Kornblith

Stochastic Multi-Armed Bandits with Unrestricted Delay Distributions
Tal Lancewicki, Shahar Segal, Tomer Koren, Yishay Mansour

Regularized Online Allocation Problems: Fairness and Beyond
Santiago Balseiro, Haihao Lu, Vahab Mirrokni

Implicit Rate-Constrained Optimization of Non-decomposable Objectives
Abhishek Kumar, Harikrishna Narasimhan, Andrew Cotter

Leveraging Non-uniformity in First-Order Non-Convex Optimization
Jincheng Mei, Yue Gao, Bo Dai, Csaba Szepesvari, Dale Schuurmans

Dynamic Balancing for Model Selection in Bandits and RL
Ashok Cutkosky, Christoph Dann, Abhimanyu Das, Claudio Gentile, Aldo Pacchiano, Manish Purohit

Adversarial Dueling Bandits
Aadirupa Saha, Tomer Koren, Yishay Mansour

Optimizing Black-Box Metrics with Iterative Example Weighting
Gaurush Hiranandani*, Jatin Mathur, Harikrishna Narasimhan, Mahdi Milani Fard, Oluwasanmi Koyejo

Relative Deviation Margin Bounds
Corinna Cortes, Mehryar Mohri, Ananda Theertha Suresh

MC-LSTM: Mass-Conserving LSTM
Pieter-Jan Hoedt, Frederik Kratzert, Daniel Klotz, Christina Halmich, Markus Holzleitner, Grey Nearing, Sepp Hochreiter, Günter Klambauer

12-Lead ECG Reconstruction via Koopman Operators
Authors:Tomer Golany, Kira Radinsky, Daniel Freedman, Saar Minha

Finding Relevant Information via a Discrete Fourier Expansion
Mohsen Heidari, Jithin Sreedharan, Gil Shamir, Wojciech Szpankowski

LEGO: Latent Execution-Guided Reasoning for Multi-Hop Question Answering on Knowledge Graphs
Hongyu Ren, Hanjun Dai, Bo Dai, Xinyun Chen, Michihiro Yasunaga, Haitian Sun, Dale Schuurmans, Jure Leskovec, Denny Zhou

SpreadsheetCoder: Formula Prediction from Semi-structured Context
Xinyun Chen, Petros Maniatis, Rishabh Singh, Charles Sutton, Hanjun Dai, Max Lin, Denny Zhou

Combinatorial Blocking Bandits with Stochastic Delays
Alexia Atsidakou, Orestis Papadigenopoulos, Soumya Basu, Constantine Caramani, Sanjay Shakkottai

Beyond log2(T) Regret for Decentralized Bandits in Matching Markets
Soumya Basu, Karthik Abinav Sankararaman, Abishek Sankararaman

Robust Pure Exploration in Linear Bandits with Limited Budget
Ayya Alieva, Ashok Cutkosky, Abhimanyu Das

Latent Programmer: Discrete Latent Codes for Program Synthesis
Joey Hong, David Dohan, Rishabh Singh, Charles Sutton, Manzil Zaheer

Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision (see blog post)
Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, Tom Duerig

On Linear Identifiability of Learned Representations
Geoffrey Roeder, Luke Metz, Diederik P. Kingma

Hierarchical Clustering of Data Streams: Scalable Algorithms and Approximation Guarantees
Anand Rajagopalan, Fabio Vitale, Danny Vainstein, Gui Citovsky, Cecilia M Procopiuc, Claudio Gentile

Differentially Private Quantiles
Jennifer Gillenwater, Matthew Joseph, Alex Kulesza

Active Covering
Heinrich Jiang, Afshin Rostamizadeh

Sharf: Shape-Conditioned Radiance Fields from a Single View
Konstantinos Rematas, Ricardo Martin-Brualla, Vittorio Ferrari

Learning a Universal Template for Few-Shot Dataset Generalization
Eleni Triantafillou*, Hugo Larochelle, Richard Zemel, Vincent Dumoulin

Private Alternating Least Squares: Practical Private Matrix Completion with Tighter Rates
Steve Chien, Prateek Jain, Walid Krichene, Steffen Rendle, Shuang Song, Abhradeep Thakurta, Li Zhang

Differentially-Private Clustering of Easy Instances
Edith Cohen, Haim Kaplan, Yishay Mansour, Uri Stemmer, Eliad Tsfadia

Label-Only Membership Inference Attacks
Christopher A. Choquette-Choo, Florian Tramèr, Nicholas Carlini, Nicolas Papernot

Neural Feature Matching in Implicit 3D Representations
Yunlu Chen, Basura Fernando, Hakan Bilen, Thomas Mensink, Efstratios Gavves

Locally Private k-Means in One Round
Alisa Chang, Badih Ghazi, Ravi Kumar, Pasin Manurangsi

Large-Scale Meta-Learning with Continual Trajectory Shifting
Jaewoong Shin, Hae Beom Lee, Boqing Gong, Sung Ju Hwang

Statistical Estimation from Dependent Data
Vardis Kandiros, Yuval Dagan, Nishanth Dikkala, Surbhi Goel, Constantinos Daskalakis

Oneshot Differentially Private Top-k Selection
Gang Qiao, Weijie J. Su, Li Zhang

Unsupervised Part Representation by Flow Capsules
Sara Sabour, Andrea Tagliasacchi, Soroosh Yazdani, Geoffrey E. Hinton, David J. Fleet

Private Stochastic Convex Optimization: Optimal Rates in L1 Geometry
Hilal Asi, Vitaly Feldman, Tomer Koren, Kunal Talwar

Practical and Private (Deep) Learning Without Sampling or Shuffling
Peter Kairouz, Brendan McMahan, Shuang Song, Om Thakkar, Abhradeep Thakurta, Zheng Xu

Differentially Private Aggregation in the Shuffle Model: Almost Central Accuracy in Almost a Single Message
Badih Ghazi, Ravi Kumar, Pasin Manurangsi, Rasmus Pagh, Amer Sinha

Leveraging Public Data for Practical Private Query Release
Terrance Liu, Giuseppe Vietri, Thomas Steinke, Jonathan Ullman, Zhiwei Steven Wu

Meta-Thompson Sampling
Branislav Kveton, Mikhail Konobeev, Manzil Zaheer, Chih-wei Hsu, Martin Mladenov, Craig Boutilier, Csaba Szepesvári

Implicit-PDF: Non-Parametric Representation of Probability Distributions on the Rotation Manifold
Kieran A Murphy, Carlos Esteves, Varun Jampani, Srikumar Ramalingam, Ameesh Makadia

Improving Ultrametrics Embeddings Through Coresets
Vincent Cohen-Addad, Rémi de Joannis de Verclos, Guillaume Lagarde

A Discriminative Technique for Multiple-Source Adaptation
Corinna Cortes, Mehryar Mohri, Ananda Theertha Suresh, Ningshan Zhang

Self-Supervised and Supervised Joint Training for Resource-Rich Machine Translation
Yong Cheng, Wei Wang*, Lu Jiang, Wolfgang Macherey

Correlation Clustering in Constant Many Parallel Rounds
Vincent Cohen-Addad, Silvio Lattanzi, Slobodan Mitrović, Ashkan Norouzi-Fard, Nikos Parotsidis, Jakub Tarnawski

Hierarchical Agglomerative Graph Clustering in Nearly-Linear Time
Laxman Dhulipala, David Eisenstat, Jakub Łącki, Vahab Mirrokni, Jessica Shi

Meta-Learning Bidirectional Update Rules
Mark Sandler, Max Vladymyrov, Andrey Zhmoginov, Nolan Miller, Andrew Jackson, Tom Madams, Blaise Aguera y Arcas

Discretization Drift in Two-Player Games
Mihaela Rosca, Yan Wu, Benoit Dherin, David G.T. Barrett

Reasoning Over Virtual Knowledge Bases With Open Predicate Relations
Haitian Sun*, Pat Verga, Bhuwan Dhingra, Ruslan Salakhutdinov, William W. Cohen

Learn2Hop: Learned Optimization on Rough Landscapes
Amil Merchant, Luke Metz, Samuel Schoenholz, Ekin Cubuk

Locally Adaptive Label Smoothing Improves Predictive Churn
Dara Bahri, Heinrich Jiang

Overcoming Catastrophic Forgetting by Bayesian Generative Regularization
Patrick H. Chen, Wei Wei, Cho-jui Hsieh, Bo Dai

Workshops (only Google affiliations are noted)
LatinX in AI (LXAI) Research at ICML 2021
Hosts: Been Kim, Natasha Jaques

Uncertainty and Robustness in Deep Learning
Organizers: Balaji Lakshminarayanan, Jasper Snoek Invited Speaker: Dustin Tran

Reinforcement Learning for Real Life
Organizers: Minmin Chen, Lihong Li Invited Speaker: Ed Chi

Interpretable Machine Learning in Healthcare
Organizers: Alan Karthikesalingam Invited Speakers: Abhijit Guha Roy, Jim Winkens

The Neglected Assumptions in Causal Inference
Organizer: Alexander D’Amour

ICML Workshop on Algorithmic Recourse
Invited Speakers: Been Kim, Berk Ustun

A Blessing in Disguise: The Prospects and Perils of Adversarial Machine Learning
Invited Speaker: Nicholas Carlini

Overparameterization: Pitfalls and Opportunities
Organizers: Yasaman Bahri, Hanie Sedghi

Information-Theoretic Methods for Rigorous, Responsible, and Reliable Machine Learning (ITR3)
Invited Speaker: Thomas Steinke

Beyond First-Order Methods in Machine Learning Systems
Invited Speaker: Courtney Paquette

ICML 2021 Workshop: Self-Supervised Learning for Reasoning and Perception
Invited Speaker: Chelsea Finn

Workshop on Reinforcement Learning Theory
Invited Speaker: Bo Dai

Tutorials (only Google affiliations are noted)
Responsible AI in Industry: Practical Challenges and Lessons Learned
Organizers: Ben Packer

Online and Non-stochastic Control
Organizers: Elad Hazan

Random Matrix Theory and ML (RMT +ML)
Organizers: Fabian Pedregosa, Jeffrey Pennington, Courntey Paquette Self-Attention for Computer Vision Organizers: Prajit Ramachandran, Ashish Vaswani

* Indicates work done while at Google