Improving Japanese Language ASR by Combining Convolutions with Attention Mechanisms

Automatic speech recognition (ASR) research generally focuses on high-resource languages such as English, which is supported by hundreds of thousands of hours of speech. Recent literature has renewed focus on more complex languages, such as Japanese. Like other Asian languages, Japanese has a vast base character set (upwards of 3,000 unique characters are used in common vernacular) and poses unique challenges, such as flexible word order.

To learn more about the technology discussed in this post, watch the author’s INTERSPEECH 2022 presentation on September 20, Neural Transducers, Streaming ASR, and Novel ASR Models.

This post discusses recent work that improved both accuracy and speed for Japanese language ASR. First, we improved Conformer, a state-of-the-art ASR neural network architecture, to achieve a significant improvement in training and inference speed without accuracy loss. Second, we enhanced a pure and deep convolutional network with a multi-head self-attention mechanism to enrich the learning of global contextual representations of the input speech.

Deep sparse conformer for speech recognition

Conformer is a neural network architecture widely used in ASR systems for numerous languages and has achieved high levels of accuracy. However, Conformer is relatively slow at both training and inference because it uses multi-head self-attention, whose time and memory complexity is quadratic in the length of the input audio.

This prevents efficient processing of long audio sequences, since relatively large memory footprints are required during training and inference. These limitations motivate the use of sparse attention to construct an efficient Conformer. In addition, with sparse attention and a relatively low memory cost, we are able to build a deeper network that can process long sequences from large-scale speech datasets.

Diagram showing encoder model architecture for the deep sparse Conformer. A wave file is first sent to the spectrogram augmentation (SpecAug) module and then to a convolutional subsampling network, followed by a linear projection and a dropout function for preprocessing. Then comes a stack of deep sparse Conformer Macaron-style building blocks, in which two feed-forward modules sandwich a multi-head sparse self-attention module and a convolution module.
Figure 1. Encoder model architecture for the deep sparse Conformer

As depicted in Figure 1, we improved the Conformer long-sequence representation ability in two directions: sparser and deeper. We used a ranking criterion to select only a small set of dominant queries instead of the whole query set, saving time when computing attention scores.
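To make the idea concrete, here is a minimal NumPy sketch of dominant-query selection. The ranking criterion shown (max minus mean of a query's attention scores, in the style of Informer's ProbSparse attention) is an assumption for illustration, not the authors' exact criterion, and a real kernel would rank queries from a sampled subset of keys rather than materializing the full score matrix.

```python
import numpy as np

def sparse_attention(Q, K, V, top_u):
    """Toy single-head sparse attention: rank the queries, run full softmax
    attention only for the top_u "dominant" ones, and give the rest a cheap
    mean-of-values context."""
    seq_len, d = Q.shape
    # In a real kernel the ranking uses a sampled subset of keys so the full
    # score matrix is never materialized; this toy computes it anyway.
    scores = Q @ K.T / np.sqrt(d)                        # (seq_len, seq_len)
    sparsity = scores.max(axis=1) - scores.mean(axis=1)  # assumed ranking criterion
    dominant = np.argsort(-sparsity)[:top_u]

    out = np.tile(V.mean(axis=0), (seq_len, 1))          # default context for the rest

    s = scores[dominant]
    w = np.exp(s - s.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    out[dominant] = w @ V                                # full attention for dominant queries
    return out

Q = np.random.randn(1000, 64); K = np.random.randn(1000, 64); V = np.random.randn(1000, 64)
print(sparse_attention(Q, K, V, top_u=50).shape)         # (1000, 64)
```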

A deep normalization strategy is used when performing residual connections to ensure that Conformer models with on the order of one hundred blocks can be trained. This strategy involves scaling the parameters of the encoder and decoder with functions of the number of encoder layers and decoder layers, respectively.

In addition, this deep normalization strategy ensures that models with 10 to 100 layers can be built successfully, making the model more expressive. Meanwhile, the deep sparse Conformer's time and memory costs decrease by 10% to 20% compared to the usual Conformer.
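As a rough illustration, the sketch below shows a depth-scaled residual connection of this kind. The scaling functions are assumptions in the style of DeepNet, parameterized by the number of encoder layers N and decoder layers M; the exact functions and constants used in this work may differ.

```python
import numpy as np

def deepnorm_residual(x, sublayer_out, alpha):
    """Depth-scaled residual connection: up-weight the residual stream by
    alpha before layer normalization so very deep stacks remain trainable."""
    y = alpha * x + sublayer_out
    mean = y.mean(axis=-1, keepdims=True)
    std = y.std(axis=-1, keepdims=True)
    return (y - mean) / (std + 1e-5)

# Assumed DeepNet-style scaling for an encoder-decoder model with
# N encoder layers and M decoder layers (constants may differ from this work):
def encoder_alpha(N, M):
    return 0.81 * (N ** 4 * M) ** (1.0 / 16.0)

def decoder_alpha(N, M):
    return (3 * M) ** 0.25
```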

Attention-enhanced Citrinet for speech recognition

Citrinet, proposed by NVIDIA researchers, is an end-to-end convolutional Connectionist Temporal Classification (CTC)-based ASR model. To capture local and global contextual information, Citrinet uses 1D time-channel separable convolutions combined with sub-word encoding and squeeze-and-excitation (SE), enabling the whole architecture to achieve state-of-the-art accuracies compared with transformer-based counterparts.

Applying Citrinet to Japanese ASR involves several challenges. Specifically, it converges relatively slowly and is harder to train to an accuracy comparable with similar deep neural network models. Considering that as many as 235 convolutional layers influence Citrinet's convergence speed, we aim to reduce the number of CNN layers by introducing multi-head attention in the convolution module of Citrinet blocks, while keeping the SE and residual modules unchanged.

Diagram showing the Citrinet end-to-end architecture and its major building block: a 1D time-channel separable convolution and SE-attached Jasper block, updated with a multi-head self-attention module with layer normalization, and with the repeat count reduced from 5 to 1.
Figure 2. The Citrinet end-to-end architecture and major building block

As shown in Figure 2, speeding up training involves removing eight convolution layers from each attention-enhanced Citrinet block. In addition, because the time and memory complexity of self-attention is quadratic in the length of the input audio, we reduced the original 23 Jasper blocks to eight, which significantly reduces model size. This design ensures that attention-enhanced Citrinet reaches comparable inference time on long speech sequences for models ranging from tens to hundreds of layers.
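The sketch below shows what one such attention-enhanced block could look like in PyTorch: a single time-channel separable convolution for local context, multi-head self-attention for global context, and the SE and residual paths kept as in Citrinet. The layer sizes, ordering, and hyperparameters are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class AttentionEnhancedBlock(nn.Module):
    """Hypothetical sketch of one attention-enhanced Citrinet block."""

    def __init__(self, channels=256, kernel_size=11, heads=4):
        super().__init__()
        # time-channel separable convolution = depthwise + pointwise
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        self.pointwise = nn.Conv1d(channels, channels, 1)
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        # squeeze-and-excitation
        self.se = nn.Sequential(nn.Linear(channels, channels // 8), nn.ReLU(),
                                nn.Linear(channels // 8, channels), nn.Sigmoid())

    def forward(self, x):                      # x: (batch, channels, time)
        y = self.pointwise(self.depthwise(x))  # local context via convolution
        y = y.transpose(1, 2)                  # (batch, time, channels)
        y = self.norm(y)
        y, _ = self.attn(y, y, y)              # global context via self-attention
        scale = self.se(y.mean(dim=1))         # squeeze over time, excite channels
        y = (y * scale.unsqueeze(1)).transpose(1, 2)
        return x + y                           # residual connection

x = torch.randn(2, 256, 400)                   # (batch, channels, time frames)
print(AttentionEnhancedBlock()(x).shape)       # torch.Size([2, 256, 400])
```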

Preliminary experiments show that attention-based models converge in 100 to 200 epochs, while Citrinet requires 500 to 1,000 epochs to converge to its best error rates. Experiments on the Japanese CSJ 500-hour dataset show that attention-enhanced Citrinet requires fewer blocks and converges faster, reaching lower character error rates than Citrinet while using 80% of its training time, and than Conformer while using 40% of its training time and 18.5% of its model size.

Summary

In summary, we propose two novel architectures for building end-to-end Japanese ASR models. In one direction, we improved the transformer-based Conformer's training and inference speed while retaining its accuracy, successfully building sparser and deeper Conformer models. In the other, we improved the CNN-based Citrinet's convergence speed and accuracy by introducing a multi-head self-attention mechanism and pruning 80% of the CNN layers. These proposals are general and applicable to other Asian languages.

To learn more about Japanese language ASR models, see Deep Sparse Conformer for Speech Recognition and Attention Enhanced Citrinet for Speech Recognition, or watch the related INTERSPEECH 2022 session, Neural Transducers, Streaming ASR, and Novel ASR Models.

Changing CTC Rules to Reduce Memory Consumption in Training and Decoding

Loss functions for training automatic speech recognition (ASR) models are not set in stone. The older rules of loss functions are not necessarily optimal. Consider connectionist temporal classification (CTC) and see how changing some of its rules enables you to reduce the GPU memory required for training and inference of CTC-based models, and more.

For more information about the technology behind this post, tune into the author’s presentation at INTERSPEECH 2022 on Monday, September 19, 14:30-16:30 (KST), Virtual Poster: Novel models and training methods for ASR II.

Overview of connectionist temporal classification

If you are going to train an ASR model, whether it’s a convolutional or recurrent neural network, a transformer, or a combination, you are most likely training it with the CTC loss. 

CTC is simple and convenient because it does not require per-frame information about “what sound is pronounced when” (so-called audio-text time alignment). In most cases, this knowledge is simply unavailable, as in a typical ASR dataset of audio with associated text with no time marks. 

True time alignment is not always trivial. Suppose that most of the recording is without speech with only one short phrase at the end. CTC loss does not tell the model when exactly to emit a prediction. Instead, it allows for every possible alignment and just regulates the form of these alignments.

Here’s how exactly CTC manages all possible ways to align audio against text. 

First, the target text is tokenized, that is, words are sliced into letters or word pieces. The number of the resulting units (whatever they are) should be less than the number of audio “timeframes”: segments of audio of length 0.01 to 0.08 seconds. 

If there are fewer timeframes than units, then the algorithm fails. In this case, you should make your timeframes shorter. Otherwise, only RNN Transducer can save you. If the timeframes are as many as the units, then there can be only one alignment (a one-in-a-million case). 

Most of the time, there are way more timeframes than units, so some of the frames are left without a unit. For such empty frames, CTC has a special unit: <blank>. This unit tells you that the model has nothing to give you at this particular frame. This may be because there is no speech, or maybe the model is just too lazy to predict something meaningful here. The ability to predict nothing if the model doesn’t want to is provided by the most important rule of CTC: the <blank> rule.

Other rules are related to unit continuation. Suppose that your unit is a vowel that lasts longer than one frame. On which of two frames should the model output the unit? CTC allows for the same unit emission on multiple consecutive frames. But then, the same consecutive units should be merged into one to convert the recognition results into a sequence of units that resembles text. 

Now, what if your tokenized text itself contains the same repeated units like “ll” or “pp”? If left unprocessed, these units are merged into one even if they shouldn’t be. For this special case, CTC has a rule that if the target text has repeated units, then such units must be separated by <blank> during inference.

To sum up, at almost every frame, the model is allowed to emit the same unit as in the previous frame, <blank>, or the next unit if it is different from the previous one. These rules are more sophisticated than the <blank> rule, and they are not strictly necessary for CTC to work.
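The decoding side of these rules is easy to express in code. The minimal sketch below collapses a per-frame unit sequence by merging repeated consecutive units and then dropping <blank>; the example also shows why repeated target units like “ll” must be separated by <blank>, since otherwise they would merge into one.

```python
def ctc_collapse(frame_units, blank="<blank>"):
    """Apply the CTC rules to a per-frame unit sequence:
    merge repeated consecutive units, then drop <blank>."""
    out = []
    prev = None
    for u in frame_units:
        if u != prev:          # rule: repeated consecutive units merge into one
            if u != blank:     # rule: <blank> emits nothing
                out.append(u)
        prev = u
    return out

# "h h <blank> e l <blank> l o" -> ['h', 'e', 'l', 'l', 'o']
print(ctc_collapse(["h", "h", "<blank>", "e", "l", "<blank>", "l", "o"]))
```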

CTC implementation

Here’s how CTC loss can be represented. Like most loss functions in machine learning, it is usually implemented as a dynamic programming algorithm that applies these rules to a training utterance or to the model’s softmax output.

In training, loss values and gradients are computed from the conditional probabilities of all possible alignments by the Baum–Welch algorithm, applied according to the CTC rules. CTC implementations often run to hundreds or even thousands of lines of code and can be hard to modify.

Fortunately, there is another way to implement CTC. The weighted finite-state transducer (WFST) approach, apart from its other application areas, allows a dynamic programming algorithm to be represented as a set of graphs and associated graph operations. This approach enables you to decouple the CTC rules from their application to specific audio and text and from the calculation of loss and gradients.

CTC WFST applications

With WFST, you can easily take the CTC rules and use them with different criteria, like maximum mutual information (MMI). These models usually have a lower word error rate (WER) than CTC models. MMI incorporates prior language information into the training process. 

In contrast to CTC, MMI not only maximizes the probability of the most feasible path but also minimizes the probabilities of every other path. To do this, MMI has a so-called denominator graph, which can occupy a lot of GPU memory during training. Fortunately, some CTC rules can be modified to reduce denominator memory consumption without compromising speech recognition accuracy.

Also, a WFST representation of CTC rules, or a so-called topology, can be used to allow for WFST decoding of CTC models. To do that, you convert an N-gram language model to a WFST graph and compose it with the topology graph. The resulting decoding graph can be passed to, for example, the Riva CUDA WFST Decoder.

Decoding graphs can be large to the point that they may not fit in GPU memory. But with some CTC topology modifications, you can reduce decoding graph sizes for CTC.

CTC topologies

Figure 1 shows the CTC topology, Correct-CTC. This is a complete directed graph with self-loops, so for N units, including <blank>, there are N states and N² arcs.

Correct-CTC is the most commonly used CTC representation. Look at the typical sizes that this topology produces. For the LibriSpeech 4-gram word language model and a model vocabulary of 256 units, the decoding graph size is ~16 GB. Cold-start MMI training on a 32 GB GPU with a model vocabulary size of 2048 is possible only with batch size 1.

A fully-connected WFST with three nodes: 0 for <blank>, 1 for A, and 2 for B. Each node has a self-loop arc with the node’s unit as input and <epsilon> as output. Every node has arcs to all other nodes with arcs’ units according to the destination node.
Figure 1. Correct-CTC example for a three-unit vocabulary: <blank>, A, and B

Reduce the memory consumption induced by Correct-CTC by dropping some of the CTC rules. First, drop the mandatory separation of repeated units with <blank>. Without this rule, you end up with the topology called Compact-CTC (Figure 2). It has 3N – 2 arcs for N units.

A WFST with three nodes: 0 for <blank>, 1 for A, and 2 for B. Each node has a self-loop arc with the node’s unit as input and <epsilon> as output. The <blank> node has arcs to all other nodes with arcs’ units according to the destination node. Other nodes are connected to the <blank> node through (virtual) <epsilon>-arcs.
Figure 2. Compact-CTC example
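A toy enumeration of the arcs makes these sizes concrete. The sketch below lists arcs as simple (from_state, to_state, label) tuples for Correct-CTC and Compact-CTC; it only illustrates the arc counts (N² versus 3N – 2) and is not the WFST format used by NeMo, k2, or Kaldi.

```python
def correct_ctc_arcs(units):
    """Correct-CTC: complete directed graph with self-loops.
    One state per unit (including <blank>); every state has an arc to
    every state, labeled with the destination state's unit."""
    n = len(units)
    return [(i, j, units[j]) for i in range(n) for j in range(n)]

def compact_ctc_arcs(units):
    """Compact-CTC: drop the mandatory <blank> between repeated units.
    Self-loops on every state, <blank> -> unit arcs, and (virtual)
    epsilon arcs from each unit state back to <blank>."""
    n = len(units)                                       # units[0] is <blank>
    arcs = [(i, i, units[i]) for i in range(n)]          # self-loops
    arcs += [(0, j, units[j]) for j in range(1, n)]      # <blank> -> unit
    arcs += [(j, 0, "<eps>") for j in range(1, n)]       # unit -> <blank>
    return arcs

units = ["<blank>", "A", "B"]
print(len(correct_ctc_arcs(units)))   # 9  (N^2)
print(len(compact_ctc_arcs(units)))   # 7  (3N - 2)
# Minimal-CTC (described below) would be a single state with just N self-loop arcs.
```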

Despite having pure epsilon (virtual) arcs, this topology can be used in training and decoding and does not negatively affect the recognition quality. If you’re wondering how this works, see CTC Variations Through New WFST Topologies or the NVIDIA NeMo implementation.

Decoding graph sizes with Compact-CTC are a quarter smaller than with Correct-CTC. Compact-CTC also requires 2x less GPU memory for MMI training.

Now drop the same unit emission on multiple consecutive frames, leaving only the <blank> rule. This way, you end up with a topology with only one state and N arcs for N units.

This is the smallest possible CTC topology, so we call it Minimal-CTC (Figure 3). It requires even less GPU memory for MMI training (4x less compared to Correct-CTC), but the accuracy of an MMI-trained model with the Minimal-CTC topology will degrade compared to the baseline. 

The smallest topology also produces the smallest decoding WFSTs with half the size of the baseline graph. Decoding graphs compiled with Minimal-CTC are incompatible with models built with Correct-CTC or Compact-CTC.

A WFST with a single node and three self-loop arcs for <blank>, A, and B.
Figure 3. Minimal-CTC example

Finally, I come back to Correct-CTC, but this time keep the mandatory separation of repeated units and drop unit continuation. The resulting topology, called Selfless-CTC, was designed to remedy the shortcomings of Minimal-CTC.

Figures 1 and 4 show that Correct-CTC and Selfless-CTC differ only in the non-<blank> self-loops. These two topologies also give the same MMI model accuracy, and an even better one if the model has a long context window. However, Selfless-CTC is also compatible with Minimal-CTC at decoding. You get the 2x graph size reduction from Minimal-CTC at the cost of only a 0.2% higher WER.

A fully connected WFST with three nodes: 0 for <blank>, 1 for A, and 2 for B. The <blank> node has a self-loop arc with input <blank> and output <epsilon>. Every node has arcs to all other nodes with arcs’ units according to the destination node.
Figure 4. Selfless-CTC example; based on Correct-CTC

Conclusion

There are several tips for better performance:

  • Use Compact-CTC instead of Correct-CTC for decoding graph construction and MMI training.
  • For the best decoding graph size reduction, train your models with Selfless-CTC and decode with Minimal-CTC.
  • Loss functions are not set in stone: Experiment with your own WFST representations of existing loss functions and create new ones. It’s fun!

For more information, see CTC Variations Through New WFST Topologies or view the INTERSPEECH session.

Learning to Walk in the Wild from Terrain Semantics

An important promise for quadrupedal robots is their potential to operate in complex outdoor environments that are difficult or inaccessible for humans. Whether it’s to find natural resources deep in the mountains, or to search for life signals in heavily-damaged earthquake sites, a robust and versatile quadrupedal robot could be very helpful. To achieve that, a robot needs to perceive the environment, understand its locomotion challenges, and adapt its locomotion skill accordingly. While recent advances in perceptive locomotion have greatly enhanced the capability of quadrupedal robots, most works focus on indoor or urban environments, thus they cannot effectively handle the complexity of off-road terrains. In these environments, the robot needs to understand not only the terrain shape (e.g., slope angle, smoothness), but also its contact properties (e.g., friction, restitution, deformability), which are important for a robot to decide its locomotion skills. As existing perceptive locomotion systems mostly focus on the use of depth cameras or LiDARs, it can be difficult for these systems to estimate such terrain properties accurately.

In “Learning Semantics-Aware Locomotion Skills from Human Demonstrations”, we design a hierarchical learning framework to improve a robot’s ability to traverse complex, off-road environments. Unlike previous approaches that focus on environment geometry, such as terrain shape and obstacle locations, we focus on environment semantics, such as terrain type (grass, mud, etc.) and contact properties, which provide a complementary set of information useful for off-road environments. As the robot walks, the framework decides the locomotion skill, including the speed and gait (i.e., shape and timing of the legs’ movement) of the robot based on the perceived semantics, which allows the robot to walk robustly on a variety of off-road terrains, including rocks, pebbles, deep grass, mud, and more.

Our framework selects skills (gait and speed) of the robot from the camera RGB image. We first compute the speed from terrain semantics, and then select a gait based on the speed.

Overview
The hierarchical framework consists of a high-level skill policy and a low-level motor controller. The skill policy selects a locomotion skill based on camera images, and the motor controller converts the selected skill into motor commands. The high-level skill policy is further decomposed into a learned speed policy and a heuristic-based gait selector. To decide a skill, the speed policy first computes the desired forward speed, based on the semantic information from the onboard RGB camera. For energy efficiency and robustness, quadrupedal robots usually select a different gait for each speed, so we designed the gait selector to compute a desired gait based on the forward speed. Lastly, a low-level convex model-predictive controller (MPC) converts the desired locomotion skill into motor torque commands, and executes them on the real hardware. We train the speed policy directly in the real world using imitation learning because it requires fewer training data compared to standard reinforcement learning algorithms.

The framework consists of a high-level skill policy and a low-level motor controller.

Learning Speed Command from Human Demonstrations
As the central component in our pipeline, the speed policy outputs the desired forward speed of the robot based on the RGB image from the onboard camera. Although many robot learning tasks can leverage simulation as a source of lower-cost data collection, we train the speed policy in the real world because accurate simulation of complex and diverse off-road environments is not yet available. As policy learning in the real world is time-consuming and potentially unsafe, we make two key design choices to improve the data efficiency and safety of our system.

The first is learning from human demonstrations. Standard reinforcement learning algorithms typically learn by exploration, where the agent attempts different actions in an environment and builds preferences based on the rewards received. However, such explorations can be potentially unsafe, especially in off-road environments, since any robot failures can damage both the robot hardware and the surrounding environment. To ensure safety, we train the speed policy using imitation learning from human demonstrations. We first ask a human operator to teleoperate the robot on a variety of off-road terrains, where the operator controls the speed and heading of the robot using a remote joystick. Next, we collect the training data by storing (image, forward_speed) pairs. We then train the speed policy using standard supervised learning to predict the human operator’s speed command. As it turns out, the human demonstration is both safe and high-quality, and allows the robot to learn a proper speed choice for different terrains.

The second key design choice is the training method. Deep neural networks, especially those involving high-dimensional visual inputs, typically require lots of data to train. To reduce the amount of real-world training data required, we first pre-train a semantic segmentation model on RUGD (an off-road driving dataset where the images look similar to those captured by the robot’s onboard camera), where the model predicts the semantic class (grass, mud, etc.) for every pixel in the camera image. We then extract a semantic embedding from the model’s intermediate layers and use that as the feature for on-robot training. With the pre-trained semantic embedding, we can train the speed policy effectively using less than 30 minutes of real-world data, which greatly reduces the amount of effort required.

We pre-train a semantic segmentation model and extract a semantic embedding to be fine-tuned on robot data.
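A rough sketch of this setup is shown below: a frozen, pre-trained backbone produces the semantic embedding, and a small regression head is trained on (image, forward_speed) pairs with a mean-squared-error imitation loss. The module names, sizes, and stand-in backbone are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class SpeedPolicy(nn.Module):
    """Hypothetical speed policy: frozen semantic backbone + small regression head."""

    def __init__(self, backbone, embed_dim=128):
        super().__init__()
        self.backbone = backbone                       # pre-trained (e.g., on RUGD), frozen
        self.head = nn.Sequential(nn.Linear(embed_dim, 64), nn.ReLU(),
                                  nn.Linear(64, 1))    # predicted forward speed (m/s)

    def forward(self, image):
        with torch.no_grad():
            emb = self.backbone(image)                 # semantic embedding
        return self.head(emb).squeeze(-1)

def train_step(policy, optimizer, images, operator_speeds):
    pred = policy(images)
    loss = nn.functional.mse_loss(pred, operator_speeds)   # imitate the human operator
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))  # stand-in encoder
policy = SpeedPolicy(backbone)
optimizer = torch.optim.Adam(policy.head.parameters(), lr=1e-3)
images = torch.randn(8, 3, 64, 64)        # teleoperation images (toy size)
speeds = torch.rand(8) * 1.5              # operator speed commands in m/s
print(train_step(policy, optimizer, images, speeds))
```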

Gait Selection and Motor Control
The next component in the pipeline, the gait selector, computes the appropriate gait based on the speed command from the speed policy. The gait of a robot, including its stepping frequency, swing height, and base height, can greatly affect the robot’s ability to traverse different terrains.

Scientific studies have shown that animals switch between different gaits at different speeds, and this result is further validated in quadrupedal robots, so we designed the gait selector to compute a robust gait for each speed. Compared to using a fixed gait across all speeds, we find that the gait selector further enhances the robot’s navigation performance on off-road terrains (more details in the paper).
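Conceptually, the gait selector can be as simple as a lookup from speed ranges to gait parameters, as in the sketch below. The thresholds and parameter values are made up for illustration and are not the ones used on the robot.

```python
def select_gait(forward_speed):
    """Heuristic gait selector (illustrative thresholds and parameters only):
    slower speeds get a higher, slower swing; faster speeds a quicker, lower swing."""
    if forward_speed < 0.6:      # e.g., deep grass or mud
        return {"stepping_frequency_hz": 2.0, "swing_height_m": 0.12, "base_height_m": 0.29}
    elif forward_speed < 1.2:    # e.g., gravel
        return {"stepping_frequency_hz": 2.5, "swing_height_m": 0.10, "base_height_m": 0.30}
    else:                        # e.g., asphalt
        return {"stepping_frequency_hz": 3.0, "swing_height_m": 0.08, "base_height_m": 0.31}

print(select_gait(0.5))   # slow, high-swing gait for deformable terrain
print(select_gait(1.4))   # fast, low-swing gait for rigid, flat terrain
```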

The last component of the pipeline is a motor controller, which converts the speed and gait commands into motor torques. Similar to previous work, we use separate control strategies for swing and stance legs. By separating the task of skill learning and motor control, the skill policy only needs to output the desired speed, and does not need to learn low-level locomotion controls, which greatly simplifies the learning process.

Experiment Results
We implemented our framework on an A1 quadrupedal robot and tested it on an outdoor trail with multiple terrain types, including grass, gravel, and asphalt, which pose varying degrees of difficulty for the robot. For example, while the robot needs to walk slowly with high foot swings in deep grass to prevent its foot from getting stuck, on asphalt it can walk much faster with lower foot swings for better energy efficiency. Our framework captures such differences and selects an appropriate skill for each terrain type: slow speed (0.5m/s) on deep grass, medium speed (1m/s) on gravel, and high speed (1.4m/s) on asphalt. It completes the 460m-long trail in 9.6 minutes with an average speed of 0.8m/s (i.e., that’s 1.8 miles or 2.9 kilometers per hour). In contrast, non-adaptive policies either cannot complete the trail safely or walk significantly slower (0.5m/s), illustrating the importance of adapting locomotion skills based on the perceived environments.

The framework selects different speeds based on conditions of the trail.

To test generalizability, we also deployed the robot to a number of trails that are not seen during training. The robot traverses through all of them without failure, and adjusts its locomotion skills based on terrain semantics. In general, the skill policy selects a faster skill on rigid and flat terrains and a slower speed on deformable or uneven terrain. At the time of writing, the robot has traversed over 6km of outdoor trails without failure.

With the framework, the robot walks safely on a variety of outdoor terrains not seen during training.

Conclusion
In this work, we present a hierarchical framework to learn semantic-aware locomotion skills for off-road locomotion. Using less than 30 minutes of human demonstration data, the framework learns to adjust the speed and gait of the robot based on the perceived semantics of the environment. The robot can walk safely and efficiently on a wide variety of off-road terrains. One limitation of our framework is that it only adjusts locomotion skills for standard walking and does not support more agile behaviors such as jumping, which can be essential for traversing more difficult terrains with gaps or hurdles. Another limitation is that our framework currently requires manual steering commands to follow a desired path and reach the goal. In future work, we plan to look into a deeper integration of high-level skill policy with the low-level controller for more agile behaviors, and incorporate navigation and path planning into the framework so that the robot can operate fully autonomously in challenging off-road environments.

Acknowledgements
We would like to thank our paper co-authors: Xiangyun Meng, Wenhao Yu, Tingnan Zhang, Jie Tan, and Byron Boots. We would also like to thank the team members of Robotics at Google for discussions and feedback.

Top Financial Services Sessions at NVIDIA GTC 2022

Discover how Deutsche Bank, U.S. Bank, Capital One, and other firms are using AI technologies to optimize customer experience in financial services through recommender systems, NLP, and more.

Calculating and Synchronizing Time with the Precision Timing Protocol on the NVIDIA Spectrum Switch

PTP uses an algorithm and method for synchronizing clocks on various devices across packet-based networks to provide submicrosecond accuracy. NVIDIA Spectrum supports PTP in both one-step and two-step modes and can serve either as a boundary or a transparent clock.

Here’s how the switch calculates and synchronizes time in one-step mode when acting as a transparent clock. Later in this post, I review overall PTP accuracy.

Calculating and synchronizing time in one-step mode

In one-step mode when acting as a transparent clock, the switch must calculate the residence time of a PTP packet in real time. It does this by comparing the time of the packet’s arrival (t1) with the time of the packet’s egress (t2). The switch then changes the correction field of the packet accordingly.
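In software terms, the transparent-clock update amounts to adding the residence time to the packet's correctionField, which IEEE 1588 expresses in units of 2^-16 nanoseconds. The sketch below is a simplified illustration of what the switch does in hardware; the packet is represented as a plain dictionary.

```python
def update_correction_field(ptp_packet, t_ingress_ns, t_egress_ns):
    """One-step transparent clock behavior, sketched: add the packet's
    residence time in the switch to the PTP correctionField so downstream
    clocks can account for switch-induced delay."""
    residence_ns = t_egress_ns - t_ingress_ns
    # correctionField is expressed in units of 2^-16 nanoseconds in IEEE 1588
    ptp_packet["correctionField"] += residence_ns * (1 << 16)
    return ptp_packet

pkt = {"correctionField": 0}
print(update_correction_field(pkt, t_ingress_ns=1_000, t_egress_ns=1_450))
# {'correctionField': 29491200}  i.e., 450 ns of residence time in 2^-16 ns units
```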

To perform this calculation, the switch uses several hardware features:

  • A synchronized clock across the ASIC
  • An accurate timestamp as the packet enters the switch
  • A calculation of the time at which the packet will egress the switch

A synchronized clock across the ASIC

Because t1 at ingress and t2 at egress are on two different switch ports, time synchronization between different parts of the ASIC must be of high resolution to maintain an accurate comparison.

Having synchronized timestamps between different hardware units that sometimes work in different frequencies is challenging. The Spectrum family of ASICs can maintain synchronization errors smaller than 4 nanoseconds.

An accurate timestamp as the packet enters the switch

To achieve accurate one-step PTP, the switch must record the exact time at which it receives the packet.

As the switch receives bits from the line, it must assemble them, then parse and recognize the packet as PTP. This process takes time and must be considered so that there is no difference between the timestamp on the packet and the actual time at which the bits enter the switch.

To solve this challenge, the switch includes a designated hardware counter that calculates the number of bits between the line and the packet assembly. This counter can be translated to latency according to the protocol, then subtracted from the t1 timestamp to find the exact arrival time of the packet.

A calculation of the time at which the packet will egress the switch

Calculating the time at which the packet will egress the switch in advance is also a challenge. This is because the latency is typically affected by queuing and other parameters that are not accessible when the switch calculates the timestamp.

To solve this challenge, the switch schedules a future time for the packet to egress, then timestamps the packet according to this time. The PTP packet must then wait until the exact time to egress.

Schematic drawing of the PTP packet modification.
Figure 1. PTP packet modification in the switch

PTP scale

Other vendors use software to match a PTP packet with its timestamp. The NVIDIA Spectrum-2 and later ASICs take a different approach. They handle PTP flows completely in hardware; nothing is required from software. There are many advantages to this implementation.

The Spectrum approach scales better for PTP flows and there is no burden on the switch’s limited compute resources. The scale is only limited by the CPU host capabilities when acting as a boundary clock. For a transparent clock, where no software is involved, there is technically no limit on the scale.

Software processing is serial and slower than hardware. Therefore, a PTP packet resides on the switch much longer if it requires software intervention. This process increases the delay between the primary and the follower entities in the network and can indirectly damage the synchronization process that assumes constant traversing time from point to point in the network.

PTP accuracy

The overall PTP accuracy of the NVIDIA Spectrum switch is around 10 nanoseconds. This accuracy is maintained for all speeds and FEC configurations.

The following graph demonstrates PTP accuracy on the Spectrum-3 switch.

Schematic drawing of IXIA serving as primary and follower connected to the Spectrum-3 switch.
Figure 2. Setup used to measure PTP accuracy
Graph of offset between primary and follower port of IXIA.
Figure 3. Offset from primary in nanoseconds

These results are taken from a one-hour test at a speed of 50 Gbps in which IXIA serves as a leader clock connected to an NVIDIA Spectrum-3 switch. The switch acts as a boundary clock. Another IXIA port serves as a follower and measures the time offset compared to the primary for each packet.

Top Omniverse Sessions for Developers at GTC 2022

Learn how to develop and distribute custom applications for the metaverse with the NVIDIA Omniverse platform at GTC.

A Multi-Axis Approach for Vision Transformer and MLP Models

Convolutional neural networks have been the dominant machine learning architecture for computer vision since the introduction of AlexNet in 2012. Recently, inspired by the evolution of Transformers in natural language processing, attention mechanisms have been prominently incorporated into vision models. These attention methods boost some parts of the input data while minimizing other parts so that the network can focus on small but important parts of the data. The Vision Transformer (ViT) has created a new landscape of model designs for computer vision that is completely free of convolution. ViT regards image patches as a sequence of words, and applies a Transformer encoder on top. When trained on sufficiently large datasets, ViT demonstrates compelling performance on image recognition.

While convolutions and attention are both sufficient for good performance, neither of them are necessary. For example, MLP-Mixer adopts a simple multi-layer perceptron (MLP) to mix image patches across all the spatial locations, resulting in an all-MLP architecture. It is a competitive alternative to existing state-of-the-art vision models in terms of the trade-off between accuracy and computation required for training and inference. However, both ViT and the MLP models struggle to scale to higher input resolution because the computational complexity increases quadratically with respect to the image size.

Today we present a new multi-axis approach that is simple and effective, improves on the original ViT and MLP models, can better adapt to high-resolution, dense prediction tasks, and can naturally adapt to different input sizes with high flexibility and low complexity. Based on this approach, we have built two backbone models for high-level and low-level vision tasks. We describe the first in “MaxViT: Multi-Axis Vision Transformer”, to be presented in ECCV 2022, and show it significantly improves the state of the art for high-level tasks, such as image classification, object detection, segmentation, quality assessment, and generation. The second, presented in “MAXIM: Multi-Axis MLP for Image Processing” at CVPR 2022, is based on a UNet-like architecture and achieves competitive performance on low-level imaging tasks including denoising, deblurring, dehazing, deraining, and low-light enhancement. To facilitate further research on efficient Transformer and MLP models, we have open-sourced the code and models for both MaxViT and MAXIM.

A demo of image deblurring using MAXIM frame by frame.

Overview
Our new approach is based on multi-axis attention, which decomposes the full-size attention (each pixel attends to all the pixels) used in ViT into two sparse forms — local and (sparse) global. As shown in the figure below, the multi-axis attention contains a sequential stack of block attention and grid attention. The block attention works within non-overlapping windows (small patches in intermediate feature maps) to capture local patterns, while the grid attention works on a sparsely sampled uniform grid for long-range (global) interactions. The window sizes of grid and block attentions can be fully controlled as hyperparameters to ensure a linear computational complexity to the input size.

The proposed multi-axis attention conducts blocked local and dilated global attention sequentially followed by a FFN, with only a linear complexity. The pixels in the same colors are attended together.
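The two partitions can be expressed as simple reshapes. The NumPy sketch below illustrates the idea: block partitioning groups neighboring pixels into non-overlapping windows, while grid partitioning groups pixels that are strided far apart so that attention within one group spans the whole image. It follows the spirit of the paper's partition functions but is only an illustration.

```python
import numpy as np

def block_partition(x, window):
    """Split a (H, W, C) feature map into non-overlapping windows for
    *block* (local) attention: output shape (num_windows, window*window, C)."""
    H, W, C = x.shape
    x = x.reshape(H // window, window, W // window, window, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, window * window, C)

def grid_partition(x, grid):
    """Group pixels on a sparse uniform grid for *grid* (global) attention:
    tokens that are H//grid rows and W//grid columns apart attend together."""
    H, W, C = x.shape
    x = x.reshape(grid, H // grid, grid, W // grid, C)
    return x.transpose(1, 3, 0, 2, 4).reshape(-1, grid * grid, C)

# Attention then runs within each group, so the cost per token depends only on
# the window or grid size, not on the image resolution (linear in image size).
x = np.zeros((16, 16, 8))
print(block_partition(x, window=4).shape)  # (16, 16, 8): 16 windows of 16 tokens
print(grid_partition(x, grid=4).shape)     # (16, 16, 8): 16 groups of 16 strided tokens
```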

Such low-complexity attention can significantly improve its wide applicability to many vision tasks, especially for high-resolution visual predictions, demonstrating greater generality than the original attention used in ViT. We build two backbone instantiations out of this multi-axis attention approach – MaxViT and MAXIM, for high-level and low-level tasks, respectively.

MaxViT
In MaxViT, we first build a single MaxViT block (shown below) by concatenating MBConv (proposed by EfficientNet, V2) with the multi-axis attention. This single block can encode local and global visual information regardless of input resolution. We then simply stack repeated blocks composed of attention and convolutions in a hierarchical architecture (similar to ResNet, CoAtNet), yielding our homogenous MaxViT architecture. Notably, MaxViT is distinguished from previous hierarchical approaches as it can “see” globally throughout the entire network, even in earlier, high-resolution stages, demonstrating stronger model capacity on various tasks.

The meta-architecture of MaxViT.

MAXIM
Our second backbone, MAXIM, is a generic UNet-like architecture tailored for low-level image-to-image prediction tasks. MAXIM explores parallel designs of the local and global approaches using the gated multi-layer perceptron (gMLP) network (a patch-mixing MLP with a gating mechanism). Another contribution of MAXIM is the cross-gating block that can be used to apply interactions between two different input signals. This block can serve as an efficient alternative to the cross-attention module as it only employs the cheap gated MLP operators to interact with various inputs without relying on the computationally heavy cross-attention. Moreover, all the proposed components including the gated MLP and cross-gating blocks in MAXIM enjoy linear complexity to image size, making it even more efficient when processing high-resolution pictures.

Results
We demonstrate the effectiveness of MaxViT on a broad range of vision tasks. On image classification, MaxViT achieves state-of-the-art results under various settings: with only ImageNet-1K training, MaxViT attains 86.5% top-1 accuracy; with ImageNet-21K (14M images, 21k classes) pre-training, MaxViT achieves 88.7% top-1 accuracy; and with JFT (300M images, 18k classes) pre-training, our largest model MaxViT-XL achieves a high accuracy of 89.5% with 475M parameters.

Performance comparison of MaxViT with state-of-the-art models on ImageNet-1K. Top: Accuracy vs. FLOPs performance scaling with 224×224 image resolution. Bottom: Accuracy vs. parameters scaling curve under ImageNet-1K fine-tuning setting.

For downstream tasks, MaxViT as a backbone delivers favorable performance on a broad spectrum of tasks. For object detection and segmentation on the COCO dataset, the MaxViT backbone achieves 53.4 AP, outperforming other base-level models while requiring only about 60% of the computational cost. For image aesthetics assessment, the MaxViT model advances the state-of-the-art MUSIQ model by 3.5% in terms of linear correlation with human opinion scores. The standalone MaxViT building block also demonstrates effective performance on image generation, achieving better FID and IS scores on the ImageNet-1K unconditional generation task with a significantly lower number of parameters than the state-of-the-art model, HiT.

The UNet-like MAXIM backbone, customized for image processing tasks, has also demonstrated state-of-the-art results on 15 out of 20 tested datasets, including denoising, deblurring, deraining, dehazing, and low-light enhancement, while requiring fewer or a comparable number of parameters and FLOPs than competitive models. Images restored by MAXIM show more recovered details with fewer visual artifacts.

Visual results of MAXIM for image deblurring, deraining, and low-light enhancement.

Summary
Recent works in the last two or so years have shown that ConvNets and Vision Transformers can achieve similar performance. Our work presents a unified design that takes advantage of the best of both worlds — efficient convolution and sparse attention — and demonstrates that a model built on top, namely MaxViT, can achieve state-of-the-art performance on a variety of vision tasks. More importantly, MaxViT scales well to very large data sizes. We also show that an alternative multi-axis design using MLP operators, MAXIM, achieves state-of-the-art performance on a broad range of low-level vision tasks.

Even though we present our models in the context of vision tasks, the proposed multi-axis approach can easily extend to language modeling to capture both local and global dependencies in linear time. Motivated by the work here, we expect that it is worthwhile to study other forms of sparse attention in higher-dimensional or multimodal signals such as videos, point clouds, and vision-language models.

We have open-sourced the code and models of MAXIM and MaxViT to facilitate future research on efficient attention and MLP models.

Acknowledgments
We would like to thank our co-authors: Hossein Talebi, Han Zhang, Feng Yang, Peyman Milanfar, and Alan Bovik. We would also like to acknowledge the valuable discussion and support from Xianzhi Du, Long Zhao, Wuyang Chen, Hanxiao Liu, Zihang Dai, Anurag Arnab, Sungjoon Choi, Junjie Ke, Mauricio Delbracio, Irene Zhu, Innfarn Yoo, Huiwen Chang, and Ce Liu.

NVIDIA Hopper Sweeps AI Inference Benchmarks in MLPerf Debut

In their debut on the MLPerf industry-standard AI benchmarks, NVIDIA H100 Tensor Core GPUs set world records in inference on all workloads, delivering up to 4.5x more performance than previous-generation GPUs. The results demonstrate that Hopper is the premium choice for users who demand utmost performance on advanced AI models. Additionally, NVIDIA A100 Tensor Core…

Full-Stack Innovation Fuels Highest MLPerf Inference 2.1 Results for NVIDIA

Today’s AI-powered applications are enabling richer experiences, fueled by both larger and more complex AI models as well as the application of many models in a pipeline. To meet the increasing demands of AI-infused applications, an AI platform must not only deliver high performance but also be versatile enough to deliver that performance across a diverse range of AI models. To maximize infrastructure utilization and optimize CapEx, the ability to run the entire AI workflow on the same infrastructure is critical: from data prep and model training to deployed inference.

MLPerf benchmarks have emerged as industry-standard, peer-reviewed measures of deep learning performance, covering AI training, AI inference, and high-performance computing (HPC). MLPerf Inference 2.1, the latest iteration of the MLPerf Inference benchmark suite, covers a breadth of common AI use cases including recommenders, natural language processing, speech recognition, medical imaging, image classification, and object detection.

In this round, NVIDIA made its first MLPerf submissions on the latest NVIDIA H100 Tensor Core GPU based on the breakthrough NVIDIA Hopper Architecture.

  • H100 set new per-accelerator records on all data center tests, demonstrating up to 4.5x higher inference performance compared to the NVIDIA A100 Tensor Core GPU.
  • A100 continued to demonstrate excellent performance across the full suite of MLPerf Inference 2.1 tests for both the data center and edge inference scenarios.

NVIDIA Jetson AGX Orin, built for edge AI and robotics applications, also delivered up to a 50% improvement in performance-per-watt following its debut in the prior round of MLPerf Inference, and ran all edge workloads and scenarios.

Delivering these performance results required deep software and hardware co-optimization. In this post, we discuss the results and then dive into some of the key software optimizations.

NVIDIA H100 Tensor Core technology

On a per-streaming multiprocessor (SM) basis, the H100 Tensor Cores provide twice the matrix multiply-accumulate (MMA) throughput clock-for-clock of the A100 SMs when using the same data types and four times the throughput when comparing FP16 on an A100 SM to FP8 on an H100 SM. New kernels had to be developed in order to leverage several of the H100’s new capabilities and to take advantage of these dramatically faster Tensor Cores.

The H100 Tensor Cores process data so rapidly that it can be challenging to keep them both fed with enough input data and to post-process their output data. Kernels must create an efficient pipeline such that data loading, Tensor Core processing, post-processing, and storage all happen simultaneously and efficiently.

The new H100 asynchronous transaction barriers are instrumental to the efficiency of these pipelines. The asynchronous barriers allow producer threads to run ahead after signaling data availability. In the case of data loading threads, this provides significant improvement in kernels’ ability to hide memory system latencies and ensure a steady stream of input data is available for the Tensor Cores. The asynchronous transaction barriers also provide an efficient mechanism for consumer threads to wait on resource availability so that they don’t waste SM resources in spin loops.

The Tensor Memory Accelerator (TMA) further turbocharges these kernels. The TMA was designed to natively integrate into asynchronous pipelines, and provides for the asynchronous transfer of multi-dimensional tensors from global memory into the SM’s shared memory.

The Tensor Cores are so fast that operations like address calculation can become a performance bottleneck; the TMA offloads this work so that the kernels can focus on running the math and post-processing as quickly as possible.

Finally, the new kernels employ H100 thread block clusters to exploit locality at the GPU processing cluster (GPC). The thread blocks within each thread block cluster collaborate to load data more efficiently and provide higher input bandwidth to the Tensor Cores.

NVIDIA H100 Tensor Core GPU performance results

Starting with the Data Center category, the NVIDIA H100 Tensor Core GPU delivered the highest per-accelerator performance on every workload across both the Server and Offline scenarios, delivering up to 4.5x more performance in the Offline scenario and up to 3.9x more performance in the Server scenario than the A100 Tensor Core GPU.

Left bar chart shows the H100 delivering up to 3.9x more performance than the A100 in the Server scenario. Right chart shows H100 delivering up to 4.5x more performance than A100 in the Offline scenario.
Figure 1. H100 delivers up to 4.5x more performance than A100 in the MLPerf Inference 2.1 Data Center category

Compared to a CPU-only submission, the H100 Tensor Core GPU provides up to 36x higher performance.

Thanks to full-stack improvements, NVIDIA Jetson AGX Orin turned in large improvements in energy efficiency compared to the last round, delivering up to a 50% efficiency improvement.

Left bar chart shows the queries per watt improvements in the Offline scenario. Right chart shows the energy per stream improvements in the Single Stream and Multi Stream scenarios in the AGX Jetson Orin MLPerf Inference 2.1 submission compared to the MLPerf Inference 2.0 submission.
Figure 2. Efficiency improvements in the NVIDIA Jetson AGX Orin MLPerf Inference 2.1 compared to the prior submission

Here’s a closer look at the software optimizations that made these results possible.

High-performance BERT inference using FP8

Diagram shows the steps to perform FP8 inference on BERT.
Figure 3. FP8 Inference on BERT using E4M3 offers increased stability for the forward pass

The NVIDIA Hopper Architecture incorporates new fourth-generation Tensor Cores with support for two new FP8 data types: E4M3 and E5M2. These new data types increase Tensor Core throughput by 2x and reduce memory requirements by 2x compared to 16-bit floating-point.

The E4M3 offers an additional mantissa bit, which leads to increased stability in the first step of the calculation process, known as the forward pass. The additional exponent bit of E5M2 is more helpful for preventing overflow/underflow during the backward pass. For our BERT FP8 submission, we used E4M3.

Our experiments on NLP models like BERT showed that when quantizing the model from a higher precision (FP32) to a lower precision (such as FP8 or INT8), the drop in accuracy observed with FP8 is lower than that of INT8.

Although we can use quantization aware training (QAT) to recover some of the model accuracy with INT8, the accuracy of INT8 under post training quantization (PTQ) remains a challenge. This is where FP8 is beneficial: It can provide 99.9% accuracy of the FP32 model under PTQ without the additional cost and effort required to run QAT. As a result, FP8 can be used for the 99.9% high accuracy category of MLPerf where previously FP16 was required. In essence, FP8 delivered the performance of INT8 with the accuracy of FP16 for this workload.
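The intuition for why FP8 holds up better than INT8 under PTQ is that a floating-point format keeps roughly constant relative precision across magnitudes, whereas INT8 spends its 256 levels uniformly and loses small values. The toy NumPy experiment below illustrates this; the E4M3 rounding is a crude simulation (clip to the format's roughly ±448 range, keep about four bits of mantissa precision), not a bit-exact FP8 cast, and the synthetic weights are just one example of a long-tailed distribution.

```python
import numpy as np

def fake_fp8_e4m3(x):
    """Crude simulation of FP8 E4M3 rounding (ignores subnormals/NaN):
    clip to the ~±448 dynamic range and keep 3 explicit mantissa bits
    (4 bits of precision with the implicit leading bit)."""
    x = np.clip(x, -448.0, 448.0)
    m, e = np.frexp(x)                  # m in [0.5, 1)
    m = np.round(m * 16.0) / 16.0       # 2^-4 mantissa steps
    return np.ldexp(m, e)

def fake_int8(x, scale):
    """Symmetric INT8 post-training quantization with a per-tensor scale."""
    return np.clip(np.round(x / scale), -127, 127) * scale

# Long-tailed synthetic "weights": FP8 keeps relative error roughly constant,
# while INT8 rounds small values away, inflating its mean relative error.
w = np.random.randn(4096).astype(np.float32) * np.exp(np.random.randn(4096))
scale = np.abs(w).max() / 127.0
print("fp8  mean rel. error:", np.mean(np.abs(fake_fp8_e4m3(w) - w) / (np.abs(w) + 1e-8)))
print("int8 mean rel. error:", np.mean(np.abs(fake_int8(w, scale) - w) / (np.abs(w) + 1e-8)))
```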

In the NVIDIA BERT submission, all fully connected and matrix multiply layers in the encoder used FP8 precision. The implementation of these layers used cuBLASLt to perform the FP8 GEMMs on the H100 Tensor Cores.

Key BERT optimizations were extended to support FP8, including the following:

  • Removing padding: Inputs to BERT have variable sequence lengths and are padded to a maximum sequence length. We strip the padding to avoid wasting compute on pad positions, and then reconstruct the padded sequences for the final output to match the input shape (see the sketch after this list).
  • Fused multi-head attention: This is a fusion of four operations: transpose Q/K, Q*K, softmax, and QK*V to compute the attention. Fused multi-head attention enhances memory efficiency and skips the padding to avoid useless computation, providing roughly a 2x end-to-end speedup.
  • Activation fusion: We fuse the GEMM with more operations, including bias and activation functions (GeLU). This fusion also helps enhance memory efficiency by removing extra memory transfers.
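As a rough illustration of the padding-removal idea from the first bullet, the sketch below packs a padded BERT-style batch into one un-padded tensor and scatters the results back afterward; shapes and sequence lengths are illustrative.

```python
import numpy as np

def remove_padding(batch, seq_lens):
    """Pack a padded batch (B, T_max, H) into one un-padded (sum_lens, H)
    tensor so the encoder never computes on pad positions."""
    return np.concatenate([batch[i, :l] for i, l in enumerate(seq_lens)], axis=0)

def restore_padding(packed, seq_lens, max_len):
    """Scatter the packed results back to the padded (B, T_max, H) layout
    expected by the benchmark harness."""
    hidden = packed.shape[-1]
    out = np.zeros((len(seq_lens), max_len, hidden), dtype=packed.dtype)
    offset = 0
    for i, l in enumerate(seq_lens):
        out[i, :l] = packed[offset:offset + l]
        offset += l
    return out

batch = np.random.randn(3, 384, 768).astype(np.float32)   # BERT-style padded batch
seq_lens = [57, 201, 384]
packed = remove_padding(batch, seq_lens)                   # (642, 768): ~44% less work
restored = restore_padding(packed, seq_lens, 384)
print(packed.shape, restored.shape)
```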

RetinaNet for object detection

In MLPerf Inference 2.1, a new one-stage object detection model named RetinaNet was added. This replaced the ssd-resnet34 and ssd-mobilenet workloads of MLPerf Inference 2.0. This updated model architecture and its new inference dataset bring new challenges in delivering fast, accurate, and power-efficient inference.

NVIDIA submitted RetinaNet results across all platforms, demonstrating the breadth of our software support.

RetinaNet is trained and inferred using the Open Images dataset, which contains an order of magnitude more object categories and object annotations than the COCO dataset used earlier. For RetinaNet, 264 unique classes are selected for training and inference tasks. This is significantly more than the 81 classes used for ssd-resnet34.

Photo from the OpenImage Dataset shows brightly colored bounding boxes around objects and people in a theater.
Figure 4. The OpenImage Dataset used for RetinaNet training and inference includes highly detailed object annotations

Although RetinaNet is also a single-shot object detection model, it has several key differences compared to ssd-resnet34:

  • RetinaNet uses Feature Pyramid Network (FPN) as its backbone on top of a feedforward ResNeXt architecture. ResNeXt uses group convolution in its computation blocks and has different math characteristics from those of ResNet34.
  • For every image, 120,087 boxes and 264 unique class scores per box are fed into the non-maximum suppression (NMS) layer, and the top 1,000 scoring boxes are selected for output. In ssd-resnet34, these numbers were roughly 25x smaller: 15,130 boxes, 81 classes per box, and a topK of 200.

The RetinaNet model architecture illustrated with the ResNeXt feed-forward network, Feature Pyramid Network (FPN), and Class/Box subnet (K=264, A=9)
Figure 5. MLPerf Inf 2.1 RetinaNet model architecture

NVIDIA used TensorRT as the backend for RetinaNet. TensorRT significantly accelerates inference throughput by automatically optimizing both the graph execution and layer execution:

  • TensorRT provides full support to execute the model inference in mixed FP32/INT8 precision, with minimal accuracy loss compared to FP16 and FP32 precision.
  • TensorRT automatically selects optimized kernels for group convolutions across all 16 ResNeXt blocks.
  • TensorRT provides fusion patterns for convolution, activation, and (optional) pooling layers, which optimize the memory movement for faster inference by merging the layer weights and reducing the number of operations.
  • For the post-processing NMS layer, NVIDIA leverages EfficientNMS, which is an open-sourced high-performance CUDA kernel specialized for NMS tasks, provided as a TensorRT plugin.

NVIDIA Jetson AGX Orin optimizations

NVIDIA Jetson AGX Orin is the latest NVIDIA platform for edge AI and robotics applications. In this round of MLPerf Inference, Jetson AGX Orin demonstrated excellent performance and energy efficiency improvements across the breadth of MLPerf Inference 2.1 edge workloads. Improvements included a 45% reduction in ResNet-50 multi-stream latency and a 17% boost in BERT offline throughput over the previous round (v2.0). In the power submission, Orin achieved up to a 52% power reduction and a 48% perf-per-watt improvement on selected benchmarks.

The submissions used the 22.08 Jetson CUDA-X AI Developer Preview software, which includes an optimized NVIDIA Jetson Linux (L4T) image, TensorRT 8.5.0, CUDA 11.4.14, and cuDNN 8.5.0, allowing customers to easily benefit from these same improvements. RetinaNet is fully supported and performant on Jetson AGX Orin with this software stack, demonstrating the ability of the NVIDIA platform and software to support performant DL inference out of the box.

NVIDIA Orin performance improvements

The significant improvement in MLPerf Inference v2.1 came from the general performance boost enabled by both the system image and TensorRT 8.5 in the 22.08 Jetson CUDA-X AI Developer Preview. The optimized Jetson L4T image provides users access to MaxN power mode, which boosts the frequencies of both the GPU and the DLA units. Meanwhile, this image offers the option to use an enlarged page size of 64K, which can reduce TLB cache misses when running certain inference workloads. Furthermore, the 3.10.1 DLA compiler natively included in the image incorporates a series of optimization features, which increase the performance of workloads running on the Orin DLA by up to 53%.

TensorRT 8.5 includes two new optimizations that improve inference performance. The first is native support for cuDLA, which removes the need to insert copy nodes between DLA nodes and GPU nodes. We observed approximately a 1.8% end-to-end improvement for DLA engines when switching from NVMedia to cuDLA. The second is the addition of optimized kernels for small channel * filter size convolutions fused with a beta=1 residual connection. This improved BERT performance by 17% and ResNet-50 by 5% on the GPU in Orin.

NVIDIA Orin energy efficiency improvements

The NVIDIA Orin power submission benefited from all the above performance improvements and also focused on further power reduction. With the updated L4T image for Orin, power consumption is reduced by fine-tuning the CPU, GPU, and DLA frequencies per benchmark to achieve the optimal perf-per-watt. This image also enables new platform power-saving features such as regulator auto phase shedding and low-power states in low-load conditions. The flexibility of USB-C support in Orin was leveraged to consolidate all I/O through Ethernet-over-USB communication. System power was further reduced by disabling I/O subsystems such as Ethernet, WiFi, and DP that are not essential for inference, and by using off-the-shelf, higher-efficiency GaN power adapters.

These platform and software optimizations reduced system power consumption by up to 52% and improved perf-per-watt by up to 48% over our previous submission in 2.0.

3D U-Net performance improvement

In MLPerf Inference v2.0, the 3D U-Net medical imaging workload switched to the KITS19 dataset, which increased image sizes by up to 8x and raised the amount of compute processing required for a given sample up to 18x due to the sliding window inference. For more information about the NVIDIA MLPerf Inference v2.0 submission, see Getting the Best Performance on MLPerf Inference 2.0.

For MLPerf Inference 2.1, we further improved the performance of the first convolution layer with a TensorRT IPluginV2DynamicExt plugin.

KiTS19 images are single-channel tensors, which challenges the performance of the very first 3D convolution in 3D U-Net. In 3D convolution, the channel dimension typically contributes to the GEMM's K dimension. This is especially relevant because the overall performance of 3D U-Net is dominated by the first two and last two 3D convolutions. In MLPerf Inference v2.0, these four convolutions contributed roughly 38% of the entire network runtime, with the very first layer responsible for 8%. A non-trivial factor explaining this is the need to use zero-padding to accommodate the NC/32DHW32 vectorized format layout, in which the Tensor Cores can be used most efficiently.

With our updated plugin, we use an INT8 linear format to enable efficient computation on this single-channel-limited 3D input. The advantages of this are twofold:

  • Higher effective use of FLOPs, by not performing unneeded computations
  • PCIe transfer bandwidth savings, by avoiding the overhead of either moving a zero-padded input tensor between host and GPU memory or zero-padding on the GPU before sending the input tensor to TensorRT

This optimization improved the first layer performance by 2.7x. Additionally, the slicing kernel no longer needs to deal with zero-padding, and therefore its performance also improved by 2x. As a net result, 3D-UNet’s end-to-end performance improved by 5% in MLPerf Inference 2.1.

Breaking performance records across workloads

In MLPerf Inference 2.1, the first NVIDIA H100 submission set new per-accelerator performance records on all workloads in the data center scenario, and delivered up to 4.5x higher performance than the A100 and 36x higher performance than leading CPUs. This generational performance uplift was possible due to both the many breakthroughs of the NVIDIA Hopper Architecture as well as immense software optimizations that take advantage of those capabilities.

NVIDIA Jetson AGX Orin saw an up to 50% boost in energy efficiency in just one round and it continues to deliver overall inference performance leadership for edge AI and robotics applications.

This latest round of MLPerf Inference showcases the leading performance and versatility of the NVIDIA AI platform for the full breadth of AI workloads and scenarios. With the H100 Tensor Core GPU, we are supercharging the NVIDIA AI platform for the most advanced models and providing users with new levels of performance and capabilities for the most demanding workloads.

For more information, see NVIDIA Hopper Architecture In-Depth.

Upcoming Event: Recommender Systems Sessions at GTC 2022

Learn about transformer-powered personalized online advertising, cross-framework model evaluation, the NVIDIA Merlin ecosystem, and more with these featured GTC 2022 sessions.