GPU-accelerated processing is vital to many automotive and embedded systems. Safety-critical and real-time applications have different requirements and deployment priorities than consumer applications, but they often are developed using GPU APIs that have been primarily designed for use in games.
Vulkan SC (Safety Critical) is a newly released open standard to streamline the use of GPUs in markets where functional safety and hitch-free performance are essential.
NVIDIA helped lead the creation of the Vulkan SC 1.0 API and is now shipping production drivers on its NVIDIA DRIVE and NVIDIA Jetson platforms.
Deterministic GPU processing
Vulkan is a royalty-free open standard from the Khronos Group standards organization. It is the only modern, cross-platform GPU API. Launched in 2016, Vulkan is primarily designed for use in games and professional design applications on desktop and mobile devices using Windows, Linux, and Android.
Khronos derived Vulkan SC from Vulkan 1.2, with the Vulkan SC 1.0 specification being released in March 2022. Vulkan SC defines the subset of the Vulkan API that is essential for embedded markets in order to reduce API surface area for streamlined implementation and testing.
Vulkan SC also increases API robustness by eliminating ignored parameters and undefined behaviors, and by enhancing the detection, reporting, and correction of run-time faults. Vulkan SC enables predictable, hitch-free execution by moving pipeline compilation offline and by providing functionality for static memory allocation and resource management with explicit synchronization.
Vulkan SC and the NVIDIA DRIVE automotive platform
The streamlined Vulkan SC API reduces the cost and effort of system-level safety certification to standards such as ISO 26262, a functional safety standard used in the automotive industry. Simplifying system certification enables manufacturers to smoothly deploy advanced graphics capabilities in driver assistance systems on the NVIDIA DRIVE platform.
For example, Level 2 and Level 3 AI-assisted vehicles require the driver to remain in the loop during vehicle operation. Safe visualization inside the cockpit and the digital instrument cluster is key to ensuring the human driver is aware of how the system is perceiving and reacting to the surrounding environment.
The confidence view is a rendering of the mind of the vehicle’s AI and how it sees the world. It shows exactly what the sensor suite and perception system are detecting in real time using a 3D surround model. By incorporating this view in the cabin interior, the vehicle can communicate to its occupants the accuracy and reliability of the autonomous driving system at every step of the journey.
The ability to support such in-vehicle graphics safely and securely is what makes Vulkan SC critical to the next-generation intelligent vehicle experience. Production Vulkan SC 1.0 drivers are included in DRIVE OS 6.0.4.0, which shipped August 29, 2022.
Vulkan SC on the NVIDIA Jetson embedded platform
NVIDIA Jetson is the world’s leading platform for autonomous machines and other embedded applications. It includes Jetson modules, which are small form-factor, high-performance computers, the NVIDIA JetPack SDK for accelerating software, and an ecosystem with sensors, SDKs, services, and products to speed development.
Applications for Jetson-based systems typically do not require formal safety certification. However, many embedded and autonomous systems can directly benefit from the deterministic, real-time GPU graphics and compute acceleration provided by Vulkan SC. With these capabilities, the Jetson platform can support a broader diversity of applications.
The NVIDIA JetPack 5.0.2 SDK, released on August 15, 2022, includes conformant, production Vulkan SC 1.0 drivers for the Linux OS.
Ongoing NVIDIA commitment to the Vulkan SC API
NVIDIA will continue to invest in the evolution of the Vulkan SC open standard API at Khronos. We are committed to providing conformant, production drivers on platforms such as NVIDIA DRIVE and Jetson.
Later in 2022, NVIDIA will also ship support for Vulkan SC in NVIDIA Nsight developer tools. Vulkan SC streamlines the open, cross-platform Vulkan API for deterministic GPU graphics and compute, enabling advanced applications and use cases on safety-certified and real-time embedded platforms.
Now, NVIDIA provides industry-leading support for this groundbreaking open standard, enabling GPU acceleration in new classes of products. Download the latest NVIDIA DRIVE or NVIDIA JetPack releases with Vulkan SC drivers today.
It is well-known that GPUs are the typical go-to solution for large machine learning (ML) applications, but what if GPUs were applied to earlier stages of the data-to-AI pipeline?
For example, it would be simpler if you did not have to switch out cluster configurations for each pipeline processing stage. You might still have some questions:
Would this be practical from a cost perspective?
Could you still meet SLAs on a data-processing time budget for some near-real-time processing?
How difficult is it to optimize these GPU clusters?
If you optimized the configuration for one stage, would the same work for other stages?
At AT&T, these questions arose when our data teams were working to manage cloud costs while balancing simplicity at scale. We also observed that many of our data engineer and data scientist colleagues were not aware that GPUs can be an effective and efficient infrastructure for running the more mundane ETL and feature engineering stages.
It was also not clear how the performance of CPU and GPU configurations would compare. Our goal at AT&T was to run a few typical configuration examples to understand the difference.
In this post, we share our data pipeline analysis in terms of speed, cost, and full pipeline simplicity. We also provide insights on design considerations and explain how we optimized the performance and price of our GPU cluster. The optimization came from using the RAPIDS accelerator for Apache Spark, an open-source library that enables GPU-accelerated ETL and feature engineering.
SPOILER ALERT: We were pleasantly surprised that, at least for the examples examined, the use of GPUs for each pipeline stage proved to be faster, cheaper, and simpler!
Use cases
Data-to-AI pipelines include multiple stages of batch processing:
Data preparation or federation
Transformation
Feature engineering
Data extraction
Batch processing involves processing a large volume of data containing trillions of records. Batch processing jobs are generally optimized for either cost or performance depending on the SLA for that use case.
A good example of a batch processing job optimized for cost is creating features from call records, which are then used to train an ML model. On the other hand, a real-time inference use case to detect fraud is optimized for performance. GPUs are often overlooked for these batch-processing stages of AI/ML pipelines because they are considered too expensive.
These batch processing jobs often involve large joins, aggregation, ranking, and transformation operations. As you can imagine, AT&T has many data and AI use cases that involve batch processing:
Network planning and optimization
Fraud
Sales and marketing
Tax
Depending on the use case, these pipelines can use NVIDIA GPUs and RAPIDS Accelerator for Apache Spark to optimize costs or improve performance.
For this analysis, we look at a couple of data-to-AI pipelines. The first uses feature engineering of call records for a marketing use case, and the second use case carries out an ETL transformation of a complex tax dataset.
Speeding up feature engineering and transformation with GPUs
Efficiently scaling data-to-AI pipelines remains a pressing need for data teams: high-cost pipelines process hundreds of terabytes to petabytes of data on a monthly, weekly, or even daily basis.
When examining efficiency, it is important to identify optimization opportunities across all ETL and feature engineering stages and then compare speed, cost, and pipeline simplicity.
For our data pipeline analysis, we compared three options:
An Apache Spark CPU cluster using Databricks’ newly released Photon engine
An Apache Spark GPU cluster using the RAPIDS Accelerator for Apache Spark
A bare-bones VM solution using AT&T's open-sourced GS-lite, which lets you write SQL that compiles to C++, included to gauge how far we were from the optimum cost
As mentioned earlier, after optimizing each solution, we discovered that the RAPIDS Accelerator for Apache Spark running on a GPU cluster had the best overall trade-off among speed, cost, and design simplicity.
In the following sections, we discuss the chosen optimizations and design considerations for each.
Design considerations for optimizing the AI/ML pipeline solution
To compare the performance of the three potential solutions, we carried out two experiments, one for each of the selected use cases. For each, we optimized different parameters to gain insight into how speed, cost, and design are affected.
Example 1: Optimizing simple group by aggregations for the call-records use case
For the first feature engineering example, we chose to create features from call-record datasets containing close to three trillion records (rows) a month (Table 1). This data preprocessing use case is a fundamental building block in several sales and marketing AI pipelines, such as customer segmentation, predicting customer churn, and anticipating customer trends and sentiments. There were various data transformations in this use case, but many of them involved simple “group by” aggregations such as the following, for which we wanted to optimize the processing.
res = spark.sql("""
    SELECT DataHour, dev_id,
           sum(fromsubbytes) AS fromsubbytes_total,
           sum(tosubbytes) AS tosubbytes_total
    FROM df
    GROUP BY DataHour, dev_id
""")
Accessing insights from data and carrying out data analysis is still one of the biggest pain points for many enterprises. This is not because of a lack of data, but because the time spent on data preparation and analysis is still an impediment for data engineers and data scientists.
Here are some of the key infrastructure challenges in this preprocessing example:
Query execution on a CPU cluster takes too long, leading to timeout issues.
The compute cost is expensive.
Days | # of Rows | Size
1 | ~110 billion | >2 TB
7 | ~800 billion | ~16 TB
30 | ~3 trillion | ~70 TB
Table 1. Number of call records and data size
Also, this call-records use case had an extra dimension of experimentation: compression type. The data arrived from the network edge to the cloud with a form of compression that we could specify, which let us evaluate the tradeoffs. We experimented with several compression schemes, including txt/gzip, Parquet/Zstandard, and Parquet/Snappy.
Zstandard compression produced the smallest files (about half the size in this case). As we show later, however, we found better speed/cost tradeoffs with Parquet/Snappy.
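As a concrete illustration of how these codecs are selected when writing data with Spark, here is a minimal PySpark sketch; the paths are hypothetical, and the options shown are standard Spark writer options, not our production settings.

# Minimal sketch: write the same dataset with different Parquet codecs to compare
# file sizes and downstream read performance. Paths and schema are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compression-comparison").getOrCreate()

df = spark.read.option("header", True).csv("/data/call_records/raw")  # hypothetical input path

# Parquet with Zstandard: smallest files in our tests
df.write.mode("overwrite").option("compression", "zstd").parquet("/data/call_records/parquet_zstd")

# Parquet with Snappy: slightly larger files, but a better speed/cost tradeoff for this workload
df.write.mode("overwrite").option("compression", "snappy").parquet("/data/call_records/parquet_snappy")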
Next, we considered the type of cluster with regard to the number of cores per VM, number of VMs, the allocation of worker nodes, and whether to use a CPU or GPU.
For CPU clusters, we chose the lowest number of cores that could handle the workload, that is, the lowest number of VMs and workers to prevent an over-allocation of resources.
For GPUs, we used the RAPIDS Accelerator for Apache Spark tuning guide, which provides sizing recommendations for concurrent tasks per executor, maxPartitionBytes, shuffle partitions, and concurrent GPU tasks.
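To illustrate which knobs this covers, here is a hedged PySpark configuration sketch. The plugin class and property names come from the RAPIDS Accelerator documentation, while the values are placeholders rather than the settings used in this analysis.

# Illustrative Spark session configuration for a GPU cluster with the RAPIDS Accelerator.
# Assumes the RAPIDS Accelerator jar is already on the Spark classpath; values are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("gpu-feature-engineering")
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")        # enable the RAPIDS Accelerator
    .config("spark.rapids.sql.enabled", "true")
    .config("spark.rapids.sql.concurrentGpuTasks", "2")           # concurrent tasks sharing each GPU
    .config("spark.sql.files.maxPartitionBytes", "512m")          # input split size per task
    .config("spark.sql.shuffle.partitions", "200")                # shuffle partitions sized to the data
    .getOrCreate()
)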
One goal after implementing the data processing on the GPU was to ensure that all key feature engineering steps remained on the GPU (Figure 1).
Example 2: Optimizing multiple ETL and feature creation stages for the tax dataset
The use case for Example 2 allowed us to compare many different transformations and processing stages for ETL, feature creation, and AI. Each stage had different record volume sizes (Figure 2).
This ETL pipeline with multiple stages is a common bottleneck in enterprises where data lives in silos. Most often, massive data processing requires querying and joining data from two or more data sources using fuzzy logic. As you can see from Figure 2, even though we started with only 20 million rows of data, the data volume grew exponentially as we moved through the data processing stages.
As in Example 1, the design considerations were the number of cores per VM, number of VMs, and the allocation of worker nodes, when comparing CPUs and GPUs.
Results
After trying different core, worker, and cluster configurations for the use cases shown in Examples 1 and 2, we collected the results. We ensured that each ETL job finished within the allocated time so that it could keep up with the input data rate. In both cases, the best approach had the lowest cost and the greatest simplicity.
Example 1 results
Figure 3 shows the cost/speed tradeoffs across a range of setups for simple group by aggregations in the call records use case. You can make several observations:
The lowest-cost and simplest solution is the GPU cluster with Snappy compression, which is ~33% cheaper than the lowest-cost Photon solution and close to half the cost of the fastest Photon solution.
All standard Databricks clusters performed worse on both cost and execution time. Photon was the best CPU solution.
While not shown in Figure 3, the GS-lite solution was actually the cheapest, needing only two VMs.
Example 2 results
Like Example 1, we tried several CPU and GPU cluster configurations for the five ETL and AI data processing stages with the Databricks 10.4 LTS ML runtime. Table 2 shows the resulting best configurations.
Configuration | CPU | GPU
Worker type | Standard_D13_v2: 56 GB memory, 8 cores; min workers 8, max workers 12 | Standard_NC8as_T4_v3: 56 GB memory, 1 GPU; min workers 2, max workers 16
Driver type | Standard_D13_v2: 56 GB memory, 8 cores | Standard_NC8as_T4_v3: 56 GB memory, 1 GPU
Table 2. Cloud VM instances for CPU and GPU
With these configurations, both the relative cost and the execution time (speed) favored GPUs (Figure 4).
While not shown here, we confirmed that the later stages of the AI pipeline in Example 1, which used XGBoost modeling, also benefited from GPUs and the RAPIDS Accelerator for Apache Spark. This suggests that GPUs can be the best end-to-end solution.
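The modeling code is not shown in this post, but a minimal sketch of GPU-accelerated XGBoost training (using the gpu_hist tree method available at the time) looks roughly like the following; the data and hyperparameters are placeholders.

# Minimal sketch of GPU-accelerated XGBoost training; data and parameters are placeholders.
import numpy as np
import xgboost as xgb

X = np.random.rand(10000, 50)          # stand-in for engineered features
y = np.random.randint(0, 2, 10000)     # stand-in for binary labels

dtrain = xgb.DMatrix(X, label=y)
params = {
    "objective": "binary:logistic",
    "tree_method": "gpu_hist",         # run the histogram algorithm on the GPU
    "max_depth": 8,
    "eta": 0.1,
}
model = xgb.train(params, dtrain, num_boost_round=100)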
Conclusion
While our tests did not cover all of AT&T's data and AI pipelines, GPU-based pipelines were beneficial in every example we examined. In these cases, we were able to cut down the time for data preparation, model training, and optimization. This resulted in spending less money with a simpler design, as there was no configuration switching across stages.
We encourage you to experiment with your own data-to-AI pipelines, especially if you're already using GPUs for AI/ML training. You may also find that GPUs are your go-to solution: simpler, faster, and cheaper!
Interested in learning more about these use cases and experiments, or getting tips on how to cut down your data processing time with the RAPIDS Accelerator for Apache Spark? Register for the free AT&T GTC session, How AT&T Supercharged their Data Science Efforts.
Automatic speech recognition (ASR) research generally focuses on high-resource languages such as English, which is supported by hundreds of thousands of hours of speech. Recent literature has renewed focus on more complex languages, such as Japanese. Like other Asian languages, Japanese has a vast base character set (upwards of 3,000 unique characters are used in common vernacular), and poses unique challenges, such as multiple word order.
To learn more about the technology discussed in this post, watch the author’s INTERSPEECH 2022 presentation on September 20, Neural Transducers, Streaming ASR, and Novel ASR Models.
This post discusses recent work that improved both accuracy and speed for Japanese language ASR. First, we improved Conformer, a state-of-the-art ASR neural network architecture, to achieve a significant improvement in training and inference speed without accuracy loss. Second, we enhanced a pure, deep convolutional network with a multi-head self-attention mechanism to enrich the learning of global contextual representations of the input speech.
Deep sparse conformer for speech recognition
Conformer is a neural network architecture widely used in ASR systems for numerous languages and has achieved high levels of accuracy. However, Conformer is relatively slow at both training and inference because it uses multi-head self-attention with time and memory complexity that is quadratic in the length of the input audio.
This prevents efficient processing of long audio sequences, since relatively large memory footprints are required during training and inference. This motivates the use of sparse attention to build an efficient Conformer. In addition, with sparse attention and relatively low memory cost, we are able to build a deeper network that can process long sequences from large-scale speech datasets.
As depicted in Figure 1, we improved the Conformer's long-sequence representation ability in two directions: sparser and deeper. We used a ranking criterion to select only a small set of dominant queries, instead of the whole query set, to save time when computing attention scores.
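The exact ranking criterion is defined in the paper; the following simplified PyTorch sketch only illustrates the general idea of attending with a selected subset of queries, using a placeholder score (the query norm) rather than the authors' criterion.

# Simplified sketch of sparse attention via query selection (placeholder ranking criterion).
import torch

def sparse_attention(q, k, v, keep_ratio=0.25):
    """q, k, v: (batch, seq_len, dim). Compute attention only for the top-k 'dominant' queries."""
    b, n, d = q.shape
    n_keep = max(1, int(n * keep_ratio))

    scores = q.norm(dim=-1)                          # placeholder dominance score per query
    top_idx = scores.topk(n_keep, dim=-1).indices    # (b, n_keep)
    q_sel = torch.gather(q, 1, top_idx.unsqueeze(-1).expand(-1, -1, d))

    attn = torch.softmax(q_sel @ k.transpose(1, 2) / d ** 0.5, dim=-1)   # (b, n_keep, n)
    out_sel = attn @ v                                                   # (b, n_keep, d)

    # Non-selected positions fall back to the mean of the values (a cheap default).
    out = v.mean(dim=1, keepdim=True).expand(b, n, d).clone()
    out.scatter_(1, top_idx.unsqueeze(-1).expand(-1, -1, d), out_sel)
    return out

q = k = v = torch.randn(2, 100, 64)
print(sparse_attention(q, k, v).shape)   # torch.Size([2, 100, 64])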
A deep normalization strategy is used when performing residual connections to ensure that Conformer stacks with on the order of one hundred blocks can be trained. This strategy discounts the parameters of the encoder and decoder with functions of the number of encoder layers and decoder layers, respectively.
This deep normalization strategy ensures that models with 10 to 100 layers can be built successfully, making the model more expressive. At the same time, the deep sparse Conformer's time and memory costs decrease by 10% to 20% compared to the usual Conformer.
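The exact scaling functions are given in the paper; the sketch below only shows the general pattern of a deep-normalized residual connection, with a DeepNorm-style constant standing in for the layer-count-dependent function.

# Illustrative deep-normalized residual connection; alpha stands in for the
# layer-count-dependent scaling function described in the paper.
import torch
import torch.nn as nn

class DeepNormResidual(nn.Module):
    def __init__(self, dim, num_layers):
        super().__init__()
        self.sublayer = nn.Linear(dim, dim)       # stand-in for an attention or convolution sublayer
        self.norm = nn.LayerNorm(dim)
        self.alpha = (2 * num_layers) ** 0.25     # example scaling that grows with depth

    def forward(self, x):
        return self.norm(self.alpha * x + self.sublayer(x))

block = DeepNormResidual(dim=256, num_layers=100)
print(block(torch.randn(4, 80, 256)).shape)       # torch.Size([4, 80, 256])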
Attention-enhanced Citrinet for speech recognition
Citrinet, proposed by NVIDIA researchers, is an end-to-end convolutional Connectionist Temporal Classification (CTC)-based ASR model. To capture local and global contextual information, Citrinet uses 1D time-channel separable convolutions combined with sub-word encoding and squeeze-and-excitation (SE), enabling the whole architecture to achieve state-of-the-art accuracies compared with transformer-based counterparts.
Applying Citrinet to Japanese ASR involves several challenges. Specifically, it converges relatively slowly and is more difficult to train to comparable accuracy than similar deep neural network models. Considering that Citrinet has as many as 235 convolutional layers, which influence its convergence speed, we aimed to reduce the number of CNN layers by introducing multi-head attention in the convolution module of the Citrinet blocks while keeping the SE and residual modules unchanged.
As shown in Figure 2, speeding up training involves removing eight convolution layers in each attention-enhanced Citrinet block. In addition, considering that self-attention has time and memory complexity quadratic in the length of the input audio, we reduced the original 23 Jasper blocks to eight blocks, with a significant reduction in model size. This design ensures that attention-enhanced Citrinet reaches comparable inference time for long speech sequences with layer counts from the 20s to the 100s.
Preliminary experiments show that attention-based models converge in 100 to 200 epochs, while 500 to 1,000 epochs are required for Citrinet to converge to its best error rates. Experiments on the Japanese CSJ-500-hour dataset show that attention-Citrinet requires fewer blocks and converges faster to lower character error rates than Citrinet with 80% of the training time, and than Conformer with 40% of the training time and 18.5% of the model size.
Summary
In summary, we propose two novel architectures for building end-to-end Japanese ASR models. In one direction, we improved the transformer-based Conformer's training and inference speed while retaining its accuracy, successfully building sparser and deeper Conformer models. In the other, we improved the CNN-based Citrinet's convergence speed and accuracy by introducing a multi-head self-attention mechanism and pruning 80% of the CNN layers. These proposals are general and applicable to other Asian languages.
Loss functions for training automatic speech recognition (ASR) models are not set in stone, and the older rules baked into them are not necessarily optimal. Consider connectionist temporal classification (CTC) and see how changing some of its rules lets you reduce the GPU memory required for training and inference of CTC-based models, and more.
For more information about the technology behind this post, tune into the author’s presentation at INTERSPEECH 2022 on Monday, September 19, 14:30-16:30 (KST), Virtual Poster: Novel models and training methods for ASR II.
Overview of connectionist temporal classification
If you are going to train an ASR model, whether it’s a convolutional or recurrent neural network, a transformer, or a combination, you are most likely training it with the CTC loss.
CTC is simple and convenient because it does not require per-frame information about “what sound is pronounced when” (so-called audio-text time alignment). In most cases, this knowledge is simply unavailable, as in a typical ASR dataset of audio with associated text with no time marks.
True time alignment is not always trivial. Suppose that most of the recording contains no speech, with only one short phrase at the end. CTC loss does not tell the model when exactly to emit a prediction. Instead, it allows for every possible alignment and just regulates the form of these alignments.
Here’s how exactly CTC manages all possible ways to align audio against text.
First, the target text is tokenized, that is, words are sliced into letters or word pieces. The number of the resulting units (whatever they are) should be less than the number of audio “timeframes”: segments of audio of length 0.01 to 0.08 seconds.
If there are fewer timeframes than units, the algorithm fails. In this case, you should make your timeframes shorter; otherwise, only RNN Transducer can save you. If there are exactly as many timeframes as units, there can be only one alignment (a one-in-a-million case).
Most of the time, there are far more timeframes than units, so some frames are left without a unit. For such empty frames, CTC has a special unit: the blank. This unit tells you that the model has nothing to output at this particular frame, either because there is no speech or because the model is just too lazy to predict something meaningful here. The ability to predict nothing when the model doesn't want to is provided by the most important rule of CTC: the blank rule.
Other rules are related to unit continuation. Suppose that your unit is a vowel that lasts longer than one frame. On which of the two frames should the model output the unit? CTC allows the same unit to be emitted on multiple consecutive frames. But then, the same consecutive units should be merged into one to convert the recognition results into a sequence of units that resembles text.
Now, what if your tokenized text itself contains repeated units like “ll” or “pp”? If left unprocessed, these units are merged into one even though they shouldn't be. For this special case, CTC has a rule that if the target text has repeated units, then such units must be separated by a blank during inference.
To sum up, in almost every frame the model is allowed to emit the same unit as in the previous frame, a blank, or the next unit if it is different from the previous one. These rules are more sophisticated than the blank rule, and they are not strictly necessary for CTC to work.
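To make these rules concrete, here is a tiny sketch that applies them in reverse at decoding time: merge repeated consecutive units, then drop blanks.

# Collapse a frame-level CTC output into text: merge consecutive repeats, then remove blanks.
BLANK = "-"

def ctc_collapse(frames):
    out = []
    prev = None
    for unit in frames:
        if unit != prev:           # repeated consecutive units are merged
            if unit != BLANK:      # blanks are dropped from the final output
                out.append(unit)
        prev = unit
    return "".join(out)

# A blank is needed between the two l's so they are not merged into one
print(ctc_collapse(list("hhe-ll-l-oo")))   # hello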
CTC implementation
Here's how the CTC loss can be implemented. Like most loss functions in machine learning, it is usually implemented as a dynamic programming algorithm that applies these rules to a training utterance and to the model's softmax output.
In training, loss values and gradients are computed from the conditional probabilities of all possible alignments by the Baum–Welch algorithm, applied to the CTC rules. CTC implementations often run to hundreds or even thousands of lines of code and can be hard to modify.
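For reference, this is roughly what standard CTC loss usage looks like in PyTorch; it is a generic example with illustrative shapes, not the WFST-based implementation discussed next.

# Generic PyTorch CTC loss usage; shapes and vocabulary size are illustrative.
import torch
import torch.nn as nn

T, N, C = 100, 4, 32           # timeframes, batch size, units including blank (index 0)
log_probs = torch.randn(T, N, C).log_softmax(dim=-1).requires_grad_()
targets = torch.randint(1, C, (N, 20), dtype=torch.long)    # unit indices, no blanks
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 20, dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
print(loss.item())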
Fortunately, there is another way to implement CTC. The weighted finite-state transducer (WFST) approach, used in many other application areas as well, represents a dynamic programming algorithm as a set of graphs and associated graph operations. This approach lets you decouple the CTC rules from applying them to specific audio and text, and from calculating the loss and gradients.
CTC WFST applications
With WFST, you can easily take the CTC rules and use them with different criteria, like maximum mutual information (MMI). These models usually have a lower word error rate (WER) than CTC models. MMI incorporates prior language information into the training process.
In contrast to CTC, MMI not only maximizes the probability of the most feasible path but also minimizes the probabilities of every other path. To do this, MMI has a so-called denominator graph, which can occupy a lot of GPU memory during training. Fortunately, some CTC rules can be modified to reduce denominator memory consumption without compromising speech recognition accuracy.
Also, a WFST representation of CTC rules, or a so-called topology, can be used to allow for WFST decoding of CTC models. To do that, you convert an N-gram language model to a WFST graph and compose it with the topology graph. The resulting decoding graph can be passed to, for example, the Riva CUDA WFST Decoder.
Decoding graphs can be large to the point that they may not fit in GPU memory. But with some CTC topology modifications, you can reduce decoding graph sizes for CTC.
CTC topologies
Figure 1 shows the CTC topology, Correct-CTC: a directed complete graph with self-loops, so for N units, including blank, there are N states and N² arcs.
Correct-CTC is the most commonly used CTC representation. Consider the typical sizes that this topology produces: for the LibriSpeech 4-gram word language model and a model vocabulary of 256 units, the decoding graph size is ~16 GB. Cold-start MMI training on a 32-GB GPU with a model vocabulary size of 2,048 is possible only with batch size 1.
You can reduce the memory consumption induced by Correct-CTC by dropping some of the CTC rules. First, drop the mandatory separation of repeated units with blanks. Without this rule, you end up with the topology called Compact-CTC (Figure 2). It has 3N − 2 arcs for N units.
Decoding graph sizes with Compact-CTC are a quarter smaller than with Correct-CTC. It also requires 2x less GPU memory for MMI training.
Now drop the emission of the same unit on multiple consecutive frames, leaving only the blank rule. This way, you end up with a topology that has only one state and N arcs for N units.
This is the smallest possible CTC topology, so we call it Minimal-CTC (Figure 3). It requires even less GPU memory for MMI training (4x less compared to Correct-CTC), but the accuracy of an MMI-trained model with the Minimal-CTC topology will degrade compared to the baseline.
The smallest topology also produces the smallest decoding WFSTs with half the size of the baseline graph. Decoding graphs compiled with Minimal-CTC are incompatible with models built with Correct-CTC or Compact-CTC.
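To show how small this topology is, the following sketch writes Minimal-CTC as a single-state transducer in OpenFST text format; the unit numbering is illustrative, and real toolkits such as k2 build these graphs programmatically.

# Sketch: Minimal-CTC topology as an OpenFST-style text FST.
# Units are numbered 1..num_units (0 is reserved for epsilon); blank maps to epsilon on the output side.
def minimal_ctc_topology(num_units, blank_id=1):
    lines = []
    for unit in range(1, num_units + 1):
        olabel = 0 if unit == blank_id else unit   # blank produces nothing on the output
        lines.append(f"0 0 {unit} {olabel}")       # format: src dst ilabel olabel
    lines.append("0")                              # state 0 is both start and final
    return "\n".join(lines)

print(minimal_ctc_topology(num_units=4))           # one state, N self-loop arcs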
Finally, come back to Correct-CTC, but this time keep the mandatory separation of repeated units and drop unit continuation. The resulting topology, called Selfless-CTC, was designed to remedy the shortcomings of Minimal-CTC.
Figures 1 and 4 show that Correct-CTC and Selfless-CTC differ only in their non-blank self-loops. The two topologies give the same MMI model accuracy, or even better accuracy if the model has a long context window. Moreover, Selfless-CTC is also compatible with Minimal-CTC at decoding: you get the 2x graph size reduction of Minimal-CTC at the cost of only 0.2% higher WER.
Conclusion
There are several tips for better performance:
Use Compact-CTC instead of Correct-CTC for decoding graph construction and MMI training.
For the best decoding graph size reduction, train your models with Selfless-CTC and decode with Minimal-CTC.
Loss functions are not set in stone: Experiment with your own WFST representations of existing loss functions and create new ones. It’s fun!
Posted by Yuxiang Yang, Student Researcher, Robotics at Google
An important promise for quadrupedal robots is their potential to operate in complex outdoor environments that are difficult or inaccessible for humans. Whether it’s to find natural resources deep in the mountains, or to search for life signals in heavily-damaged earthquake sites, a robust and versatile quadrupedal robot could be very helpful. To achieve that, a robot needs to perceive the environment, understand its locomotion challenges, and adapt its locomotion skill accordingly. While recent advances in perceptive locomotion have greatly enhanced the capability of quadrupedal robots, most works focus on indoor or urban environments, thus they cannot effectively handle the complexity of off-road terrains. In these environments, the robot needs to understand not only the terrain shape (e.g., slope angle, smoothness), but also its contact properties (e.g., friction, restitution, deformability), which are important for a robot to decide its locomotion skills. As existing perceptive locomotion systems mostly focus on the use of depth cameras or LiDARs, it can be difficult for these systems to estimate such terrain properties accurately.
In “Learning Semantics-Aware Locomotion Skills from Human Demonstrations”, we design a hierarchical learning framework to improve a robot’s ability to traverse complex, off-road environments. Unlike previous approaches that focus on environment geometry, such as terrain shape and obstacle locations, we focus on environment semantics, such as terrain type (grass, mud, etc.) and contact properties, which provide a complementary set of information useful for off-road environments. As the robot walks, the framework decides the locomotion skill, including the speed and gait (i.e., shape and timing of the legs’ movement) of the robot based on the perceived semantics, which allows the robot to walk robustly on a variety of off-road terrains, including rocks, pebbles, deep grass, mud, and more.
Our framework selects skills (gait and speed) of the robot from the camera RGB image. We first compute the speed from terrain semantics, and then select a gait based on the speed.
Overview
The hierarchical framework consists of a high-level skill policy and a low-level motor controller. The skill policy selects a locomotion skill based on camera images, and the motor controller converts the selected skill into motor commands. The high-level skill policy is further decomposed into a learned speed policy and a heuristic-based gait selector. To decide on a skill, the speed policy first computes the desired forward speed, based on the semantic information from the onboard RGB camera. For energy efficiency and robustness, quadrupedal robots usually select a different gait for each speed, so we designed the gait selector to compute a desired gait based on the forward speed. Lastly, a low-level convex model-predictive controller (MPC) converts the desired locomotion skill into motor torque commands and executes them on the real hardware. We train the speed policy directly in the real world using imitation learning because it requires less training data than standard reinforcement learning algorithms.
The framework consists of a high-level skill policy and a low-level motor controller.
Learning Speed Command from Human Demonstrations
As the central component in our pipeline, the speed policy outputs the desired forward speed of the robot based on the RGB image from the onboard camera. Although many robot learning tasks can leverage simulation as a source of lower-cost data collection, we train the speed policy in the real world because accurate simulation of complex and diverse off-road environments is not yet available. As policy learning in the real world is time-consuming and potentially unsafe, we make two key design choices to improve the data efficiency and safety of our system.
The first is learning from human demonstrations. Standard reinforcement learning algorithms typically learn by exploration, where the agent attempts different actions in an environment and builds preferences based on the rewards received. However, such explorations can be potentially unsafe, especially in off-road environments, since any robot failures can damage both the robot hardware and the surrounding environment. To ensure safety, we train the speed policy using imitation learning from human demonstrations. We first ask a human operator to teleoperate the robot on a variety of off-road terrains, where the operator controls the speed and heading of the robot using a remote joystick. Next, we collect the training data by storing (image, forward_speed) pairs. We then train the speed policy using standard supervised learning to predict the human operator’s speed command. As it turns out, the human demonstration is both safe and high-quality, and allows the robot to learn a proper speed choice for different terrains.
The second key design choice is the training method. Deep neural networks, especially those involving high-dimensional visual inputs, typically require lots of data to train. To reduce the amount of real-world training data required, we first pre-train a semantic segmentation model on RUGD (an off-road driving dataset where the images look similar to those captured by the robot’s onboard camera), where the model predicts the semantic class (grass, mud, etc.) for every pixel in the camera image. We then extract a semantic embedding from the model’s intermediate layers and use that as the feature for on-robot training. With the pre-trained semantic embedding, we can train the speed policy effectively using less than 30 minutes of real-world data, which greatly reduces the amount of effort required.
We pre-train a semantic segmentation model and extract a semantic embedding to be fine-tuned on robot data.
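A minimal sketch of this supervised setup, assuming the pre-trained backbone already yields a fixed-size semantic embedding per image (dimensions and data below are stand-ins):

# Minimal sketch of the speed policy: regress the operator's speed command from a
# pre-computed semantic embedding. Dimensions and data are placeholders.
import torch
import torch.nn as nn

class SpeedPolicy(nn.Module):
    def __init__(self, embedding_dim=256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(embedding_dim, 128), nn.ReLU(),
            nn.Linear(128, 1),             # predicted forward speed (m/s)
        )

    def forward(self, embedding):
        return self.head(embedding).squeeze(-1)

# (embedding, forward_speed) pairs collected during teleoperation (random stand-ins here)
embeddings = torch.randn(512, 256)
speeds = torch.rand(512) * 1.5

policy = SpeedPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
for _ in range(100):                       # standard supervised (imitation) training loop
    loss = nn.functional.mse_loss(policy(embeddings), speeds)
    opt.zero_grad()
    loss.backward()
    opt.step()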
Gait Selection and Motor Control
The next component in the pipeline, the gait selector, computes the appropriate gait based on the speed command from the speed policy. The gait of a robot, including its stepping frequency, swing height, and base height, can greatly affect the robot’s ability to traverse different terrains.
Scientific studies have shown that animals switch between different gaits at different speeds, and this result is further validated in quadrupedal robots, so we designed the gait selector to compute a robust gait for each speed. Compared to using a fixed gait across all speeds, we find that the gait selector further enhances the robot’s navigation performance on off-road terrains (more details in the paper).
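A hedged sketch of such a heuristic selector, with made-up speed thresholds and gait parameters, could look like the following:

# Heuristic gait selector sketch: map the commanded speed to gait parameters.
# Thresholds and parameter values are made up for illustration.
def select_gait(speed_mps):
    if speed_mps < 0.6:        # slow, e.g., deep grass or mud
        return {"stepping_frequency_hz": 2.0, "swing_height_m": 0.12, "base_height_m": 0.30}
    elif speed_mps < 1.2:      # medium, e.g., gravel
        return {"stepping_frequency_hz": 2.5, "swing_height_m": 0.09, "base_height_m": 0.28}
    else:                      # fast, e.g., asphalt
        return {"stepping_frequency_hz": 3.0, "swing_height_m": 0.06, "base_height_m": 0.26}

print(select_gait(0.5))   # slow gait, as used on deep grass
print(select_gait(1.4))   # fast gait, as used on asphalt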
The last component of the pipeline is a motor controller, which converts the speed and gait commands into motor torques. Similar to previous work, we use separate control strategies for swing and stance legs. By separating the task of skill learning and motor control, the skill policy only needs to output the desired speed, and does not need to learn low-level locomotion controls, which greatly simplifies the learning process.
Experiment Results
We implemented our framework on an A1 quadrupedal robot and tested it on an outdoor trail with multiple terrain types, including grass, gravel, and asphalt, which pose varying degrees of difficulty for the robot. For example, while the robot needs to walk slowly with high foot swings in deep grass to prevent its foot from getting stuck, on asphalt it can walk much faster with lower foot swings for better energy efficiency. Our framework captures such differences and selects an appropriate skill for each terrain type: slow speed (0.5m/s) on deep grass, medium speed (1m/s) on gravel, and high speed (1.4m/s) on asphalt. It completes the 460m-long trail in 9.6 minutes with an average speed of 0.8m/s (i.e., that’s 1.8 miles or 2.9 kilometers per hour). In contrast, non-adaptive policies either cannot complete the trail safely or walk significantly slower (0.5m/s), illustrating the importance of adapting locomotion skills based on the perceived environments.
The framework selects different speeds based on conditions of the trail.
To test generalizability, we also deployed the robot to a number of trails that are not seen during training. The robot traverses through all of them without failure, and adjusts its locomotion skills based on terrain semantics. In general, the skill policy selects a faster skill on rigid and flat terrains and a slower speed on deformable or uneven terrain. At the time of writing, the robot has traversed over 6km of outdoor trails without failure.
With the framework, the robot walks safely on a variety of outdoor terrains not seen during training.
Conclusion
In this work, we present a hierarchical framework to learn semantic-aware locomotion skills for off-road locomotion. Using less than 30 minutes of human demonstration data, the framework learns to adjust the speed and gait of the robot based on the perceived semantics of the environment. The robot can walk safely and efficiently on a wide variety of off-road terrains. One limitation of our framework is that it only adjusts locomotion skills for standard walking and does not support more agile behaviors such as jumping, which can be essential for traversing more difficult terrains with gaps or hurdles. Another limitation is that our framework currently requires manual steering commands to follow a desired path and reach the goal. In future work, we plan to look into a deeper integration of the high-level skill policy with the low-level controller for more agile behaviors, and incorporate navigation and path planning into the framework so that the robot can operate fully autonomously in challenging off-road environments.
Acknowledgements
We would like to thank our paper co-authors: Xiangyun Meng, Wenhao Yu, Tingnan Zhang, Jie Tan, and Byron Boots. We would also like to thank the team members of Robotics at Google for discussions and feedback.