Categories
Misc

Detecting Objects in Point Clouds Using ROS 2 and TAO-PointPillars


Accurate, fast object detection is an important task in robotic navigation and collision avoidance. Autonomous agents need a clear map of their surroundings to navigate to their destination while avoiding collisions. For example, in warehouses that use autonomous mobile robots (AMRs) to transport objects, avoiding hazardous machines that could damage the robots is a challenging problem.

This post presents a ROS 2 node for detecting objects in point clouds using a pretrained model from NVIDIA TAO Toolkit based on PointPillars. The node takes point clouds as input from real or simulated lidar scans, performs TensorRT-optimized inference to detect objects in this input data, and outputs the resulting 3D bounding boxes as a Detection3DArray message for each point cloud. 

While multiple ROS nodes exist for object detection from images, the advantages of performing object detection from lidar input include the following:

  • Lidar can calculate accurate distances to many detected objects simultaneously. With object distance and direction information provided directly by lidar, it’s possible to build an accurate 3D map of the environment. To obtain the same information from camera/image-based systems, a separate distance estimation process is required, which demands more compute power.
  • Lidar is not sensitive to changing lighting conditions (including shadows and bright light), unlike cameras.

An autonomous system can be made more robust by using a combination of lidar and cameras. This is because cameras can perform tasks that lidar cannot, such as detecting text on a sign. 

TAO-PointPillars is based on work presented in the paper PointPillars: Fast Encoders for Object Detection from Point Clouds, which describes an encoder that learns features from point clouds organized in vertical columns (or pillars). TAO-PointPillars uses both the encoded features and the downstream detection network described in the paper.

For our work, a PointPillar model was trained on a point cloud dataset collected by a solid state lidar from Zvision. The PointPillar model detects objects of three classes: Vehicle, Pedestrian, and Cyclist. You can train your own detection model following the TAO Toolkit 3D Object Detection steps, and use it with this node.

For details on running the node, visit NVIDIA-AI-IOT/ros2_tao_pointpillars on GitHub. You can also check out NVIDIA Isaac ROS for more hardware-accelerated ROS 2 packages provided by NVIDIA for various perception tasks. 

Block diagram of the ROS 2 TAO-PointPillars node with names of ROS 2 topics subscribed to and published by the node.
Figure 1. ROS 2 TAO-PointPillars node

ROS 2 TAO-PointPillars node 

This section provides more details about using the ROS 2 TAO-PointPillars node with your robotic application, including the input/output formats and how to visualize results. 

Node Input: The node takes point clouds as input in the PointCloud2 message format. Among other information, each point in the cloud must contain four features, (x, y, z, r): the X, Y, and Z coordinates and the reflectance (intensity).

Reflectance represents the fraction of a laser beam reflected back at some point in 3D space. Note that the range of reflectance values should be the same in the training data and the inference data. Parameters including the intensity range, class names, and NMS IoU threshold can be set from the node’s launch file.
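A rough sketch of a publisher that produces input in this format is shown below. The topic name, frame ID, and intensity field name are illustrative assumptions, so check the node’s launch file and README for the values it actually expects; the sensor_msgs_py helper used here ships with recent ROS 2 distributions.

```python
# Minimal sketch (not from the TAO-PointPillars repo): publishing a PointCloud2
# whose points carry the four required features. The topic name, frame_id, and
# the "intensity" field name are illustrative assumptions.
import rclpy
from rclpy.node import Node
from std_msgs.msg import Header
from sensor_msgs.msg import PointCloud2, PointField
from sensor_msgs_py import point_cloud2


class LidarPublisher(Node):
    def __init__(self):
        super().__init__('lidar_publisher')
        self.pub = self.create_publisher(PointCloud2, '/point_cloud', 10)

    def publish_points(self, points_xyzr):
        # points_xyzr: iterable of (x, y, z, reflectance) tuples, one per point
        fields = [
            PointField(name='x', offset=0, datatype=PointField.FLOAT32, count=1),
            PointField(name='y', offset=4, datatype=PointField.FLOAT32, count=1),
            PointField(name='z', offset=8, datatype=PointField.FLOAT32, count=1),
            PointField(name='intensity', offset=12, datatype=PointField.FLOAT32, count=1),
        ]
        header = Header(frame_id='lidar_frame',
                        stamp=self.get_clock().now().to_msg())
        self.pub.publish(point_cloud2.create_cloud(header, fields, points_xyzr))


def main():
    rclpy.init()
    node = LidarPublisher()
    node.publish_points([(1.0, 0.5, 0.2, 30.0)])  # single dummy point
    rclpy.shutdown()


if __name__ == '__main__':
    main()
```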

You can find ROS 2 bags for testing the node by visiting ZVISION-lidar/zvision_ugv_data on GitHub.

Figure 2. An example of the Zvision camera point of view (left), corresponding point cloud from the lidar (center), and the result after inference using TAO-PointPillars (right). The people and truck are detected correctly.

Node Output: The node outputs 3D bounding box information, the object class ID, and a score for each object detected in a point cloud, in the Detection3DArray message format. Each 3D bounding box is represented by (x, y, z, dx, dy, dz, yaw): the X, Y, and Z coordinates of the object center; its length (X direction), width (Y direction), and height (Z direction); and its yaw orientation in 3D Euclidean space.

The coordinate system used by the model during training and that used by the input data during inference must be the same for meaningful results. Figure 3 shows the coordinate system used by the TAO-PointPillars model.

The coordinate system used by TAO-PointPillars. Origin is the center of lidar. X axis is to the front, Y axis is to the left and Z axis is upwards. Yaw is the rotation in the X-Y plane, in counter-clockwise direction. So X axis corresponds to yaw = 0 and Y axis corresponds to yaw = pi / 2.
Figure 3. The coordinate system used by the TAO-PointPillars model
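For reference, the sketch below subscribes to the node’s output, unpacks the bounding box geometry, and recovers yaw from the orientation quaternion. The topic name /bbox is an assumption, and the class ID and score accessors are omitted because the ObjectHypothesisWithPose layout differs between vision_msgs versions.

```python
# Minimal sketch of a consumer for the node's output. The topic name "/bbox" is
# an assumption; only the bounding box geometry is unpacked here.
import math

import rclpy
from rclpy.node import Node
from vision_msgs.msg import Detection3DArray


def yaw_from_quaternion(q):
    # Rotation about the Z axis (the lidar X-Y plane shown in Figure 3).
    return math.atan2(2.0 * (q.w * q.z + q.x * q.y),
                      1.0 - 2.0 * (q.y * q.y + q.z * q.z))


class DetectionListener(Node):
    def __init__(self):
        super().__init__('detection_listener')
        self.create_subscription(Detection3DArray, '/bbox', self.on_detections, 10)

    def on_detections(self, msg):
        for det in msg.detections:
            c = det.bbox.center.position          # (x, y, z) of the box center
            s = det.bbox.size                     # (dx, dy, dz)
            yaw = yaw_from_quaternion(det.bbox.center.orientation)
            self.get_logger().info(
                f'center=({c.x:.2f}, {c.y:.2f}, {c.z:.2f}) '
                f'size=({s.x:.2f}, {s.y:.2f}, {s.z:.2f}) yaw={yaw:.2f}')


def main():
    rclpy.init()
    rclpy.spin(DetectionListener())
    rclpy.shutdown()


if __name__ == '__main__':
    main()
```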

Since Detection3DArray messages cannot currently be visualized in RViz, you can find a simple tool to visualize results by visiting NVIDIA-AI-IOT/viz_3Dbbox_ros2_pointpillars on GitHub.

For the example shown in Figure 4 below, both the input point clouds and the output Detection3DArray messages run at ~10 FPS on Jetson AGX Orin.

A GIF compilation of three images: The Zvision camera point of view; the point cloud from the Zvision lidar; and the detection results using TAO-PointPillars.
Figure 4. Counterclockwise from top left: An image from the Zvision camera point of view; the point cloud from the Zvision lidar; and the detection results using TAO-PointPillars

Summary

Accurate object detection in real time is necessary for an autonomous agent to navigate its environment safely. This post showcases a ROS 2 node that can detect objects in point clouds using a pretrained TAO-PointPillars model. (Note that the TensorRT engine for the model currently supports only a batch size of one.) The model performs inference directly on lidar input, which retains the advantages over image-based methods described above. For inference on lidar data, a model trained on data from the same lidar must be used; otherwise, accuracy drops significantly unless a method such as statistical normalization is implemented.

Categories
Misc

Google Colab’s ‘Pay As You Go’ Offers More Access to Powerful NVIDIA Compute for Machine Learning


Colab’s new Pay As You Go option helps you accomplish more with machine learning.
Access additional time on NVIDIA GPUs with the ability to upgrade to NVIDIA A100 Tensor Core GPUs when you need more power for your ML project.

Categories
Misc

The Wheel Deal: ‘Racer RTX’ Demo Revs to Photorealistic Life, Built on NVIDIA Omniverse

NVIDIA artists ran their engines at full throttle for the stunning Racer RTX demo, which debuted at last week’s GTC keynote, showcasing the power of NVIDIA Omniverse and the new GeForce RTX 4090 GPU. “Our goal was to create something that had never been done before,” said Gabriele Leone, creative director at NVIDIA.


Categories
Misc

All This and Mor-a Are Yours With Exclusive ‘Genshin Impact’ GeForce NOW Membership Reward

It’s good to be a GeForce NOW member. Genshin Impact’s new Version 3.1 update launches this GFN Thursday, just in time for the game’s second anniversary. Even better: GeForce NOW members can get an exclusive starter pack reward, perfect for their first steps in HoYoverse’s open-world adventure, action role-playing game.


Categories
Misc

Explainer: What is Zero Trust?


Zero trust is a cybersecurity strategy for verifying every user, device, application, and transaction in the belief that no user or process should be trusted.

Categories
Misc

AI Model Matches Radiologists’ Accuracy Identifying Breast Cancer in MRIs


Researchers from NYU Langone Health aim to improve breast cancer diagnostics with a new AI model. Recently published in Science Translational Medicine, the study outlines a deep learning framework that predicts breast cancer from MRIs as accurately as board-certified radiologists. The research could help create a foundational framework for implementing AI-based cancer diagnostic models in clinical settings.

“Breast MRI exams are difficult and time-consuming to interpret, even for experienced radiologists. AI has tremendous potential in improving medical diagnosis, as it can learn from tens of thousands of exams. Using AI to assist radiologists can make the process more accurate, and provide a higher level of confidence in the results,” said study senior author Krzysztof J. Geras, an assistant professor in the Department of Radiology at the NYU Grossman School of Medicine.

As a sensitive tool in breast cancer diagnostics, MRIs can help identify malignant lesions sometimes missed by mammograms and clinical exams.

Dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) is often used as a screening tool in high-risk patients and its applications are expanding. The tool can help doctors investigate potentially suspicious lesions or evaluate the extent of disease in newly diagnosed patients. This type of medical imaging can also help inform a doctor’s treatment plan, including whether to perform a biopsy or the extent of surgery needed. Both scenarios influence short- and long-term patient outcomes.

According to the researchers, MRIs have untapped potential in predicting disease pathology and achieving a better understanding of tumor biology. Developing AI models from large, well-annotated datasets could be key in refining the sensitivity of these scans and reducing unnecessary biopsies. 

The researchers produced an AI model that improves the accuracy of breast cancer diagnosis using a dataset extracted from clinical exams performed at a NYU Langone Health breast imaging site. The dataset consisted of bilateral DCE-MRI studies from patients categorized as high-risk screening, preoperative planning, routine surveillance, or follow-up after a suspicious finding.

They trained an ensemble of deep neural networks with 3D convolutions to detect spatiotemporal features, using 14,198 labeled MRI examinations and mixed-precision training with the open-source NVIDIA Apex library. According to the team, using this library made it possible to increase the batch size during training.
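As a rough sketch of that mixed-precision pattern (assuming the NVIDIA Apex amp API; the study’s actual network, data pipeline, and hyperparameters are not reproduced here), training looks roughly like this:

```python
# Illustrative sketch only: a placeholder 3D CNN and random volumes stand in for
# the study's model and MRI data. It shows the NVIDIA Apex amp pattern for
# mixed-precision training.
import torch
import torch.nn as nn
from apex import amp

model = nn.Sequential(                      # placeholder 3D convolutional network
    nn.Conv3d(1, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool3d(1),
    nn.Flatten(),
    nn.Linear(16, 2),                       # benign vs. malignant logits
).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Wrap the model and optimizer so forward/backward passes run in mixed precision.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

# Stand-in loader: batches of random "volumes" with binary labels.
loader = [(torch.randn(2, 1, 8, 64, 64), torch.randint(0, 2, (2,))) for _ in range(4)]

for volumes, labels in loader:
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(volumes.cuda()), labels.cuda())
    # Scale the loss so FP16 gradients do not underflow.
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()
```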

The networks were trained using the cuDNN-accelerated PyTorch framework on the university’s HPC cluster equipped with 136 NVIDIA V100 GPUs and NVIDIA NVLink for scaling memory and performance.

Multi-GPU training was accelerated with the NVIDIA Collective Communications Library (NCCL), and training a single model took 12 days on average.

“The GPUs did all the heavy lifting for training with high-dimensional data as we used full MRI volumes,” said Geras. 

The model performance was validated on a total of 3,936 MRIs from NYU Langone Health. Using three additional datasets from Duke University, Jagiellonian University Hospital in Poland, and the Cancer Genome Atlas Breast Invasive Carcinoma data collection, the team validated that the model can work across different populations and data sources.  

A framework outlining the steps taken in the study from data collection to personalizing management.
Figure 1. Overview of the study that trains and evaluates an AI system based on deep neural networks to predict the probability of breast cancer in DCE-MRI studies

The researchers compared the model results against five board-certified breast radiology attendings with 2 to 12 years of experience interpreting breast MRI exams. The clinicians interpreted 100 randomly selected MRI studies from the NYU Langone Health data. 

The team found no statistically significant difference between the results of the radiologists and the AI system. Averaging the AI and radiologist predictions together increased overall accuracy by at least 5%, suggesting a hybrid approach may be most beneficial.

The model was also equally accurate across patients with various subtypes of cancer, including less common malignancies. Patient demographics, such as age and race, did not influence the AI system, despite the limited training data available for some groups.

The model output can also be combined with the personal preference of a clinician or patient deciding whether to pursue a biopsy after a suspicious finding. By default, all suspicious lesions classified with category BI-RADS 4 are recommended for biopsy, which leads to a substantial number of false positives. The AI model predictions can help avoid benign biopsies in up to 20% of all BI-RADS category 4 patients.

The authors note that the work has a few limitations, including a limited understanding of how a hybrid approach could impact a radiologist’s decision in a hospital setting and of how the model makes its predictions.

“Although we only looked at retrospective data in our study, the results are strong enough that we are confident in the accuracy of the model. We look forward to further translating this work into clinical practice, deploying our AI systems in real life, and improving breast cancer diagnostics for both doctors and patients,” said study lead author Jan Witowski, a postdoctoral research fellow at NYU School of Medicine.

Read the study Improving breast cancer diagnostics with artificial intelligence for MRI in Science Translational Medicine.

Categories
Misc

Powering NVIDIA-Certified Enterprise Systems with Arm CPUs


Organizations are rapidly becoming more advanced in the use of AI, and many are looking to leverage the latest technologies to maximize workload performance and efficiency. One of the most prevalent trends today is the use of CPUs based on Arm architecture to build data center servers. 

To ensure that these new systems are enterprise-ready and optimally configured, NVIDIA has approved the first NVIDIA-Certified Systems with Arm CPUs and NVIDIA GPUs. This post presents the benefits of NVIDIA-Certified Arm systems, and what customers should expect to see in the near future.

Using Arm architecture for HPC

Arm-based systems are common for edge applications. They are already widely used by large-scale cloud service providers, and are starting to become more popular for data center applications. According to Gartner®, 12% of new servers for high-performance computing (HPC) will be Arm-based by 2025 [1].

Systems based on Arm architecture have the ability to run many cores with high energy efficiency, along with high memory bandwidth and low latency. In fact, recent results for the MLPerf benchmarks show Arm systems delivering almost the same performance for inference as x86-based systems, with one test showing the Arm-based server outperforming a similar x86 system.

The certification by NVIDIA of Arm-based systems is the culmination of a process that started in 2019, when NVIDIA ported the CUDA-X libraries to Arm. This paved the way for NVIDIA partners to start building energy-efficient, AI-enabled systems. NVIDIA also partnered with GIGABYTE in 2021 to develop and offer the Arm HPC Developer Kit.

Now, NVIDIA Certification will help businesses choose the best enterprise-grade systems.

NVIDIA-Certified Arm systems

NVIDIA-Certified Systems offer NVIDIA GPUs and NVIDIA high-speed, secure network adapters from leading NVIDIA partners in configurations validated for optimum performance, manageability, and scale. Announced at the beginning of 2021, the program gives customers and partners confidence to choose enterprise-grade hardware solutions to power their accelerated computing workloads—from the desktop to the data center and edge.

More than 200 certified systems are now available—covering data center, desktop, and edge—from over 30 partners. NVIDIA-Certified Systems have excellent performance on a range of modern accelerated computing workloads, including AI and data science, 3D computing and visualization, and HPC. 

The certification also validates key enterprise capabilities, including management, security, and scalability, ensuring that certified systems can take full advantage of powerful NVIDIA software.

GIGABYTE: The first Arm-ready certified system

The first NVIDIA-Certified Arm system is the GIGABYTE G242-P33, which features the Neoverse-based Ampere Altra processor and up to four NVIDIA A100 Tensor Core GPUs. GIGABYTE has been part of the NVIDIA-Certified Systems program since its inception, and now offers more than 15 NVIDIA-Certified Systems. 

“Qualifying Arm-based servers for NVIDIA accelerators continues to be one of GIGABYTE’s top priorities, and with NVIDIA-Certified Systems we will take the performance validation a step further to not only support the new NVIDIA H100 but also to include NVIDIA BlueField-2 DPU and InfiniBand products,” said Etay Lee, CEO of GIGABYTE. 

“Customers want an Arm-ready solution that comes with a wealth of NVIDIA resources and support to achieve faster insights,” Lee added. “That is what our Ampere Altra servers have delivered, starting with our server for the NVIDIA Arm HPC Developer Kit.”

As the Arm architecture becomes more widely adopted in data centers, it will be important to choose systems that are optimally configured. This is particularly true for Arm systems equipped with GPUs and high-speed networking, since this architecture is new to many enterprises.

Customers might not have the expertise to design such a system properly, but NVIDIA-Certified Systems provide them with an easy way to make the best choices. To find Arm-based certified systems, see the Qualified Systems Catalog. The catalog will grow as more systems are certified.

[1] Gartner, “Forecast Analysis: Arm-Based Servers, Worldwide,” G00755363, November 2021.

GARTNER is a registered trademark and service mark of Gartner, Inc. and/or its affiliates in the U.S. and internationally and is used herein with permission. All rights reserved.

Categories
Misc

Finding Out Where Your Application and Network Intersect


Modern data centers can run thousands of services and applications. When an issue occurs, as a network administrator, you are guilty by default. You have to prove your innocence on a daily basis, as it is easy to blame the network. It is an unfair world.

Correlating application performance issues to the network is hard to do. You can start by checking basic connectivity with simple pings or traceroutes, consulting your SNMP-based monitoring tools or sniffers, or even reading device counters to look for drops. In the meantime, users suffer from application slowness, poor performance, or even unavailability.

Unfortunately, all these classic network troubleshooting methods are time-consuming and don’t guarantee success, as it is sometimes nearly impossible to pinpoint problems using them.

NetQ to the rescue

To facilitate network troubleshooting, NVIDIA developed NetQ—a scalable, modern network operations toolset that provides network visibility in real time.

The NetQ team recently introduced a unique flow analysis tool to provide further visibility enhancements. Flow analysis allows network administrators to instantly correlate service traffic flows to the paths taken in the fabric, dramatically reducing the mean time to innocence (MTTI), or even confirming that there is no network issue at all.

Flow analysis enables you to discover and visualize all paths that a specific application’s traffic flow takes between endpoints in the fabric. It monitors the fabric-wide latency and buffer utilization statistics. With EVPN and multi-tenancy becoming the standard solution in most modern data centers, the flow analysis tool was designed to sample TCP or UDP data on overlay and underlay networks within different VRFs.

Flow analysis becomes even more powerful when used with What Just Happened (WJH) ASIC telemetry. While flows are being analyzed, flow-related WJH events from all switches in the traffic paths are presented, helping you discover whether drops caused the service issue. Together, these two features maximize the probability of pinpointing the actual problem affecting an application.

Screen shot of the dashboard showing latency results and a flow graph.
Figure 1. NetQ flow analysis dashboard

By the numbers

Flow analysis is supported on NVIDIA Spectrum-2 and later switches running Cumulus Linux 5.0 or later. It can also provide partial-path discovery for brownfield deployments with unsupported switches or switches running older versions of Cumulus Linux or SONiC.

Flow analysis samples traffic based on the packet’s 4-tuple or 5-tuple, including the VXLAN inner and outer headers. Its sampling lifetime is limited to 10, 15, 20, or 30 minutes. You can decide whether to run it on creation or schedule it for a later time.

The sample rate granularity is also configurable to low (1 per 10000), medium (1 per 1000), high (1 per 100), or all packets (1 per 1). The higher the sampling rate, the more accurate your analyzed data. A higher sampling rate results in higher CPU utilization, so I recommend setting lower sampling rates for heavy traffic flows.
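To get a feel for that trade-off, the short calculation below estimates how many packets per second each sampling level would have to process at a hypothetical flow rate of one million packets per second (the traffic figure is purely illustrative, not NetQ output):

```python
# Back-of-the-envelope check (hypothetical traffic figure, not NetQ output):
# how many packets per second each sampling level would have to process.
PACKETS_PER_SECOND = 1_000_000                 # assumed flow rate: 1M packets/s

SAMPLE_RATES = {"low": 1 / 10_000, "medium": 1 / 1_000, "high": 1 / 100, "all": 1.0}
for level, rate in SAMPLE_RATES.items():
    print(f"{level:>6}: ~{PACKETS_PER_SECOND * rate:,.0f} sampled packets/s")
```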

Try it yourself in NVIDIA Air

NVIDIA Air is a tool for creating data center digital twins. With Air, you can build your own Cumulus Linux virtual data center, test it, validate it with NetQ, explore features, and learn some best practices. It is entirely free to use!

Try out flow analysis by spinning up the prebuilt NVIDIA Air Infrastructure Simulation Platform demo in the Air Marketplace. Follow the guided tour and see the significant benefits that flow analysis with NetQ can bring to your organization.

For more information, see the following resources:

Categories
Misc

Video Virtuoso Sabour Amirazodi Shares AI-Powered Editing Tips This Week ‘In the NVIDIA Studio’

NVIDIA artist Sabour Amirazodi demonstrates his video editing workflows featuring AI this week in a special edition of In the NVIDIA Studio.


Categories
Offsites

Quantization for Fast and Environmentally Sustainable Reinforcement Learning

Deep reinforcement learning (RL) continues to make great strides in solving real-world sequential decision-making problems such as balloon navigation, nuclear physics, robotics, and games. Despite its promise, one of its limiting factors is long training times. The current approach to speeding up RL training on complex and difficult tasks leverages distributed training that scales up to hundreds or even thousands of computing nodes, but it still requires significant hardware resources, which makes RL training expensive and increases its environmental impact. However, recent work [1, 2] indicates that performance optimizations on existing hardware can reduce the carbon footprint (i.e., total greenhouse gas emissions) of training and inference.

RL can also benefit from similar system optimization techniques that can reduce training time, improve hardware utilization and reduce carbon dioxide (CO2) emissions. One such technique is quantization, a process that converts full-precision floating point (FP32) numbers to lower precision (int8) numbers and then performs computation using the lower precision numbers. Quantization can save memory storage cost and bandwidth for faster and more energy-efficient computation. Quantization has been successfully applied to supervised learning to enable edge deployments of machine learning (ML) models and achieve faster training. However, there remains an opportunity to apply quantization to RL training.
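As a rough illustration of the idea (a sketch, not the paper’s implementation), uniform quantization maps FP32 values onto int8 with a scale and zero point, cutting the payload to a quarter of its size at the cost of a small reconstruction error:

```python
# A minimal sketch of uniform quantization (not the paper's exact implementation):
# map FP32 values to int8 with a scale and zero point, then dequantize.
import numpy as np

def quantize_uniform(x, num_bits=8):
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = np.round(qmin - x.min() / scale)
    q = np.clip(np.round(x / scale + zero_point), qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize_uniform(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

weights = np.random.randn(256, 64).astype(np.float32)    # toy FP32 policy weights
q, scale, zp = quantize_uniform(weights)
recon = dequantize_uniform(q, scale, zp)
print(q.nbytes / weights.nbytes)        # 0.25: int8 payload is 4x smaller
print(np.abs(weights - recon).max())    # small reconstruction error
```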

To that end, we present “QuaRL: Quantization for Fast and Environmentally Sustainable Reinforcement Learning”, published in the Transactions on Machine Learning Research journal, which introduces a new paradigm called ActorQ that applies quantization to speed up RL training by 1.5-5.4x while maintaining performance. Additionally, we demonstrate that compared to training in full-precision, the carbon footprint is also significantly reduced by a factor of 1.9-3.8x.

Applying Quantization to RL Training
In traditional RL training, a learner policy is applied to an actor, which uses the policy to explore the environment and collect data samples. The samples collected by the actor are then used by the learner to continuously refine the initial policy. Periodically, the policy trained on the learner side is used to update the actor’s policy. To apply quantization to RL training, we develop the ActorQ paradigm. ActorQ performs the same sequence described above, with one key difference being that the policy update from learner to actors is quantized, and the actor explores the environment using the int8 quantized policy to collect samples.

Applying quantization to RL training in this fashion has two key benefits. First, it reduces the memory footprint of the policy. For the same peak bandwidth, less data is transferred between learners and actors, which reduces the communication cost for policy updates from learners to actors. Second, the actors perform inference on the quantized policy to generate actions for a given environment state. The quantized inference process is much faster when compared to performing inference in full precision.

An overview of traditional RL training (left) and ActorQ RL training (right).

In ActorQ, we use the ACME distributed RL framework. The quantizer block performs uniform quantization that converts the FP32 policy to int8. The actor performs inference using optimized int8 computations. Though we use uniform quantization when designing the quantizer block, we believe that other quantization techniques can replace uniform quantization and produce similar results. The samples collected by the actors are used by the learner to train a neural network policy. Periodically the learned policy is quantized by the quantizer block and broadcasted to the actors.
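The control flow can be sketched as follows (a structural toy reusing the quantization helpers above, with placeholder environment interaction and learner updates rather than the actual distributed ACME implementation):

```python
# Structural sketch of the ActorQ loop, reusing quantize_uniform and
# dequantize_uniform from the previous snippet. The real system runs distributed
# ACME learners and actors; the rollout and learner update below are
# placeholders that only illustrate where quantization happens.
import numpy as np

def actor_rollout(q_weights, scale, zero_point, num_steps=32):
    # Actor side: reconstruct the broadcast int8 policy (a real actor would run
    # optimized int8 inference) and collect experience; dummy samples here.
    policy = dequantize_uniform(q_weights, scale, zero_point)
    return np.tanh(policy[:num_steps, 0])        # stand-in for collected samples

def learner_step(weights, samples, lr=1e-3):
    # Learner side: placeholder for a full-precision (FP32) gradient update.
    return (weights + lr * np.mean(samples)).astype(np.float32)

weights = np.random.randn(256, 64).astype(np.float32)   # toy FP32 policy
for iteration in range(100):
    q_w, scale, zp = quantize_uniform(weights)   # quantizer block: FP32 -> int8
    samples = actor_rollout(q_w, scale, zp)      # actors explore with int8 policy
    weights = learner_step(weights, samples)     # learner trains in FP32
```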

Quantization Improves RL Training Time and Performance
We evaluate ActorQ in a range of environments, including the Deepmind Control Suite and the OpenAI Gym. We demonstrate the speed-up and improved performance of D4PG and DQN. We chose D4PG as it was the best learning algorithm in ACME for Deepmind Control Suite tasks, and DQN is a widely used and standard RL algorithm.

We observe a significant speedup (between 1.5x and 5.41x) in training RL policies. More importantly, performance is maintained even when actors perform int8 quantized inference. The figures below demonstrate this for the D4PG and DQN agents for Deepmind Control Suite and OpenAI Gym tasks.

A comparison of RL training using the FP32 policy (q=32) and the quantized int8 policy (q=8) for D4PG agents on various Deepmind Control Suite tasks. Quantization achieves speed-ups of 1.5x to 3.06x.
A comparison of RL training using the FP32 policy (q=32) and the quantized int8 policy (q=8) for DQN agents in the OpenAI Gym environment. Quantization achieves a speed-up of 2.2x to 5.41x.

Quantization Reduces Carbon Emission
Applying quantization in RL using ActorQ improves training time without affecting performance. The direct consequence of using the hardware more efficiently is a smaller carbon footprint. We measure the carbon footprint improvement by taking the ratio of carbon emission when using the FP32 policy during training over the carbon emission when using the int8 policy during training.

In order to measure the carbon emission for the RL training experiment, we use the experiment-impact-tracker proposed in prior work. We instrument the ActorQ system with carbon monitor APIs to measure the energy and carbon emissions for each training experiment.

Compared to the carbon emission when running in full precision (FP32), we observe that the quantization of policies reduces the carbon emissions anywhere from 1.9x to 3.76x, depending on the task. As RL systems are scaled to run on thousands of distributed hardware cores and accelerators, we believe that the absolute carbon reduction (measured in kilograms of CO2) can be quite significant.

Carbon emission comparison between training using a FP32 policy and an int8 policy. The X-axis scale is normalized to the carbon emissions of the FP32 policy. Shown by the red bars greater than 1, ActorQ reduces carbon emissions.

Conclusion and Future Directions
We introduce ActorQ, a novel paradigm that applies quantization to RL training and achieves speed-up improvements of 1.5-5.4x while maintaining performance. Additionally, we demonstrate that ActorQ can reduce RL training’s carbon footprint by a factor of 1.9-3.8x compared to training in full-precision without quantization.

ActorQ demonstrates that quantization can be effectively applied to many aspects of RL, from obtaining high-quality and efficient quantized policies to reducing training times and carbon emissions. As RL continues to make great strides in solving real-world problems, we believe that making RL training sustainable will be critical for adoption. As we scale RL training to thousands of cores and GPUs, even a 50% improvement (as we have experimentally demonstrated) will generate significant savings in absolute dollar cost, energy, and carbon emissions. Our work is the first step toward applying quantization to RL training to achieve efficient and environmentally sustainable training.

While our design of the quantizer in ActorQ relied on simple uniform quantization, we believe that other forms of quantization, compression, and sparsity can be applied (e.g., distillation and sparsification). We hope that future work will consider applying more aggressive quantization and compression methods, which may yield additional benefits to the performance and accuracy tradeoff obtained by the trained RL policies.

Acknowledgments
We would like to thank our co-authors Max Lam, Sharad Chitlangia, Zishen Wan, and Vijay Janapa Reddi (Harvard University), and Gabriel Barth-Maron (DeepMind), for their contribution to this work. We also thank the Google Cloud team for providing research credits to seed this work.