Categories
Offsites

Toward Generalized Sim-to-Real Transfer for Robot Learning

Reinforcement and imitation learning methods in robotics research can enable autonomous environmental navigation and efficient object manipulation, which in turn opens up a breadth of useful real-life applications. Previous work has demonstrated how robots that learn end-to-end using deep neural networks can reliably and safely interact with the unstructured world around us by comprehending camera observations to take actions and solve tasks. However, while end-to-end learning methods can generalize and scale for complicated robot manipulation tasks, they require hundreds of thousands real world robot training episodes, which can be difficult to obtain. One can attempt to alleviate this constraint by using a simulation of the environment that allows virtual robots to learn more quickly and at scale, but the simulations’ inability to exactly match the real world presents a challenge c ommonly referred to as the sim-to-real gap. One important source of the gap comes from discrepancies between the images rendered in simulation and the real robot camera observations, which then causes the robot to perform poorly in the real world.

To-date, work on bridging this gap has employed a technique called pixel-level domain adaptation, which translates synthetic images to realistic ones at the pixel level. One example of this technique is GraspGAN, which employs a generative adversarial network (GAN), a framework that has been very effective at image generation, to model this transformation between simulated and real images given datasets of each domain. These pseudo-real images correct some sim-to-real gap, so policies learned with simulation execute more successfully on real robots. A limitation for their use in sim-to-real transfer, however, is that because GANs translate images at the pixel-level, multi-pixel features or structures that are necessary for robot task learning may be arbitrarily modified or even removed.

To address the above limitation, and in collaboration with the Everyday Robot Project at X, we introduce two works, RL-CycleGAN and RetinaGAN, that train GANs with robot-specific consistencies — so that they do not arbitrarily modify visual features that are specifically necessary for robot task learning — and thus bridge the visual discrepancy between sim and real. We demonstrate how these consistencies preserve features critical to policy learning, eliminating the need for hand-engineered, task-specific tuning, which in turn allows for this sim-to-real methodology to work flexibly across tasks, domains, and learning algorithms. With RL-CycleGAN, we describe our sim-to-real transfer methodology and demonstrate state-of-the-art performance on real world grasping tasks trained with RL. With RetinaGAN, we extend our approach to include imitation learning with a door opening task.

RL-CycleGAN
In “RL-CycleGAN: Reinforcement Learning Aware Simulation-To-Real”, we leverage a variation of CycleGAN for sim-to-real adaptation by ensuring consistency of task-relevant features between real and simulated images. CycleGAN encourages preservation of image contents by ensuring an adapted image transformed back to the original domain is identical to the original image, which is called cycle consistency. To further encourage the adapted images to be useful for robotics, the CycleGAN is jointly trained with a reinforcement learning (RL) robot agent that ensures the robot’s actions are the same given both the original images and those after GAN-adaptation. That is, task-specific features like robot arm or graspable object locations are unaltered, but the GAN may still alter lighting or textural differences between domains that do not affect task-level decisions.

Evaluating RL-CycleGAN
We evaluated RL-CycleGAN on a robotic indiscriminate grasping task. Trained on 580,000 real trials and simulations adapted with RL-CycleGAN, the robot grasps objects with 94% success, surpassing the 89% success rate of the prior state-of-the-art sim-to-real method GraspGAN and the 87% mark using real-only data without simulation. With only 28,000 trials, the RL-CycleGAN method reaches 86%, comparable to the previous baselines with 20x the data. Some examples of the RL-CycleGAN output alongside the simulation images are shown below.

Comparison between simulation images of robot grasping before (left) and after RL-CycleGAN translation (right).

RetinaGAN
While RL-CycleGAN reliably transfers from sim-to-real for the RL domain using task awareness, a natural question arises: can we develop a more flexible sim-to-real transfer technique that applies broadly to different tasks and robot learning techniques?

In “RetinaGAN: An Object-Aware Approach to Sim-to-Real Transfer”, presented at ICRA 2021, we develop such a task-decoupled, algorithm-decoupled GAN approach to sim-to-real transfer by instead focusing on robots’ perception of objects. RetinaGAN enforces strong object-semantic awareness through perception consistency via object detection to predict bounding box locations for all objects on all images. In an ideal sim-to-real model, we expect the object detector to predict the same box locations before and after GAN translation, as objects should not change structurally. RetinaGAN is trained toward this ideal by backpropagation, such that there is consistency in perception of objects both when a) simulated images are transformed from simulation to real and then back to simulation and b) when real images are transformed from real to simulation and then back to real. We find this object-based consistency to be more widely applicable than the task-specific consistency required by RL-CycleGAN.

Diagram of RetinaGAN stages. The simulated image (top left) is transformed by the sim-to-real generator and subsequently by the real-to-sim generator. The real image (bottom left) undergoes the transformation in reverse order. Having separate pipelines that start with the simulated and real images improves the GAN’s performance.

Evaluating RetinaGAN on a Real Robot
Given the goal of building a more flexible sim-to-real transfer technique, we evaluate RetinaGAN in multiple ways to understand for which tasks and under what conditions it accomplishes sim-to-real transfer.

We first apply RetinaGAN to a grasping task. As demonstrated visually below, RetinaGAN emphasizes the translation of realistic object textures, shadows, and lighting, while maintaining the visual quality and saliency of the graspable objects. We couple a pre-trained RetinaGAN model with the distributed reinforcement learning method Q2-Opt to train a vision-based task model for instance grasping. On real robots, this policy grasps object instances with 80% success when trained on a hundred thousand episodes — outperforming prior adaptation methods RL-CycleGAN and CycleGAN (both achieving ~68%) and training without domain adaptation (grey bars below: 19% with sim data, 22% with real data, and 54% with mixed data). This gives us confidence that perception consistency is a valuable strategy for sim-to-real transfer. Further, with just 10,000 training episodes (8% of the data), the RL policy with RetinaGAN grasps with 66% success, demonstrating performance of prior methods with significantly less data.

Evaluation performance of RL policies on instance grasping, trained with various datasets and sim-to-real methods. Low-Data RetinaGAN uses 8% of the real dataset.
The simulated grasping environment (left) is translated to a realistic image (right) using RetinaGAN.

Next, we pair RetinaGAN with a different learning method, behavioral cloning, to open conference room doors given demonstrations by human operators. Using images from both simulated and real demonstrations, we train RetinaGAN to translate the synthetic images to look realistic, bridging the sim-to-real gap. We then train a behavior cloning model to imitate the task-solving actions of the human operators within real and RetinaGAN-adapted sim demonstrations. When evaluating this model by predicting actions to take, the robot enters real conference rooms over 93% of the time, surpassing baselines of 75% and below.

Both of the above images show the same simulation, but RetinaGAN translates simulated door opening images (left) to look more like real robot sensor data (right).
Three examples of the real robot successfully opening conference room doors using the RetinaGAN-trained behavior cloning policy.

Conclusion
This work has demonstrated how additional constraints on GANs may address the visual sim-to-real gap without requiring task-specific tuning; these approaches reach higher real robot success rates with less data collection. RL-CycleGAN translates synthetic images to realistic ones with an RL-consistency loss that automatically preserves task-relevant features. RetinaGAN is an object-aware sim-to-real adaptation technique that transfers robustly across environments and tasks, agnostic to the task learning method. Since RetinaGAN is not trained with any task-specific knowledge, we show how it can be reused for a novel object pushing task. We hope that work on the sim-to-real gap further generalizes toward solving task-agnostic robotic manipulation in unstructured environments.

Acknowledgements
Research into RL-CycleGAN was conducted by Kanishka Rao, Chris Harris, Alex Irpan, Sergey Levine, Julian Ibarz, and Mohi Khansari. Research into RetinaGAN was conducted by Daniel Ho, Kanishka Rao, Zhuo Xu, Eric Jang, Mohi Khansari, and Yunfei Bai. We’d also like to give special thanks to Ivonne Fajardo, Noah Brown, Benjamin Swanson, Christopher Paguyo, Armando Fuentes, and Sphurti More for overseeing the robot operations. We thank Paul Wohlhart, Konstantinos Bousmalis, Daniel Kappler, Alexander Herzog, Anthony Brohan, Yao Lu, Chad Richards, Vincent Vanhoucke, and Mrinal Kalakrishnan, Max Braun and others in the Robotics at Google team and the Everyday Robot Project for valuable discussions and help.

Categories
Misc

High-Performance Python Communication with UCX-Py

TL;DR UCX/UCX-Py is an accelerated networking library designed for low-latency high-bandwidth transfers for both host and GPU device memory objects.  You can easily get started by installing through conda (limited to linux-64): Introduction RAPIDS is committed to delivering the highest achievable performance for the PyData ecosystem. There are now numerous GPU libraries for high-performance and … Continued

This post was originally published on the RAPIDS AI Blog.

TL;DR

UCX/UCX-Py is an accelerated networking library designed for low-latency high-bandwidth transfers for both host and GPU device memory objects.  You can easily get started by installing through conda (limited to linux-64):

> conda install -c rapidsai ucx-py ucx

Introduction

RAPIDS is committed to delivering the highest achievable performance for the PyData ecosystem. There are now numerous GPU libraries for high-performance and scalable alternatives to underlying common data science and analytics workflows. cuDF, which implements the Pandas DataFrame API, and cuML, which implements scikit-learn’s machine learning API, are two such accelerated libraries. These “scale-up” solutions, together with Dask, enable significant end-to-end performance improvements by combining state-of-the-art GPUs with distributed processing. However, with any scale-out solution, the bottleneck for distributed computation often lies with communication. We present here how easy it is to get started with our new Python communication library UCX-Py, based on OpenUCX, the Unified Communication X open framework, along with the current benchmarks.

UCX-Py allows RAPIDS to take advantage of hardware interconnects available on the system, such as NVLink (for direct GPU-GPU communication) and InfiniBand (for direct GPU-Fiber communication), thus providing higher bandwidth and lower latency for the application. UCX-Py does not only enable communication with NVLink and InfiniBand — including GPUDirectRDMA capability — but all transports supported by OpenUCX, including TCP and shared memory.

UCX-Py is a high-level library, meaning users are not required to do any complex message passing to be able to use it. The user triggers send on one process and receives on a different process, it’s that simple. More importantly, applications which already scale using Dask, can easily start using UCX with a small, two line change to existing code. Just switch to the UCX protocol, instead of the default TCP protocol.

Just like RAPIDS, UCX-Py is open source and currently hosted at https://github.com/rapidsai/ucx-py. Always up-to-date documentation is available in https://ucx-py.readthedocs.io/en/latest/.

Using UCX-Py with Dask

RAPIDS allows both multi-node and multi-GPU scalability by utilizing Dask, therefore Dask was the first and simplest use case for UCX-Py. While RAPIDS users may already be familiar with the dask-cuda package, one very simple use case is to start a cluster using all GPUs available in the system and connect a Dask client to it. Dask users often start a cluster as follows:

from dask_cuda import LocalCUDACluster
from distributed import Client
 
cluster = LocalCUDACluster(protocol="tcp")
client = Client(cluster)

The code above utilizes Python sockets for all communication over the TCP protocol. To switch to the UCX protocol for communication, there are two changes needed in the LocalCUDACluster constructor:

  • Specify the UCX protocol;
  • Specify the transport we want to use.

TCP over UCX

The simplest use of the UCX protocol in Dask is enabling TCP over UCX, and it looks as follows:

cluster = LocalCUDACluster(protocol="ucx", enable_tcp_over_ucx=True)

The change above does not need any special hardware and can be used in any machine that’s currently using dask-cuda (it also applies to LocalCluster in dask/distributed) as long as the UCX-Py package has been installed as well.

NVIDIA NVLink and NVSwitch

Now that we know how we can get started with UCX-Py in Dask, we can move on to more interesting cases. The first one is using NVLink (including NVSwitch if available) for all GPU-GPU communication, and we only add one more flag as seen below:

cluster = LocalCUDACluster(protocol="ucx", enable_tcp_over_ucx=True, enable_nvlink=True)

Note that in the above we also enable TCP over UCX, and it is a required step to allow NVLink. As a facility to users, the code above is equivalent to the one below (enable_tcp_over_ucx=True is implicit):

cluster = LocalCUDACluster(protocol="ucx", enable_nvlink=True)

NVIDIA InfiniBand

The last case we will discuss here is enabling InfiniBand. At this stage, you probably guessed that enabling it requires an enable_infiniband=True flag, and you guessed right. However, since InfiniBand requires communication between a GPU and an InfiniBand device, we need to specify the name of the correct InfiniBand interface to be used by each GPU. Suppose there is an InfiniBand interface named mlx5_0:1 available, a cluster could be created as follows:

cluster = LocalCUDACluster(protocol="ucx", enable_infiniband=True, ucx_net_devices="mlx5_0:1")

From a topological point-of-view, we prefer to select the InfiniBand device closest to the GPU, ensuring highest bandwidth and lowest latency. A complex system such as a DGX-1 or a DGX-2 may contain several GPUs and several InfiniBand interfaces, in such a case we don’t want to simply choose the first one, but the most optimal for that particular system. To address that, we introduced automatic detection of InfiniBand devices:

cluster = LocalCUDACluster(protocol="ucx", enable_infiniband=True, ucx_net_devices="auto")

It is also possible to have a custom configuration passing a function to the ucx_net_devices keyword argument, for that please check the documentation for the LocalCUDACluster class.

TCP, NVLink and InfiniBand

The transports listed above are not mutually exclusive, so you can use all of them together as well:

cluster = LocalCUDACluster(protocol="ucx", enable_tcp_over_ucx=True, enable_nvlink=True, enable_infiniband=True, ucx_net_devices="mlx5_0:1")

UCX with dask-cuda-worker CLI

All the options described above are also available in the dask-cuda-worker CLI. To use UCX with that, you must first start a dask-scheduler and choosing the UCX protocol:

DASK_UCX__CUDA_COPY=True DASK_UCX__TCP=True DASK_UCX__NVLINK=True DASK_UCX__INFINIBAND=True DASK_UCX__NET_DEVICES=mlx5_0:1 dask-scheduler --protocol="ucx"

The command above will start a scheduler and tell its address, which will be something such as ucx://10.10.10.10:8786 (note the protocol is now ucx://, as opposed to the default tcp://). Note also that we have to set Dask environment variables to that command to mimic the Dask-CUDA flags to enable each individual transport, this is because dask-scheduler is part of the mainline Dask Distributed project where we can’t overload it with all the same CLI arguments that we can with dask-cuda-worker, for more details on UCX configuration with Dask-CUDA, please see the Dask-CUDA documentation on Enabling UCX Communication.

We will then connect dask-cuda-worker to that address, and since the scheduler uses the ucx:// protocol, workers will too without any --protocol="ucx". It is still necessary to specify what transports workers need to use:

dask-cuda-worker ucx://10.10.10.10:8786 --enable-tcp-over-ucx --enable-nvlink --enable-infiniband --ucx-net-devices="auto"

Benchmarks

We want to analyze the performance gains of UCX-Py in two scenarios with a variety of accelerated devices. First, how does UCX-Py perform when passing host and device buffers between two endpoints for both InfiniBand and NVLink. Second, how does UCX-Py perform in a common data analysis workload.

When benchmarking 1GB messages between endpoints over NVLink we measured bandwidth of 46.5 GB/s — quite close to the theoretical limit of NVLink: 50 GB/s. This shows that the addition of a Python layer to UCX core (written in C) does not significantly impact performance. Using a similar setup, we configured UCX/UCX-Py to pass messages over InfiniBand. Measurements present a bandwidth of 11.2 GB/s in that case, which is also similarly near the hardware limit (12.5 GB/s) of Connect X-4 Mellanox InfiniBand controllers.

As the numbers above suggest, passing single message GPU objects between endpoints is very efficient and we now ask: how does UCX-Py perform in the context of a common workflow?  Dataframe merges are good candidates to benchmark as they are extremely common and require a high degree of communication. Additionally, a merge, performed by Dask-cuDF, will also test critical operations in RAPIDS and Dask libraries which can degrade communications, most notably serialization.

The graph in Figure 1 shows the cuDF merge benchmark performance on a DGX1 with all 8 GPUs in use. This workflow generates two dask-cudf dataframes with random, equally distributed data, and merges both dataframes into one.

 Measured bandwidth of cuDF merge benchmark with UCX-Py on a DGX-1 server. Utilizing both InfiniBand and NVLink transports reaches just over 18 GB/s, InfiniBand without NVLink reaches 12.6 GB/s, NVLink with TCP achieves 4.3 GB/s, TCP over UCX throughput is of 2.8 GB/s and TCP with Python Sockets (meaning no UCX involved) only achieve 970 MB/s.
Figure 1: Measured bandwidth of cuDF merge benchmark with UCX-Py on a DGX-1 server.

Overall we see improvements using UCX-Py compared to using Python Sockets for TCP communication. However, there are some unexpected results. If NVLink is capable of 32GB/s and InfiniBand is 12GB/s, why then does InfiniBand perform better than NVLink? The answer is: topology! The layout of hardware interconnection is of great importance and can often be complex.

Represents the complex networking topology of a DGX-1 server
Figure 2: Networking topology of a DGX-1 server.

Figure 2 presents the layout of an NVIDIA DGX-1 server — we can think of this single box as a small supercomputer. The DGX-1 has eight P100 GPUs, four InfiniBand devices, and two CPUs. What is the most efficient route to take when passing data between GPU0 and GPU1? What about GPU2 and GPU7? GPU0 and GPU1 are connected via NVLink but GPU2 and GPU7 have direct communication lines, therefore, when passing data, we must perform two costly operations: Device-to-Host (moving data from GPU2 to the host) and Host-to-Device (moving data from the host to GPU7), and transferring data through main memory is expensive. Fortunately, a DGX-1 allows transfers between two GPUs that are not NVLink connected to use a more efficient transfer method than the costly operations described. With InfiniBand and GPUDirectRDMA, two GPUs can communicate without touching the main memory at all, allowing GPU7 to directly read GPU2’s memory via the InfiniBand interconnect only. 

Referring back to Figure 1, you can now see why only using InfiniBand outperforms the NVLink connection alone — Host-to-Device/Device-to-Host transfers are costly! However, when combining InfiniBand with NVLink we achieve more optimal performance. 

Further improvements

UCX-Py is under constant development, which means changes may occur without notice. Our ultimate commitment is to deliver high-performance communication to Python in a high-quality library package. With that said, we expect in the near future to keep on improving stability and ease-of-use, enabling users to benefit from enhanced performance with little to no effort to configure UCX-Py, thus automatically enabling accelerated communication hardware.

So far, we have tested UCX-Py mostly in DGX-1 and DGX-2 systems, where point-to-point and end-to-end performance was demonstrated. We scaled our testing up to one hundred Dask CUDA workers. We are constantly working to improve performance and stability, particularly for large scale workflows, and encourage users to engage with our team on GitHub and extend the discussion on UCX-Py.

Categories
Misc

Coming June to GFN Thursday: 38 More Games Streaming on GeForce NOW This Month

What a gamer wants. What a gamer needs. Games that’ll make you happy and cloud gaming that’ll set you free … to play your favorite PC games anywhere, anytime, on GeForce NOW. Get a sneak peek of all the exciting games coming in June this GFN Thursday. Starting with titles launching today. Keeping It Fresh Read article >

The post Coming June to GFN Thursday: 38 More Games Streaming on GeForce NOW This Month appeared first on The Official NVIDIA Blog.

Categories
Misc

NOOB: How to get TensorFlow model to run on Jetson Nano?

Hello –

I built a model (for ANPR) using TensorFlow and EasyOCR. Over the past week or so, getting TensorFlow to install on the Jetson Nano has been next to impossible. Tons of issues with it (some are documented) and overall I found one person that was able to get it running well which took over 50hrs to install on the Jetson Nano.

This said, there has to be a better way to get a TensorFlow model to run on a Jetson Nano. Does anyone know how to go about doing this? Is this where ONNX comes in (if so, any ideas/resources you might be able to point me to that shows how?)

Maybe I should go down another rabbit hole and try to do ANPR with PyTorch as I have read in the NVIDIA forums that PyTorch seems to be a bit easier to run on the Jetson line?

Any/all thoughts/help is much appreciated.

Thanks!

submitted by /u/InstantAmmo
[visit reddit] [comments]

Categories
Misc

Train Model for Pothole Detection to be used on raspberrypi

Hello, i followed random tutorials and tried training my custom model.

can somebody suggest me some tutorial or give me some tips on training an improved model?

any help would greatly be appreciated.

i am a beginner.

submitted by /u/abracadabra61
[visit reddit] [comments]

Categories
Misc

NVIDIA Spotlights GeForce Partners at COMPUTEX

Highlighting deep support from a flourishing roster of GeForce partners, NVIDIA’s Jeff Fisher delivered a virtual keynote at COMPUTEX 2021 in Taipei Tuesday. Fisher, senior vice president of NVIDIA’s GeForce business, announced a pair of powerful new gaming GPUs — the GeForce RTX 3080 Ti and GeForce RTX 3070 Ti — and detailed the fast-growing Read article >

The post NVIDIA Spotlights GeForce Partners at COMPUTEX appeared first on The Official NVIDIA Blog.

Categories
Misc

Multi-GPUs and custom training loops in TensorFlow 2

Multi-GPUs and custom training loops in TensorFlow 2 submitted by /u/ZnVjayBhY25l
[visit reddit] [comments]
Categories
Misc

NVIDIA Announces Upcoming Events for Financial Community

SANTA CLARA, Calif., June 01, 2021 (GLOBE NEWSWIRE) — NVIDIA will present at the following events for the financial community: Evercore ISI Inaugural TMT ConferenceMonday, June 7, at 10:15 …

Categories
Misc

DL server configuration for Unet training

My company is solving very specific task of semantic segmentation using small UNet and we need to speed up training. Our current workstation has single Tesla v100 and we are looking for new workstation with several A100 but can’t measure how much speed up we will get with the increase of GPU number with new Ampere architecture. The second question is do we need NVSwitch or NVLink for training UNet and what speed improvement it would give us. According to our budget we can possibly get DGX A100(40GB A100) or custom configuration without useless options for our task, for example NVSwitch. The only thing I find is NVidia Unet industrial performance but the evaluation has been done on DGX-1 and DGX A100 with NVLink/NVSwitch both so the impact of GPU interconnection is not obvious.

submitted by /u/gogasius
[visit reddit] [comments]

Categories
Misc

Why is it soooo slow to iterate over a specific dataset?

Situation:

I have a dataset (<class ‘tensorflow.python.data.ops.dataset_ops.MapDataset’>) which is a result of some neural network output. In order to get my final prediction, I am currently iterating over it as follows:

for row in dataset: ap_distance, an_distance = row y_pred.append(int(ap_distance.numpy() > an_distance.numpy())) 

The dataset has two columns, each holding a scalar wrapped in a tensor.

Problem:

The loop body is very simple, it takes < 1e-5 seconds to compute. Sadly, one iteration takes ca. 0.3 seconds, so fetching the next row from the dataset seems to take almost 0.3s! This behavior is very weird given my hardware (running on a rented GPU server with 16 AMD EPYC cores and 258GB RAM) and the fact that a colleague on his laptop can finish an iteration in 1-2 orders of magnitude less time than I can. The dataset has ca. 60k rows, hence it is unacceptable to wait for so long.

What I tried:

I tried mapping the above loop body onto every row of the dataset object, but sadly .numpy() is not available inside dataset.map()!

Questions:

  1. Why does it take so long to get a new row from the dataset?
  2. How can I fix this performance degradation?

submitted by /u/iMrFelix
[visit reddit] [comments]