Categories
Misc

XOR Problem with MLP

Hi guys, I’m trying to reproduce this MLP in TensorFlow (constraining it to have only one hidden layer with 2 units). However, just like in TensorFlow Playground, it often fails to converge to the global optimum. I think it is due to weight initialization; I tried Xavier and He initializations but with no success. The following piece of code is the model in tf.keras.

import tensorflow as tf
from tensorflow.keras.layers import Dense
model = tf.keras.models.Sequential([
    Dense(units=2, activation='sigmoid', input_shape=(2,)),
    Dense(units=1, activation='sigmoid'),
])

Any help would be appreciated. Thanks.
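For reference, here is a complete, runnable version of the same 2-2-1 network. The XOR data, optimizer, learning rate, and epoch count below are my own assumptions, not part of the original snippet, and even this setup can still land in a poor local minimum on some seeds.

import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Dense

# XOR inputs and targets.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float32)
y = np.array([[0], [1], [1], [0]], dtype=np.float32)

model = tf.keras.models.Sequential([
    Dense(units=2, activation='sigmoid', input_shape=(2,)),
    Dense(units=1, activation='sigmoid'),
])
# A relatively large learning rate helps this tiny network escape flat regions.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.1),
              loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X, y, epochs=1000, verbose=0)
print(model.predict(X).round())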

https://preview.redd.it/otv9rmbrtr251.png?width=1375&format=png&auto=webp&s=6757653b702f45a728dd032a89368fedbd6968e2

submitted by /u/Idea_Cultural
[visit reddit] [comments]

Categories
Misc

How can tensorflow utilize multiple cores by default if python is limited to one core by default?

All the threads I’ve read online regarding utilizing multiple cores in Python require multiprocessing. By default, Python can only be running on one core (and on one thread, because of the GIL).

On the other hand, I often read that tensorflow by default uses multiple cores. How can this work if python itself is limited to one core?
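One relevant detail: most TensorFlow ops are implemented in C++ and run in the runtime's own thread pools, and the GIL is released while those kernels execute, so the single-thread limit applies to the Python driver code rather than to the computation itself. A minimal sketch (the thread counts are arbitrary examples) of controlling that parallelism from Python:

import tensorflow as tf

# Threads used inside a single op (for example, one large matmul)...
tf.config.threading.set_intra_op_parallelism_threads(4)
# ...and threads used to run independent ops concurrently.
tf.config.threading.set_inter_op_parallelism_threads(2)

print(tf.config.threading.get_intra_op_parallelism_threads(),
      tf.config.threading.get_inter_op_parallelism_threads())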

submitted by /u/hattrickostrich
[visit reddit] [comments]

Categories
Misc

Fast Track AI Model Adaptation with TAO

NVIDIA Train, Adapt, and Optimize (TAO) is an AI model adaptation platform that simplifies and accelerates the creation of enterprise AI applications.

Building a state-of-the-art deep learning model is a complex and time-consuming process. To achieve this, the large datasets collected for the model must be of high quality. Once the data is collected, it must be prepared, and the model must then be trained and optimized over several iterations. This process is often impractical for enterprises looking to bring their AI applications to market faster while reducing operational costs.

NVIDIA TAO is being developed to address these challenges. NVIDIA Train, Adapt, and Optimize (TAO) is an AI model adaptation platform that simplifies and accelerates the creation of enterprise AI applications. By fine-tuning state-of-the-art pre-trained models created by NVIDIA experts with custom data through a UI-based, guided workflow, you can produce highly accurate computer vision, speech, and language understanding models in hours rather than months, eliminating the need for large training runs and deep AI expertise.

As a managed and guided workflow, TAO lowers the barrier to building AI by unifying key existing NVIDIA technologies, such as pre-trained models from the NGC catalog, Transfer Learning Toolkit (TLT), Federated Learning with NVIDIA Clara, and TensorRT.

Registration for the Early Access Program is now open. Later this year, we will begin accepting applicants into the program, which will provide you with an exclusive opportunity to collaborate with the NVIDIA product team to help shape TAO.

Key Highlights of the Early Access Program:

  • Access to state-of-the-art accurate pre-trained models that can be customized for your use case
  • Access to compute infrastructure
  • Hands-on support to help you navigate through the entire process

Apply to the TAO Early Access Program here.

Categories
Misc

Using Physics-Informed Deep Learning for Transport in Porous Media

Simulations are pervasive in every domain of science and engineering, but they are often constrained by large computational times, limited compute resources, tedious manual setup efforts, and the need for technical expertise. NVIDIA SimNet is a simulation toolkit that addresses these challenges with a combination of AI and physics. 

A success story of SimNet’s application today is in modeling the flow and transport in porous media. This effort was led by Cedric Frances, a PhD student at Stanford University.

Use case study

Cedric is researching the applicability and limitations of mesh-free reservoir simulations using physics-informed neural networks (PINNs). He’s keenly interested in the flow and transport problem in porous media (conservation of mass and Darcy flow). Cedric’s application is a Python-based reservoir simulator, which computes the pressure and concentrations of various fluids in a porous media and enables predictions that typically affect large, industrial energy projects. This includes the production of hydrocarbons, storage of carbon dioxide, water disposal, air storage, waste management, and so on.

Researchers previously tried to use the PINNs approach to capture the solution of a hyperbolic problem with nonconvex flux term (Riemann problem) in a forward setting with no data other than initial and boundary conditions. Unfortunately, these attempts were unsuccessful.

Before trying out SimNet, Cedric developed his own implementations of PINNs using Python and deep learning frameworks such as TensorFlow and Keras. He used various network architectures, such as residual networks, GANs, periodic activations, CNNs, PDE-Net, and so on. However, it was difficult to implement all of them to find which one worked best, or worked at all. The emergence of open-source code on GitHub made it easier to try these implementations out, but the high overhead involved in every new implementation (environment setup, hardware configuration, modifying the code to test his own problem, and so on) was not efficient.

Cedric wanted to have a good, unified framework maintained by a team of professional software developers to solve problems that allowed him to focus on the physics of the problem and extensively test the methods that have been recently published. His search for such a framework ended when he stumbled upon SimNet.

Cedric downloaded SimNet and started using fully connected networks with tanh activation functions and spatial weighing of the loss function. He discovered that SimNet’s general-purpose framework with multiple architectures and well-documented examples served as a good starting point. Its ability to emulate solutions with sharp shocks, introduce new dynamic constraints such as entropy and velocity saved him weeks of development. More importantly, it provided a quick turnaround on testing methods to determine their usefulness.

The problem presented here is that of incompressible, immiscible displacement of two phases in a porous medium. This is also referred to as the transport problem and has been delineated in various forms over the years. It has been applied to the displacement of oil by water in waterflood problems in reservoirs for over half a century. More recently, it has been applied to the displacement of brine by CO2 in carbon sequestration applications. For more information, see Mechanism of Fluid Displacement in Sands and Theory of Gas Injection Processes.

Assume that a wetting phase (w) is displacing a nonwetting phase (n). Wettability is the preference of a fluid to be in contact with a solid surrounded by another fluid; for example, water is wetting on most surfaces compared to air. Conservation of mass applies to both phases. For the wetting phase:

\phi \frac{\partial S_w}{\partial t} + \nabla \cdot q_w = 0 \qquad (1)

In this formula, \phi is the porosity, S_w is the saturation (or concentration) of the wetting phase, and S_n is the saturation of the nonwetting phase. The flow rate of the wetting phase q_w can be written as follows:

q_w = -\frac{k\,k_{rw}(S_w)}{\mu_w}\nabla p \qquad (2)

In this formula, k is the absolute permeability, which quantifies the propensity of a material to allow liquid or gas to flow through it, \mu_w is the viscosity of the wetting phase, and k_{rw}(S_w) is the wetting-phase relative permeability, a function of the saturation that characterizes the effective permeability of a given phase in the presence or absence of the other. A phase preferentially flows through a path where it is already present. Think of a drop of water dripping down a window and following existing trails.

You can formulate the phase flux of the wetting phase  q_w as a function of the total flux q = q_w+q_n using a simple homogenization rule:                                               

q = -k\left[\frac{k_{rw}(S_w)}{\mu_w} + \frac{k_{rn}(S_n)}{\mu_n}\right]\nabla p \qquad (3)

You can rewrite this equation as a function of the total flux. This gives rise to the fractional flow:

f_w = \frac{q_w}{q} = \frac{1}{1+\frac{k_{rn}\mu_w}{k_{rw}\mu_n}} \qquad (4)

The conservation equation can now be written:

\frac{\partial S_w}{\partial t} + \frac{q}{\phi}\nabla f_w = 0 \qquad (5)

For a one-dimensional case where you assume that the total flux is equal to one pore volume injected per time step (q = \phi), you can obtain the equation:

\frac{\partial S_w}{\partial t} + \frac{\partial f(S_w)}{\partial x} = 0 \qquad (6)

In this formula, the fractional flow is a nonlinear equation defined as follows:

f_w(S) = \frac{(S - S_{wc})^2}{(S - S_{wc})^2 + (1 - S - S_{nr})^2/M} \qquad (7)

In this formula, S_{wc} and S_{nr} are the residual (irreducible) saturations of the wetting and nonwetting phases resulting from trapping mechanisms, and M is the endpoint mobility ratio, defined from the endpoint relative permeabilities and viscosities of the two phases. We used the Brooks-Corey relative permeability relationship. For more information, see Hydraulic Properties of Porous Media.
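As a quick sanity check on Equation (7), take S_{wc} = S_{nr} = 0 and M = 2 (values chosen only for illustration) and evaluate it at S = 0.5:

f_w(0.5) = \frac{0.5^2}{0.5^2 + 0.5^2/2} = \frac{0.25}{0.375} = \frac{2}{3}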

The partial differential equation solved here is a first-order hyperbolic equation, and the fractional flow term is nonconvex. It belongs to the class of Riemann conservation problems that are typically solved using finite volume methods. For more information, see Hyperbolic systems of conservation laws and the mathematical theory of shock waves.

With uniform Dirichlet boundary conditions:

S(x=0,t) = S_{inj} \qquad (8)

S(x,t=0) = S_{wc} \qquad (9)

You can apply the method of characteristics (MOC) to build an analytical solution to this equation. For the MOC or any finite volume method to be conservative, you must modify the fractional flow term as shown in Figure 1.
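To make that modification concrete, here is a minimal numerical sketch (not Cedric's code; the residual saturations, mobility ratio, and finite-difference derivative are assumptions) that locates the Welge tangent point of Equation (7), which defines the shock front used in the conservative construction:

from scipy.optimize import brentq

S_wc, S_nr, M = 0.0, 0.0, 2.0   # assumed residual saturations and mobility ratio

def f_w(S):
    """Fractional flow, Equation (7)."""
    num = (S - S_wc) ** 2
    return num / (num + (1.0 - S - S_nr) ** 2 / M)

def df_w(S, h=1e-6):
    """Numerical derivative of the fractional flow."""
    return (f_w(S + h) - f_w(S - h)) / (2.0 * h)

def welge_residual(S):
    """Zero where the chord from (S_wc, 0) is tangent to the fractional flow curve."""
    return f_w(S) / (S - S_wc) - df_w(S)

# Shock-front saturation: root of the tangency condition between S_wc and 1 - S_nr.
S_front = brentq(welge_residual, S_wc + 1e-3, 1.0 - S_nr - 1e-3)
print(f"shock-front saturation ~ {S_front:.3f}, "
      f"dimensionless front speed ~ {df_w(S_front):.3f}")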

Figure 1. Fractional flow curve (blue) along with Welge construction (dotted black) for case with Swc=Sor= 0. Source: Physics Informed Deep Learning for Flow and Transport in Porous Media

Until now, no known method had solved such a problem using a sampling approach, so this remained an open question. A previous attempt by Fuks and Tchelepi concluded that physics-informed approaches were not suitable for the problem described (Figure 2).

Figure 2. Results of saturation inference computed using the PINN approach (dashed red) conditioned on the weak form of the hyperbolic problem. The reference solution using MOC is plotted in blue. Source: Physics Informed Deep Learning for Flow and Transport in Porous Media
Figure 3. Results of saturation inference using PINN (dashed red) vs MOC (blue) with velocity constraint and entropy condition. The convex hull of the fractional flow curve is used to model the displacement. Source: Physics Informed Deep Learning for Flow and Transport in Porous Media

Cedric’s research on this topic has now been published: Physics Informed Deep Learning for Flow and Transport in Porous Media.

Important theoretical milestones are now being reached on simple yet challenging 1D examples. Cedric plans on expanding his study to larger dimensions (2D and 3D), where the scalability of the code and the easy deployment on larger arrays will be put to the test. He expects to encounter similar issues and is looking forward to the gain provided by SimNet going from 2D to 3D, for example.

Cedric elaborated on his experience with SimNet. “SimNet’s clear APIs, clean and easily navigable code, environment and hardware configurations well handled with Docker containers, scalability, ease of deployment and the competent support team made it easy to adopt and has provided some very promising results. This has been great so far and we look forward to using SimNet on problems with much larger dimensions.”

To view the GTC’21 session, see Physics-Informed Neural Network for Flow and Transport in Porous Media. For more information about features and to download the toolkit, see NVIDIA SimNet.

Categories
Offsites

Toward Generalized Sim-to-Real Transfer for Robot Learning

Reinforcement and imitation learning methods in robotics research can enable autonomous environmental navigation and efficient object manipulation, which in turn opens up a breadth of useful real-life applications. Previous work has demonstrated how robots that learn end-to-end using deep neural networks can reliably and safely interact with the unstructured world around us by comprehending camera observations to take actions and solve tasks. However, while end-to-end learning methods can generalize and scale for complicated robot manipulation tasks, they require hundreds of thousands of real-world robot training episodes, which can be difficult to obtain. One can attempt to alleviate this constraint by using a simulation of the environment that allows virtual robots to learn more quickly and at scale, but the simulations’ inability to exactly match the real world presents a challenge commonly referred to as the sim-to-real gap. One important source of the gap comes from discrepancies between the images rendered in simulation and the real robot camera observations, which then causes the robot to perform poorly in the real world.

To date, work on bridging this gap has employed a technique called pixel-level domain adaptation, which translates synthetic images to realistic ones at the pixel level. One example of this technique is GraspGAN, which employs a generative adversarial network (GAN), a framework that has been very effective at image generation, to model this transformation between simulated and real images given datasets of each domain. These pseudo-real images correct for some of the sim-to-real gap, so policies learned with simulation execute more successfully on real robots. A limitation for their use in sim-to-real transfer, however, is that because GANs translate images at the pixel level, multi-pixel features or structures that are necessary for robot task learning may be arbitrarily modified or even removed.

To address the above limitation, and in collaboration with the Everyday Robot Project at X, we introduce two works, RL-CycleGAN and RetinaGAN, that train GANs with robot-specific consistencies — so that they do not arbitrarily modify visual features that are specifically necessary for robot task learning — and thus bridge the visual discrepancy between sim and real. We demonstrate how these consistencies preserve features critical to policy learning, eliminating the need for hand-engineered, task-specific tuning, which in turn allows for this sim-to-real methodology to work flexibly across tasks, domains, and learning algorithms. With RL-CycleGAN, we describe our sim-to-real transfer methodology and demonstrate state-of-the-art performance on real world grasping tasks trained with RL. With RetinaGAN, we extend our approach to include imitation learning with a door opening task.

RL-CycleGAN
In “RL-CycleGAN: Reinforcement Learning Aware Simulation-To-Real”, we leverage a variation of CycleGAN for sim-to-real adaptation by ensuring consistency of task-relevant features between real and simulated images. CycleGAN encourages preservation of image contents by ensuring an adapted image transformed back to the original domain is identical to the original image, which is called cycle consistency. To further encourage the adapted images to be useful for robotics, the CycleGAN is jointly trained with a reinforcement learning (RL) robot agent that ensures the robot’s actions are the same given both the original images and those after GAN-adaptation. That is, task-specific features like robot arm or graspable object locations are unaltered, but the GAN may still alter lighting or textural differences between domains that do not affect task-level decisions.
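A minimal sketch of that joint objective (an illustration of the idea, not Google's implementation; the generator and Q-network models are assumed to already exist):

import tensorflow as tf

def rl_cyclegan_losses(sim_images, actions, generator_s2r, generator_r2s, q_network):
    fake_real = generator_s2r(sim_images)      # sim -> "real"
    cycled_sim = generator_r2s(fake_real)      # back to sim

    # Cycle consistency: the round trip should reproduce the original image.
    cycle_loss = tf.reduce_mean(tf.abs(sim_images - cycled_sim))

    # RL consistency: the agent's Q-values should be unchanged by GAN adaptation,
    # so task-relevant features (arm pose, graspable object locations) are preserved.
    q_original = q_network([sim_images, actions])
    q_adapted = q_network([fake_real, actions])
    rl_consistency_loss = tf.reduce_mean(tf.square(q_original - q_adapted))

    return cycle_loss, rl_consistency_loss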

Evaluating RL-CycleGAN
We evaluated RL-CycleGAN on a robotic indiscriminate grasping task. Trained on 580,000 real trials and simulations adapted with RL-CycleGAN, the robot grasps objects with 94% success, surpassing the 89% success rate of the prior state-of-the-art sim-to-real method GraspGAN and the 87% mark using real-only data without simulation. With only 28,000 trials, the RL-CycleGAN method reaches 86%, comparable to the previous baselines with 20x the data. Some examples of the RL-CycleGAN output alongside the simulation images are shown below.

Comparison between simulation images of robot grasping before (left) and after RL-CycleGAN translation (right).

RetinaGAN
While RL-CycleGAN reliably transfers from sim-to-real for the RL domain using task awareness, a natural question arises: can we develop a more flexible sim-to-real transfer technique that applies broadly to different tasks and robot learning techniques?

In “RetinaGAN: An Object-Aware Approach to Sim-to-Real Transfer”, presented at ICRA 2021, we develop such a task-decoupled, algorithm-decoupled GAN approach to sim-to-real transfer by instead focusing on robots’ perception of objects. RetinaGAN enforces strong object-semantic awareness through perception consistency via object detection to predict bounding box locations for all objects on all images. In an ideal sim-to-real model, we expect the object detector to predict the same box locations before and after GAN translation, as objects should not change structurally. RetinaGAN is trained toward this ideal by backpropagation, such that there is consistency in perception of objects both when a) simulated images are transformed from simulation to real and then back to simulation and b) when real images are transformed from real to simulation and then back to real. We find this object-based consistency to be more widely applicable than the task-specific consistency required by RL-CycleGAN.
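As a rough sketch of that perception-consistency term (again an illustration under stated assumptions, not the paper's code; detector is assumed to return a fixed-shape box tensor for each image batch):

import tensorflow as tf

def perception_consistency_loss(sim_images, real_images,
                                generator_s2r, generator_r2s, detector):
    # sim -> real -> sim, and real -> sim -> real, as in the diagram below.
    sim_to_real = generator_s2r(sim_images)
    sim_cycled = generator_r2s(sim_to_real)
    real_to_sim = generator_r2s(real_images)
    real_cycled = generator_s2r(real_to_sim)

    def box_distance(a, b):
        # Predicted boxes should match before and after translation.
        return tf.reduce_mean(tf.abs(detector(a) - detector(b)))

    sim_term = box_distance(sim_images, sim_to_real) + box_distance(sim_images, sim_cycled)
    real_term = box_distance(real_images, real_to_sim) + box_distance(real_images, real_cycled)
    return sim_term + real_term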

Diagram of RetinaGAN stages. The simulated image (top left) is transformed by the sim-to-real generator and subsequently by the real-to-sim generator. The real image (bottom left) undergoes the transformation in reverse order. Having separate pipelines that start with the simulated and real images improves the GAN’s performance.

Evaluating RetinaGAN on a Real Robot
Given the goal of building a more flexible sim-to-real transfer technique, we evaluate RetinaGAN in multiple ways to understand for which tasks and under what conditions it accomplishes sim-to-real transfer.

We first apply RetinaGAN to a grasping task. As demonstrated visually below, RetinaGAN emphasizes the translation of realistic object textures, shadows, and lighting, while maintaining the visual quality and saliency of the graspable objects. We couple a pre-trained RetinaGAN model with the distributed reinforcement learning method Q2-Opt to train a vision-based task model for instance grasping. On real robots, this policy grasps object instances with 80% success when trained on a hundred thousand episodes — outperforming prior adaptation methods RL-CycleGAN and CycleGAN (both achieving ~68%) and training without domain adaptation (grey bars below: 19% with sim data, 22% with real data, and 54% with mixed data). This gives us confidence that perception consistency is a valuable strategy for sim-to-real transfer. Further, with just 10,000 training episodes (8% of the data), the RL policy with RetinaGAN grasps with 66% success, matching the performance of prior methods with significantly less data.

Evaluation performance of RL policies on instance grasping, trained with various datasets and sim-to-real methods. Low-Data RetinaGAN uses 8% of the real dataset.
The simulated grasping environment (left) is translated to a realistic image (right) using RetinaGAN.

Next, we pair RetinaGAN with a different learning method, behavioral cloning, to open conference room doors given demonstrations by human operators. Using images from both simulated and real demonstrations, we train RetinaGAN to translate the synthetic images to look realistic, bridging the sim-to-real gap. We then train a behavior cloning model to imitate the task-solving actions of the human operators within real and RetinaGAN-adapted sim demonstrations. When evaluating this model by predicting actions to take, the robot enters real conference rooms over 93% of the time, surpassing baselines of 75% and below.

Both of the above images show the same simulation, but RetinaGAN translates simulated door opening images (left) to look more like real robot sensor data (right).
Three examples of the real robot successfully opening conference room doors using the RetinaGAN-trained behavior cloning policy.

Conclusion
This work has demonstrated how additional constraints on GANs may address the visual sim-to-real gap without requiring task-specific tuning; these approaches reach higher real robot success rates with less data collection. RL-CycleGAN translates synthetic images to realistic ones with an RL-consistency loss that automatically preserves task-relevant features. RetinaGAN is an object-aware sim-to-real adaptation technique that transfers robustly across environments and tasks, agnostic to the task learning method. Since RetinaGAN is not trained with any task-specific knowledge, we show how it can be reused for a novel object pushing task. We hope that work on the sim-to-real gap further generalizes toward solving task-agnostic robotic manipulation in unstructured environments.

Acknowledgements
Research into RL-CycleGAN was conducted by Kanishka Rao, Chris Harris, Alex Irpan, Sergey Levine, Julian Ibarz, and Mohi Khansari. Research into RetinaGAN was conducted by Daniel Ho, Kanishka Rao, Zhuo Xu, Eric Jang, Mohi Khansari, and Yunfei Bai. We’d also like to give special thanks to Ivonne Fajardo, Noah Brown, Benjamin Swanson, Christopher Paguyo, Armando Fuentes, and Sphurti More for overseeing the robot operations. We thank Paul Wohlhart, Konstantinos Bousmalis, Daniel Kappler, Alexander Herzog, Anthony Brohan, Yao Lu, Chad Richards, Vincent Vanhoucke, and Mrinal Kalakrishnan, Max Braun and others in the Robotics at Google team and the Everyday Robot Project for valuable discussions and help.

Categories
Misc

High-Performance Python Communication with UCX-Py

This post was originally published on the RAPIDS AI Blog.

TL;DR

UCX/UCX-Py is an accelerated networking library designed for low-latency high-bandwidth transfers for both host and GPU device memory objects.  You can easily get started by installing through conda (limited to linux-64):

> conda install -c rapidsai ucx-py ucx

Introduction

RAPIDS is committed to delivering the highest achievable performance for the PyData ecosystem. There are now numerous GPU libraries for high-performance and scalable alternatives to underlying common data science and analytics workflows. cuDF, which implements the Pandas DataFrame API, and cuML, which implements scikit-learn’s machine learning API, are two such accelerated libraries. These “scale-up” solutions, together with Dask, enable significant end-to-end performance improvements by combining state-of-the-art GPUs with distributed processing. However, with any scale-out solution, the bottleneck for distributed computation often lies with communication. We present here how easy it is to get started with our new Python communication library UCX-Py, based on OpenUCX, the Unified Communication X open framework, along with the current benchmarks.

UCX-Py allows RAPIDS to take advantage of hardware interconnects available on the system, such as NVLink (for direct GPU-GPU communication) and InfiniBand (for direct GPU-Fiber communication), thus providing higher bandwidth and lower latency for the application. UCX-Py does not only enable communication with NVLink and InfiniBand — including GPUDirectRDMA capability — but all transports supported by OpenUCX, including TCP and shared memory.

UCX-Py is a high-level library, meaning users are not required to do any complex message passing to be able to use it. The user triggers a send on one process and a receive on a different process; it’s that simple. More importantly, applications that already scale using Dask can easily start using UCX with a small, two-line change to existing code: just switch to the UCX protocol instead of the default TCP protocol.
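To illustrate how small the API surface is, here is a minimal loopback send/receive sketch following the pattern in the UCX-Py documentation (the port, buffer size, and single-process setup are illustrative assumptions):

import asyncio
import numpy as np
import ucp

PORT = 13337
N_BYTES = 2 ** 20

async def main():
    async def on_connect(ep):
        buf = np.empty(N_BYTES, dtype="u1")
        await ep.recv(buf)          # receive into a pre-allocated buffer
        await ep.send(buf)          # echo it back
        await ep.close()

    listener = ucp.create_listener(on_connect, PORT)

    ep = await ucp.create_endpoint(ucp.get_address(), PORT)
    msg = np.arange(N_BYTES, dtype="u1")
    await ep.send(msg)              # send on one end ...
    echo = np.empty_like(msg)
    await ep.recv(echo)             # ... and receive on the other
    assert (msg == echo).all()
    listener.close()

asyncio.run(main())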

Just like RAPIDS, UCX-Py is open source and currently hosted at https://github.com/rapidsai/ucx-py. Always up-to-date documentation is available at https://ucx-py.readthedocs.io/en/latest/.

Using UCX-Py with Dask

RAPIDS allows both multi-node and multi-GPU scalability by utilizing Dask, therefore Dask was the first and simplest use case for UCX-Py. While RAPIDS users may already be familiar with the dask-cuda package, one very simple use case is to start a cluster using all GPUs available in the system and connect a Dask client to it. Dask users often start a cluster as follows:

from dask_cuda import LocalCUDACluster
from distributed import Client
 
cluster = LocalCUDACluster(protocol="tcp")
client = Client(cluster)

The code above utilizes Python sockets for all communication over the TCP protocol. To switch to the UCX protocol for communication, there are two changes needed in the LocalCUDACluster constructor:

  • Specify the UCX protocol;
  • Specify the transport we want to use.

TCP over UCX

The simplest use of the UCX protocol in Dask is enabling TCP over UCX, and it looks as follows:

cluster = LocalCUDACluster(protocol="ucx", enable_tcp_over_ucx=True)

The change above does not need any special hardware and can be used in any machine that’s currently using dask-cuda (it also applies to LocalCluster in dask/distributed) as long as the UCX-Py package has been installed as well.

NVIDIA NVLink and NVSwitch

Now that we know how we can get started with UCX-Py in Dask, we can move on to more interesting cases. The first one is using NVLink (including NVSwitch if available) for all GPU-GPU communication, and we only add one more flag as seen below:

cluster = LocalCUDACluster(protocol="ucx", enable_tcp_over_ucx=True, enable_nvlink=True)

Note that in the above we also enable TCP over UCX, and it is a required step to allow NVLink. As a facility to users, the code above is equivalent to the one below (enable_tcp_over_ucx=True is implicit):

cluster = LocalCUDACluster(protocol="ucx", enable_nvlink=True)

NVIDIA InfiniBand

The last case we will discuss here is enabling InfiniBand. At this stage, you probably guessed that enabling it requires an enable_infiniband=True flag, and you guessed right. However, since InfiniBand requires communication between a GPU and an InfiniBand device, we need to specify the name of the correct InfiniBand interface to be used by each GPU. Suppose there is an InfiniBand interface named mlx5_0:1 available, a cluster could be created as follows:

cluster = LocalCUDACluster(protocol="ucx", enable_infiniband=True, ucx_net_devices="mlx5_0:1")

From a topological point of view, we prefer to select the InfiniBand device closest to the GPU, ensuring the highest bandwidth and lowest latency. A complex system such as a DGX-1 or a DGX-2 may contain several GPUs and several InfiniBand interfaces; in such a case, we don’t want to simply choose the first one, but the optimal one for that particular system. To address that, we introduced automatic detection of InfiniBand devices:

cluster = LocalCUDACluster(protocol="ucx", enable_infiniband=True, ucx_net_devices="auto")

It is also possible to have a custom configuration passing a function to the ucx_net_devices keyword argument, for that please check the documentation for the LocalCUDACluster class.

TCP, NVLink and InfiniBand

The transports listed above are not mutually exclusive, so you can use all of them together as well:

cluster = LocalCUDACluster(protocol="ucx", enable_tcp_over_ucx=True, enable_nvlink=True, enable_infiniband=True, ucx_net_devices="mlx5_0:1")

UCX with dask-cuda-worker CLI

All the options described above are also available in the dask-cuda-worker CLI. To use UCX with it, you must first start a dask-scheduler and choose the UCX protocol:

DASK_UCX__CUDA_COPY=True DASK_UCX__TCP=True DASK_UCX__NVLINK=True DASK_UCX__INFINIBAND=True DASK_UCX__NET_DEVICES=mlx5_0:1 dask-scheduler --protocol="ucx"

The command above will start a scheduler and print its address, which will be something such as ucx://10.10.10.10:8786 (note the protocol is now ucx://, as opposed to the default tcp://). Note also that we have to set Dask environment variables on that command to mimic the Dask-CUDA flags that enable each individual transport. This is because dask-scheduler is part of the mainline Dask Distributed project, so we can’t overload it with all the same CLI arguments that we can with dask-cuda-worker. For more details on UCX configuration with Dask-CUDA, see the Dask-CUDA documentation on Enabling UCX Communication.

We then connect dask-cuda-worker to that address; since the scheduler uses the ucx:// protocol, the workers will too, without any --protocol="ucx". It is still necessary to specify which transports the workers should use:

dask-cuda-worker ucx://10.10.10.10:8786 --enable-tcp-over-ucx --enable-nvlink --enable-infiniband --ucx-net-devices="auto"

Benchmarks

We want to analyze the performance gains of UCX-Py in two scenarios with a variety of accelerated devices. First, how does UCX-Py perform when passing host and device buffers between two endpoints for both InfiniBand and NVLink. Second, how does UCX-Py perform in a common data analysis workload.

When benchmarking 1GB messages between endpoints over NVLink we measured bandwidth of 46.5 GB/s — quite close to the theoretical limit of NVLink: 50 GB/s. This shows that the addition of a Python layer to UCX core (written in C) does not significantly impact performance. Using a similar setup, we configured UCX/UCX-Py to pass messages over InfiniBand. Measurements present a bandwidth of 11.2 GB/s in that case, which is also similarly near the hardware limit (12.5 GB/s) of Connect X-4 Mellanox InfiniBand controllers.

As the numbers above suggest, passing single message GPU objects between endpoints is very efficient and we now ask: how does UCX-Py perform in the context of a common workflow?  Dataframe merges are good candidates to benchmark as they are extremely common and require a high degree of communication. Additionally, a merge, performed by Dask-cuDF, will also test critical operations in RAPIDS and Dask libraries which can degrade communications, most notably serialization.

The graph in Figure 1 shows the cuDF merge benchmark performance on a DGX1 with all 8 GPUs in use. This workflow generates two dask-cudf dataframes with random, equally distributed data, and merges both dataframes into one.
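The sketch below gives the flavor of that workload (toy sizes, partition counts, and column names are my own assumptions, not the benchmark's actual configuration), reusing the UCX-enabled cluster from the earlier examples:

import cudf
import cupy as cp
import dask_cudf
from dask_cuda import LocalCUDACluster
from distributed import Client, wait

cluster = LocalCUDACluster(protocol="ucx", enable_tcp_over_ucx=True, enable_nvlink=True)
client = Client(cluster)

def random_frame(n_rows, n_keys):
    # Random, evenly distributed keys and values, generated on the GPU.
    return cudf.DataFrame({
        "key": cp.random.randint(0, n_keys, n_rows),
        "value": cp.random.random(n_rows),
    })

left = dask_cudf.from_cudf(random_frame(1_000_000, 10_000), npartitions=8)
right = dask_cudf.from_cudf(random_frame(1_000_000, 10_000), npartitions=8)

# The merge shuffles data between workers, exercising the UCX transports.
merged = left.merge(right, on="key")
wait(merged.persist())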

Figure 1: Measured bandwidth of the cuDF merge benchmark with UCX-Py on a DGX-1 server. Using both the InfiniBand and NVLink transports reaches just over 18 GB/s; InfiniBand without NVLink reaches 12.6 GB/s; NVLink with TCP achieves 4.3 GB/s; TCP over UCX reaches 2.8 GB/s; and TCP with Python sockets (no UCX involved) achieves only 970 MB/s.

Overall we see improvements using UCX-Py compared to using Python Sockets for TCP communication. However, there are some unexpected results. If NVLink is capable of 32GB/s and InfiniBand is 12GB/s, why then does InfiniBand perform better than NVLink? The answer is: topology! The layout of hardware interconnection is of great importance and can often be complex.

Figure 2: Networking topology of a DGX-1 server.

Figure 2 presents the layout of an NVIDIA DGX-1 server; we can think of this single box as a small supercomputer. The DGX-1 has eight P100 GPUs, four InfiniBand devices, and two CPUs. What is the most efficient route to take when passing data between GPU0 and GPU1? What about GPU2 and GPU7? GPU0 and GPU1 are connected via NVLink, but GPU2 and GPU7 have no direct NVLink connection; therefore, when passing data between them, we must perform two costly operations: Device-to-Host (moving data from GPU2 to the host) and Host-to-Device (moving data from the host to GPU7), and transferring data through main memory is expensive. Fortunately, a DGX-1 allows transfers between two GPUs that are not NVLink connected to use a more efficient transfer method than the costly operations described. With InfiniBand and GPUDirectRDMA, two GPUs can communicate without touching the main memory at all, allowing GPU7 to directly read GPU2’s memory via the InfiniBand interconnect only.

Referring back to Figure 1, you can now see why only using InfiniBand outperforms the NVLink connection alone — Host-to-Device/Device-to-Host transfers are costly! However, when combining InfiniBand with NVLink we achieve more optimal performance. 

Further improvements

UCX-Py is under constant development, which means changes may occur without notice. Our ultimate commitment is to deliver high-performance communication to Python in a high-quality library package. With that said, we expect in the near future to keep on improving stability and ease-of-use, enabling users to benefit from enhanced performance with little to no effort to configure UCX-Py, thus automatically enabling accelerated communication hardware.

So far, we have tested UCX-Py mostly in DGX-1 and DGX-2 systems, where point-to-point and end-to-end performance was demonstrated. We scaled our testing up to one hundred Dask CUDA workers. We are constantly working to improve performance and stability, particularly for large scale workflows, and encourage users to engage with our team on GitHub and extend the discussion on UCX-Py.

Categories
Misc

Coming June to GFN Thursday: 38 More Games Streaming on GeForce NOW This Month

What a gamer wants. What a gamer needs. Games that’ll make you happy and cloud gaming that’ll set you free … to play your favorite PC games anywhere, anytime, on GeForce NOW. Get a sneak peek of all the exciting games coming in June this GFN Thursday. Starting with titles launching today. Keeping It Fresh Read article >

The post Coming June to GFN Thursday: 38 More Games Streaming on GeForce NOW This Month appeared first on The Official NVIDIA Blog.

Categories
Misc

NOOB: How to get TensorFlow model to run on Jetson Nano?

Hello –

I built a model (for ANPR) using TensorFlow and EasyOCR. Over the past week or so, getting TensorFlow to install on the Jetson Nano has been next to impossible. There are tons of issues with it (some are documented), and overall I found only one person who was able to get it running well, and it took them over 50 hours to install on the Jetson Nano.

This said, there has to be a better way to get a TensorFlow model to run on a Jetson Nano. Does anyone know how to go about doing this? Is this where ONNX comes in (and if so, any ideas/resources you might be able to point me to that show how)?
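For what it's worth, one commonly suggested path (a hedged sketch; the paths, opset, and input shape below are assumptions about the model, not known details) is to export the TensorFlow model as a SavedModel, convert it to ONNX with tf2onnx, then run it with onnxruntime, or feed the ONNX file to TensorRT on the Nano for an optimized engine:

# Conversion step (run once, on any machine with TensorFlow + tf2onnx installed):
#   python -m tf2onnx.convert --saved-model ./anpr_saved_model --output anpr.onnx --opset 13
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("anpr.onnx")
input_name = sess.get_inputs()[0].name
dummy = np.zeros((1, 224, 224, 3), dtype=np.float32)   # assumed input shape
outputs = sess.run(None, {input_name: dummy})
print([o.shape for o in outputs])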

Maybe I should go down another rabbit hole and try to do ANPR with PyTorch as I have read in the NVIDIA forums that PyTorch seems to be a bit easier to run on the Jetson line?

Any/all thoughts/help is much appreciated.

Thanks!

submitted by /u/InstantAmmo
[visit reddit] [comments]

Categories
Misc

Train Model for Pothole Detection to be used on raspberrypi

Hello, I followed random tutorials and tried training my custom model.

Can somebody suggest a tutorial or give me some tips on training an improved model?

Any help would be greatly appreciated.

I am a beginner.

submitted by /u/abracadabra61
[visit reddit] [comments]

Categories
Misc

NVIDIA Spotlights GeForce Partners at COMPUTEX

Highlighting deep support from a flourishing roster of GeForce partners, NVIDIA’s Jeff Fisher delivered a virtual keynote at COMPUTEX 2021 in Taipei Tuesday. Fisher, senior vice president of NVIDIA’s GeForce business, announced a pair of powerful new gaming GPUs — the GeForce RTX 3080 Ti and GeForce RTX 3070 Ti — and detailed the fast-growing Read article >

The post NVIDIA Spotlights GeForce Partners at COMPUTEX appeared first on The Official NVIDIA Blog.