Categories
Misc

Lending a Helping Hand: Jules Anh Tuan Nguyen on Building a Neuroprosthetic

With deep learning, amputees can now control their prosthetics by simply thinking through the motion. Jules Anh Tuan Nguyen spoke with NVIDIA AI Podcast host Noah Kravitz about his efforts to allow amputees to control their prosthetic limb, right down to the finger motions, with their minds.

The post Lending a Helping Hand: Jules Anh Tuan Nguyen on Building a Neuroprosthetic appeared first on The Official NVIDIA Blog.

Categories
Misc

Watch: Making Masterpieces in the Cloud With Virtual Reality

Immersive 3D design and character creation are going sky high this week at SIGGRAPH, in a demo showcasing NVIDIA CloudXR running on Google Cloud. The clip shows an artist with an untethered VR headset creating a fully rigged character with Masterpiece Studio Pro, which is running remotely in Google Cloud and interactively streamed to the …

The post Watch: Making Masterpieces in the Cloud With Virtual Reality appeared first on The Official NVIDIA Blog.

Categories
Misc

Upcoming Webinar: Accelerate AI Model Development with PyTorch Lightning

The NGC team is hosting a webinar with live Q&A to dive into how to build AI models using PyTorch Lightning, an AI framework built on top of PyTorch, from the NGC catalog.

Simplify and Accelerate AI Model Development with PyTorch Lightning, NGC, and AWS
September 2 at 10 a.m. PT

Organizations across industries are using AI to help build better products, streamline operations, and increase customer satisfaction. 

Today, speech recognition services are deployed in financial organizations to transcribe earnings calls, in hospitals to assist doctors writing patient notes, and in video broadcasting for live captioning.

Under the hood, researchers and data scientists are building hundreds of AI models to experiment and identify the most impactful models to deploy for their use cases.

PyTorch Lightning, an AI framework built on top of PyTorch, simplifies coding, so researchers can focus on building models and reduce time spent on the engineering process. It also speeds up the development of hundreds of models by easily scaling on GPUs within and across nodes.

By joining this webinar, you will learn:

  • About the benefit of using PyTorch Lightning and how it simplifies building complex models
  • How NGC helps accelerate AI development by simplifying software deployment
  • How researchers can quickly build models using PyTorch Lightning from NGC on AWS

Register now >>>

Categories
Misc

NVIDIA and Google Cloud Deliver High-Quality XR Streaming

NVIDIA CloudXR is coming to NVIDIA RTX Virtual Workstation instances on Google Cloud.

Built on NVIDIA RTX GPUs, NVIDIA CloudXR enables streaming of immersive AR, VR, or mixed reality experiences from anywhere. Organizations can easily set up and scale immersive experiences from any location, to any VR or AR device. 

By streaming from Google Cloud, XR users can securely access data from the cloud at any time, and they can easily share the immersive experiences with other teams or customers.

NVIDIA CloudXR Enables Masterpiece Creation Anywhere

NVIDIA is partnering with Masterpiece Studio to showcase state-of-the-art character animation in VR. With NVIDIA CloudXR, Masterpiece Studio customers around the world can stream and collaborate on character creation workflows in a VR environment.

And with NVIDIA CloudXR, creators anywhere in the world can access a high-powered NVIDIA RTX Virtual Workstation, provisioned and delivered by Google Cloud.

“Creators should have the freedom of working from anywhere, without needing to be physically tethered to a workstation to work on characters or 3D models in VR,” said Jonathan Gagne, CEO at Masterpiece Studio. “With NVIDIA CloudXR, our customers will be able to power their creative workflows in high-quality immersive environments, from any location, on any device.”

“NVIDIA CloudXR technology delivered via Google Cloud’s global private fiber optic network provides an optimized, high-quality user experience for remotely streamed VR experiences. This unique combination unlocks the ability to easily stream work from anywhere using NVIDIA RTX Virtual Workstation,” said Rob Martin, Chief Architect for Gaming at Google. “With NVIDIA CloudXR on Google Cloud, the future of VR workflows can be more collaborative, intuitive, and productive.”

Learn more about NVIDIA CloudXR on Google Cloud with this SIGGRAPH demo.

NVIDIA CloudXR Availability

With NVIDIA CloudXR running on GPU-powered virtual machine instances on Google Cloud, companies can provide XR creators and end users with high-quality virtual experiences from anywhere in the world. NVIDIA CloudXR on Google Cloud will be generally available later this year, with a private beta available soon. 

The NVIDIA CloudXR platform includes:

  • NVIDIA CloudXR SDK, which supports all OpenVR apps and includes broad client support for phones, tablets and HMDs.
  • NVIDIA RTX Virtual Workstations to provide high-quality graphics at the fastest frame rates.
  • NVIDIA AI SDKs to accelerate performance and increase immersive presence.

Apply now for early access to the SDK, and sign up to get the latest news and updates on upcoming NVIDIA CloudXR releases, including the private beta.

Categories
Misc

object detection: exporting a specific checkpoint

How would one export a specific named checkpoint from the outputs/ directory? It seems running python tensorflow_models/research/object_detection/exporter_main_v2.py … only uses the latest checkpoint in the trained_checkpoint_dir directory.

Can I export the latest checkpoint and then manually replace it with my desired one, or does the exporter do some extra processing after copying?
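
One workaround that may be worth trying is sketched below. It rests on an assumption to verify against your installed version: that the exporter resolves the "latest" checkpoint through the text-format `checkpoint` state file inside `trained_checkpoint_dir`. Under that assumption, copying the wanted checkpoint into a fresh directory and writing a state file that names it should steer the exporter to it. The helper name `stage_checkpoint` and the directory names are hypothetical.

```python
import shutil
from pathlib import Path


def stage_checkpoint(src_dir, ckpt_name, dst_dir):
    """Copy one named checkpoint into dst_dir and point the state file at it."""
    src, dst = Path(src_dir), Path(dst_dir)
    # A TF2 checkpoint is a set of files sharing a prefix, for example
    # ckpt-7.index and ckpt-7.data-00000-of-00001.
    files = sorted(src.glob(ckpt_name + ".*"))
    if not files:
        raise FileNotFoundError(f"no files for {ckpt_name!r} in {src}")
    dst.mkdir(parents=True, exist_ok=True)
    for f in files:
        shutil.copy2(f, dst / f.name)
    # Assumption: the exporter reads this state file to find the checkpoint.
    (dst / "checkpoint").write_text(
        f'model_checkpoint_path: "{ckpt_name}"\n'
        f'all_model_checkpoint_paths: "{ckpt_name}"\n'
    )
    return dst
```

After something like `stage_checkpoint("outputs", "ckpt-7", "outputs_ckpt7")`, the exporter would be pointed at the staged directory through its trained_checkpoint_dir argument. Working on a copy leaves the original training directory untouched.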

submitted by /u/Meriipu
[visit reddit] [comments]

Categories
Misc

All AI Do Is Win: NVIDIA Research Nabs ‘Best in Show’ with Digital Avatars at SIGGRAPH

In a turducken of a demo, NVIDIA researchers stuffed four AI models into a serving of digital avatar technology for SIGGRAPH 2021’s Real-Time Live showcase, winning the Best in Show award. The showcase, one of the most anticipated events at the world’s largest computer graphics conference, held virtually this year, celebrates cutting-edge real-time projects …

The post All AI Do Is Win: NVIDIA Research Nabs ‘Best in Show’ with Digital Avatars at SIGGRAPH appeared first on The Official NVIDIA Blog.

Categories
Misc

Leveling up CUDA Performance on WSL2 with New Enhancements

In June 2020, we released the first NVIDIA Display Driver that enabled GPU acceleration in the Windows Subsystem for Linux (WSL) 2 for Windows Insider Program (WIP) Preview users. At that time, it was still an early preview with a limited set of features. A year later, as we have steadily added new capabilities, we have also been focusing on optimizing the CUDA driver to deliver top performance on WSL2.

WSL is a Windows 10 feature that enables you to run native Linux command-line tools directly on Windows, without requiring the complexity of a dual-boot environment. Internally, WSL is a containerized environment that is tightly integrated with the Microsoft Windows OS. WSL2 enables you to run Linux applications alongside traditional Windows desktop and modern store apps. For more information about CUDA on WSL, see Announcing CUDA on Windows Subsystem for Linux 2.

In this post, we focus on the current state of the CUDA performance on WSL2, the various performance-centric optimizations that have been made, and what to look forward to in the future.

Current state of WSL performance

Over the past several months, we have been tuning the performance of the CUDA Driver on WSL2 by analyzing and optimizing multiple critical driver paths, both on the NVIDIA and the Microsoft sides. In this post, we go into detail on what we have done exactly to reach the current performance level. Before we start that, here’s the current state of WSL2 on a couple of baseline benchmarks.

On WSL2, all the GPU operations are serialized through VMBUS and sent to the host kernel interface. One of the most common performance questions around WSL2 is the overhead of said operations. We understand that developers want to know whether there is any overhead to running workloads in WSL2 compared to running them directly on native Linux. Is there a difference? Is this overhead significant?

Figure showing near native performance results using the Blender benchmark test.
Figure 1. Blender benchmark results (WSL2 vs. Native, results in seconds, lower is better).

For the Blender benchmark, WSL2 performance is comparable or close to native Linux (within 1%). Because Blender Cycles pushes long-running kernels to the GPU, the overhead of WSL2 is not visible in any of those benchmarks.

Figure showing near native performance results using the Rodinia benchmark test.
Figure 2. Rodinia benchmark suite results (WSL2 vs. Native, results in seconds, lower is better).

When it comes to the Rodinia Benchmark suite (Figure 2), we have come a long way from the performance we were able to achieve when we first launched support for WSL2.

The new driver performs considerably better and can even reach close to native execution time for the Particle Filter tests. It also finally closes the gap for the Myocyte benchmark, which is especially significant because the early WSL2 results were up to 10 times slower than native Linux. Myocyte is particularly hard on WSL2, as this benchmark consists of many extremely small sequential submissions (less than microseconds), making it a sequential launch latency microbenchmark. This is an area that we’re investigating to achieve complete performance parity.

Figure showing performance results within 10% using the GenomeWorks benchmark test.
Figure 3. GenomeWorks CUDA Aligner sample execution time (WSL2 vs. Native, results in seconds, lower is better).

For the GenomeWorks benchmark (Figure 3), we are using CUDA aligner for GPU-Accelerated pairwise alignment. To show the worst-case scenario of performance overhead, the benchmark runs here were done with a sample dataset composed of short running kernels. Due to how short the kernel launches are, you can observe the launch latency overhead on WSL2. However, even for this worst-case example, the performance is equal to or more than 90% of the native speed. Our expectation is that for real-world use cases, where dataset sizes are typically larger, performance will be close to native performance.

To explore this key trade-off between kernel size and WSL2 performance, look at the next benchmark.

Figure showing different batch size performance results using the PyTorch MNIST benchmark test. These areas are where we intend to improve performance.
Figure 4. PyTorch MNIST sample time per epoch, with various batch sizes (WSL2 vs. Native, results in seconds, lower is better).

Figure 4 shows the PyTorch MNIST test, a purposefully small, toy machine learning sample that highlights how important it is to keep the GPU busy to reach satisfactory performance on WSL2. As with native Linux, the smaller the workload, the more likely that you’ll see performance degradation due to the overhead of launching a GPU process. This degradation is more pronounced on WSL2, and scales differently compared to native Linux.

As we keep improving the WSL2 driver, this difference in scaling for exceedingly small workloads should become less and less pronounced. The best way to avoid these pitfalls, both on WSL2 and on native Linux, is to keep the GPU busy as much as possible.
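
To make the batch-size effect concrete, here is a toy cost model in Python. It is a sketch under stated assumptions, not a measurement: it assumes every training step pays a fixed overhead (launch and synchronization) plus compute that scales with the number of samples. The function name and all timing values are hypothetical and chosen only for illustration.

```python
import math


def epoch_time_s(num_samples, batch_size, per_step_overhead_s, per_sample_compute_s):
    """Toy model: each step pays a fixed overhead, compute scales with samples."""
    steps = math.ceil(num_samples / batch_size)
    return steps * per_step_overhead_s + num_samples * per_sample_compute_s


if __name__ == "__main__":
    n = 60_000  # size of the MNIST training set
    for batch in (16, 64, 256, 1024):
        # Hypothetical per-step overheads for a lower- and a higher-latency platform.
        low = epoch_time_s(n, batch, per_step_overhead_s=100e-6, per_sample_compute_s=20e-6)
        high = epoch_time_s(n, batch, per_step_overhead_s=300e-6, per_sample_compute_s=20e-6)
        print(f"batch={batch:5d}  low={low:.2f}s  high={high:.2f}s  ratio={high / low:.2f}")
```

In this model the ratio between the two platforms shrinks toward 1 as the batch size grows, because the fixed per-step cost is spread over more work. That is the same qualitative trend described for Figure 4.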

Setting | WSL2 | Native Linux
OS | Latest Windows Insider Preview | Ubuntu 20.04
WSL Kernel Driver | 5.10.16.3-microsoft-standard-WSL2 | N/A
Driver Model | GPU Accelerated Hardware Scheduling | N/A
System | All benchmarks were run on the same system with an NVIDIA RTX 6000 | All benchmarks were run on the same system with an NVIDIA RTX 6000
Table 1. System configuration and software releases used for benchmark testing.

Benchmark name | Description
Blender | Classic Blender benchmark run with CUDA (not NVIDIA OptiX) on the BMW and Pavillion Barcelona scenes.
NVIDIA GenomeWorks | CUDA pairwise alignment sample (available as a sample in the GenomeWorks repository).
PyTorch MNIST | Modified (code added to time each epoch) MNIST sample.
Myocyte, Particle Filter | Benchmarks that are part of the Rodinia benchmark suite.
Table 2. Benchmark test names used with a brief description of each.

Launch latency optimizations

Launch latency is one of the leading causes of performance disparities between some native Linux applications and WSL2. There are two important metrics here:

  • GPU kernel launch latency: The time it takes to launch a kernel with a CUDA call and start execution by the GPU.
  • End-to-end overhead (launch latency plus synchronization overhead): The overall time it takes to launch a kernel with a CUDA call and wait for its completion on the CPU, excluding the kernel run time itself.

Launch latency is usually negligible when the workload pushed onto the GPU is significantly bigger than the latency itself. Thanks to CUDA primitives like streams and graphs, you can keep the GPU busy and can leverage the asynchronous nature of these APIs to overcome any latency issues. However, when the execution time of the workload sent to the GPU is close to the launch latency, then it quickly becomes a major performance bottleneck. The launch latency will act as a launch rate limiter, which causes kernel execution performance to plunge.

Launch latency on native Windows

Before diving into what makes launch latency a significant obstacle to overcome on WSL2, we explain the launch path of a CUDA kernel on native Windows. There are two different launch models implemented in the CUDA driver for Windows: one for packet scheduling and another for hardware-accelerated GPU scheduling.

Packet scheduling

In packet scheduling, the OS is responsible for most of the scheduling work. However, to compensate for the submission model and the significant launch overhead, the CUDA driver always tries to batch a certain number of kernel launches based on various heuristics. Figure 5 shows that in packet scheduling mode, the OS schedules submissions and they are serialized for a given context. This means that all work of one submission must finish before any work of the next submission can start.

To improve the throughput in packet scheduling mode, the CUDA driver tries to aggregate some of the launches together in a single submission, even though internally they are dispatched across multiple GPU queues. This heuristic helps with false dependency and parallelism, and it also reduces the number of packets submitted, reducing scheduling overhead times.

Figure showing how WDDM packet scheduling works within the CUDA Driver.
Figure 5. Overview of the WDDM Packet Scheduling model and its use in the CUDA Driver.

In this submission model, you see performance reach its limits when the workload is launch latency bound. You can force outstanding submissions to be issued, by querying the status of a stream with a small pending workload. In this case, it still suffers from high scheduling overhead, on top of having to deal with potential false dependencies.

Hardware-accelerated GPU scheduling

More recently, Microsoft introduced a new model called hardware-accelerated GPU scheduling. Using this model, hardware queues are directly exposed for a given context and the user mode driver (in this case, CUDA) is solely responsible for managing the work submissions and the dependencies between the work items. It removes the need for batching multiple kernel launches into a single submission, enabling you to adopt the same strategy as used in a native Linux driver where work submissions are almost instantaneous (Figure 6).

Figure showing how WDDM hardware scheduling works within the CUDA driver.
Figure 6. Overview of the WDDM Hardware Scheduling model and its use in the CUDA Driver.

This hardware scheduling-based submission model removes the false dependency and avoids the need for buffering. It also reduces the overhead by offloading some of the OS scheduling tasks previously handled on the CPUs to the GPU.

Leveraging HW-accelerated GPU scheduling on WSL2

Why do these scheduling details matter? Native Windows applications were traditionally designed to hide the higher latency. However, launch latency was never a factor for native Linux applications, where the threshold at which latency affects performance was an order of magnitude smaller than the one on Windows.

When these same Linux applications run in WSL2, the launch latency becomes more prominent. Here, the benefits of hardware-accelerated GPU scheduling can offset the latency-induced performance loss, as CUDA adopts the same submission strategy followed on native Linux for both WSL2 and native Windows. We strongly recommend switching to hardware-accelerated GPU scheduling mode when running WSL2.

Even with hardware-accelerated GPU scheduling, submitting work to the GPU is still done with a call to the OS, just like in packet scheduling. In some cases, synchronization also has to make OS calls for error detection. Each such call to the OS on WSL2 involves crossing the WSL2 boundary to reach the host kernel mode through VMBUS. This can quickly become the single bottleneck for the driver (Figure 7). Linux applications that are doing small batches of GPU work at a time may still not perform well.

Figure 7. Overview of the submission path on WSL2 and the various locations of the extra overhead.

Asynchronous submissions to reduce the launch latency

We found a solution to mitigate the extra launch latency on WSL through a change made by Microsoft to make the Submit call asynchronous. By leveraging this call, you can start overlapping other operations while the submission is happening and hide the extra WSL overhead in this way. Thanks to the new asynchronous nature of the submit call, the launch latency is now comparable to native Windows.

Figure showing WSL2 and native Windows launch latency differences.
Figure 8. Microbenchmark of the launch latency on WSL2 and Native Windows.

Despite the optimization made in the synchronization path, the total overhead of launching and synchronizing on a submission is still higher compared to native Windows. The VMBUS overhead at point 1 causes this, not the synchronization path itself (Figure 7). This effect can be seen in Figure 8, where we measure the overhead of a single launch, followed by synchronization. The extra latency induced by VMBUS is clearly visible.

Making the submission call asynchronous does not necessarily remove the launch latency cost altogether. Instead, it enables you to offset it by doing other operations at the same time. An application can pipeline multiple launches on a stream, for instance, assuming that the kernels run long enough to cover the extra latency. In that case, the cost is hidden behind GPU work and is visible only at the beginning of a long series of submissions.
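
The effect of pipelining can be illustrated with a small timeline simulation. This is a simplified sketch, assuming one CPU thread feeding one in-order GPU stream, a fixed launch latency, a fixed kernel duration, and a fixed synchronization cost. The function name and the numbers are hypothetical and do not come from the benchmarks in this post.

```python
def total_time_us(num_kernels, launch_us, kernel_us, sync_us, pipelined):
    """Toy timeline for one CPU thread feeding one in-order GPU stream."""
    cpu = 0.0       # time at which the CPU can issue its next call
    gpu_free = 0.0  # time at which the GPU finishes the work queued so far
    for _ in range(num_kernels):
        cpu += launch_us                           # the launch reaches the GPU
        gpu_free = max(gpu_free, cpu) + kernel_us  # kernel runs once the GPU is free
        if not pipelined:
            cpu = gpu_free + sync_us               # block until this kernel is done
    if pipelined:
        cpu = max(cpu, gpu_free) + sync_us         # one synchronization at the end
    return cpu


if __name__ == "__main__":
    for kernel_us in (2, 10, 50):
        serial = total_time_us(100, 10, kernel_us, 20, pipelined=False)
        piped = total_time_us(100, 10, kernel_us, 20, pipelined=True)
        print(f"kernel={kernel_us:3d}us  sync-each={serial:.0f}us  pipelined={piped:.0f}us")
```

When the kernel time is at least the launch latency, the pipelined total is one launch latency plus the GPU work plus one synchronization, so the launch cost shows up only once. When kernels are shorter than the launch latency, the launch rate dominates and the GPU sits idle between kernels.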

In short, we have improved performance on WSL2 and will continue to optimize it. Despite all the optimizations mentioned thus far, if applications are not pipelining enough workload on the GPU, or worse, if the workload is too small, a performance gap between native Linux and WSL2 will start to appear. This is also why comparisons between WSL2 and native Linux are challenging and vary widely from benchmark to benchmark.

Imagine that the application is pipelining enough work to shadow the latency overhead and keep the GPU busy during the entire lifetime of the application. With the current set of optimizations, chances are that the performance will be close to or even comparable with native Linux applications.

When the GPU workload submitted by an application is not long enough to overcome that latency, a performance gap between native Linux and WSL2 will start to appear. The gap is proportional to the difference between the overall latency and the size of the work pushed at one time.

This is also why, despite all the improvements made in this area, we will keep focusing on reducing this latency to bring it closer and closer to native Linux.

New allocation optimization

Another area of focus for us has been memory allocation. Unlike launch latency, which affects the performance for as long as the application is launching work on the GPU, memory allocations mostly affect the startup and loading and unloading phases of a program.

This does not mean that it is unimportant; far from it. Even if those operations are infrequent compared to just submitting work on the GPU, the associated driver overhead is usually an order of magnitude higher. Allocating several megabytes at a time ends up taking several milliseconds to complete.

To optimize this path, one of our main approaches has been to enable asynchronous paging operations in CUDA. This capability has been available in the Windows Display Driver Model (WDDM) for a while, but the CUDA driver never used it, until now. The main advantage of this strategy is that you can exit the allocation call and give control back to the user code. You don’t have to wait for an expensive GPU operation to complete, for example, updating the page table. Instead, the wait is postponed to the next operation that references the allocation.

Not only does this improve the overlap between the CPU and GPU work, but it can also eliminate the wait altogether. If the paging operation completes early enough, the CUDA driver can avoid issuing an OS call to wait on the paging operation by snooping a mapped fence value. On WSL2, this is particularly important. Anytime that you avoid calling into the host kernel mode, you also avoid the VMBUS overhead.
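
The deferred-wait idea can be sketched in a few lines of Python. This is a conceptual model only and not the CUDA driver's implementation: the classes `Fence`, `Allocation`, and `Allocator` are hypothetical, the fence is modeled as a plain counter, and the "OS wait" is simulated by incrementing a counter instead of blocking.

```python
class Fence:
    """Monotonic counter that advances as paging operations complete."""

    def __init__(self):
        self.value = 0     # stands in for the CPU-mapped fence value
        self.os_waits = 0  # number of simulated calls into the OS

    def signal(self, value):
        self.value = max(self.value, value)

    def wait(self, target):
        if self.value >= target:
            return             # fast path: reading the mapped value is enough
        self.os_waits += 1     # slow path: stands in for a blocking OS wait
        self.value = target


class Allocation:
    def __init__(self, fence, ticket):
        self.fence, self.ticket = fence, ticket

    def use(self):
        """The wait is deferred to the first operation that touches the allocation."""
        self.fence.wait(self.ticket)


class Allocator:
    def __init__(self):
        self.fence = Fence()
        self.next_ticket = 0

    def alloc(self):
        # Queue the paging operation and return to the caller immediately.
        self.next_ticket += 1
        return Allocation(self.fence, self.next_ticket)
```

If the paging operation has already signaled the fence by the time the allocation is first used, `use()` returns without the simulated OS call. On WSL2, that skipped call corresponds to a skipped VMBUS round trip.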

Figure showing how the CUDA driver allocates memory and how asynchronous mapping works.
Figure 9. Overview of the asynchronous mapping of allocation done in the CUDA Driver.

Are we there yet?

We have come a long way when it comes to WSL2 performance over the past months, and we are now seeing results comparable or close to native Linux for many benchmarks. This doesn’t mean that we have reached our goal and that we will stop optimizing the driver. Not at all!

First, future optimization in hardware scheduling, currently being looked at by Microsoft, might allow us to bring the launch overhead to a minimum. In the meantime, until those features are fully developed, we will keep optimizing the CUDA driver on WSL, with recommendations for native Windows as well.

Second, we will focus on fast and efficient memory allocation through some special form of memory copy. We will also soon start looking at better multi-GPU features and optimizations on WSL2 to enable even more intensive workloads to run fast.

WSL2 is a fully supported platform for NVIDIA, and it will be given the same feature offerings and performance focus that CUDA strives for on all its other supported platforms. It is our intent to make WSL2 performance better and suitable for development. We will also make this into a CUDA platform that is attractive for every use case, with performance as close as possible to any native Linux system.

Last, but not least, we heartily thank the developer community that has been rapidly adopting GPU acceleration in the WSL2 preview, reporting issues, and providing feedback continuously over the past year. You have helped us uncover potential issues and make big strides on performance by sharing with us performance use cases that we might have missed otherwise. Without your unwavering support, GPU acceleration on WSL2 would not be where it is today. We look forward to engaging with the community further as we work on achieving future milestones for CUDA on WSL2.

To access the driver installers and documentation, register for the NVIDIA Developer Program and Microsoft Windows Insider Program.

The following resources contain valuable information to aid you on how CUDA works with WSL2, including how to get started with running applications, and deep learning containers:

We encourage everyone to use our forum and share their experience with the larger WSL community.

Categories
Misc

How can I achieve sparse connections between filters in two convolutional layers?

submitted by /u/ElvishChampion
[visit reddit] [comments]
Categories
Misc

Release John Snow Labs Spark-NLP 3.2.0: New Longformer embeddings, BERT and DistilBERT for Token Classification, GraphExtraction, Spark NLP Configurations, new state-of-the-art multilingual NER models, and lots more! · JohnSnowLabs/spark-nlp

submitted by /u/dark-night-rises
[visit reddit] [comments]
Categories
Misc

Announcing Nsight Deep Learning Designer 2021.1 – SDK for Efficient Deep Learning Model Design and Development

Nsight Deep Learning Designer 2021.1

Today NVIDIA announced Nsight DL Designer – the first in-class integrated development environment to support efficient design of deep neural networks for in-app inference.

This SDK aims at streamlining the often iterative process of designing deep neural network models for in-app inferencing by providing efficient support at every stage of the process. 

Nsight DL Designer is a GUI-based tool for model design and has integrated profiling capabilities that are based on GPU metrics. It provides a convenient way to import models into PyTorch for training. A visual analysis mode allows developers to dive deep into the inference process in real time and in an interactive manner, with flexible options to export the finalized model for inference deployment.

End-to-end Nsight DL Designer Workflow

Developers start by designing their deep neural network models inside Nsight DL Designer, using a built-in set of high-level neural network layers implemented by NVIDIA as the NvNeural inference engine. After creating the model, you can profile its performance to get a basic idea of whether it meets the allocated timing budget. The profiling can be done early, even before you spend time on training the network.

For the training phase, Nsight DL Designer provides a variety of Python scripts that automatically convert an Nsight DL Designer model into a PyTorch model that can be easily added to your training loop. When training is done, you can save the learned weights from your model into NumPy files. You can go back to DL Designer, load the weights file, and enter the analysis mode to examine the inference results. The analysis mode also allows developers to dive deep into the inference process and visually inspect what’s happening at each inference step. This feedback can potentially guide developers to optimize their network model for improved quality and performance.

Once you are satisfied with both the quality and performance of your model, it’s time for deployment. Nsight DL Designer provides several ways to support deploying a model. One way is to export your model as an ONNX file. With the ONNX file, you can deploy your model on any platform where ONNX runtime

To learn more, watch the SIGGRAPH session: Announcing Nsight Deep Learning Designer – Optimize Your Neural Network for Quality and High Performance.

Key features for this release include:

  • GUI-based model design with the NvNeural inference engine
  • Inference performance profiling with GPU metrics
  • Interface with training framework – PyTorch
  • Interactive visual analysis of the inference process
  • Automated model export and code generation for deployment

Resources:

  • Learn More and download DL Designer today.
  • Find getting started documentation here.
  • Post comments on Forums.