Whether helping the world understand our most immediate threats, like COVID-19, or envisioning the future of landing humans on Mars, researchers are increasingly leaning on scientific visualization to analyze, understand, and extract scientific insights. With large-scale simulations generating tens or even hundreds of terabytes of data, and with team members dispersed around the globe, researchers …
Data science development faces many challenges in the areas of:
Exploration and model development
Training and evaluation
Model scoring and inference
Some estimates suggest that 70%-90% of development time is spent on experimentation, much of which will run fast and efficiently on GPU-enabled mobile and desktop workstations. Running on a Linux mobile workstation, for example, presents another set of challenges: installing and configuring a data science stack, keeping the stack up to date, installing and updating drivers, supporting needed Office productivity apps, and the lack of an easy, intuitive way to access helpful tools and software that accelerate development.
New Data Science Client and WSL2 to the rescue!
In a GTC Live session, Dima Rekesh, Karan Jhavar, and I will discuss a new Data Science Client (DSC) and support for Windows Subsystem for Linux 2 (WSL2) that address the challenges above. This not only makes it practical to run countless experiments locally before model training at scale, but also removes the complexity of maintaining a local data science stack while preserving compatibility with popular Microsoft Office applications.
For data scientists who want or need unlimited experimentation for creativity and better models overall, the NVIDIA DSC is designed to make developers productive faster, providing simple access to common tools and frameworks (e.g., Jupyter Notebooks, RAPIDS) to make data science development on workstations easier and more productive.
If you’d like to learn more, we encourage you to register for the NVIDIA GTC conference and attend the live session.
Note: For those not familiar with the NVIDIA Data Science Stack, it provides a complete, pre-installed system for the software you use every day, tuned for NVIDIA GPUs. Included on the pre-installed Ubuntu 20.04 Linux OS are Python 3.8, pandas, NumPy, SciPy, Numba, scikit-learn, TensorFlow, PyTorch, Keras, RAPIDS (cuDF, cuML, cuGraph), CuPy, and many more. The GPU-accelerated Python software speeds up machine learning tasks by 10x-30x. Examples include common ML algorithms such as K-means, logistic and linear regression, KNN, random forest classifiers, and XGBoost classifiers, all using NVIDIA RAPIDS. cuML is fully GPU accelerated and accepts CSV or Parquet file formats.
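To make this concrete: cuML deliberately mirrors the scikit-learn API, so code like the following K-means example ports to the GPU largely by swapping the import. The sketch below uses scikit-learn so it runs without a GPU; the commented cuml.cluster import is the RAPIDS equivalent.

```python
import numpy as np
from sklearn.cluster import KMeans  # on a RAPIDS install: from cuml.cluster import KMeans

# Toy data: 300 points clustered around 3 centers
rng = np.random.default_rng(0)
centers = [(0.0, 0.0), (3.0, 3.0), (0.0, 3.0)]
X = np.vstack([rng.normal(c, 0.1, size=(100, 2)) for c in centers])

# Fit K-means; the same call works against cuML's sklearn-compatible API
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_.shape)  # (3, 2)
```

The same pattern applies to the other algorithms listed above (logistic regression, KNN, random forests): the cuML class names and fit/predict calls track their scikit-learn counterparts.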
More about the Data Science Client (DSC)
NVIDIA Data Science Client (DSC) is currently a beta release and runs on your desktop as a status bar icon. It is optimized to use few system resources, and it monitors and updates itself, your NVIDIA driver, the CUDA SDK (including cuDNN), and all the Data Science Stack software described above. A GA version of the DSC is expected in late 2021.
DSC is a desktop complement to the command-line-oriented Data Science Stack. It is minimalist and unobtrusive, designed for ease of use and reproducibility. The DSC also provides one-click access to common tools such as VS Code and Spyder, but places emphasis on Jupyter as the main development environment, supporting a curated set of dockerized kernels, the majority of which are available as NGC assets.
The DSC also manages the latest set of NVIDIA GPU Cloud (NGC) containers. You can quickly launch NGC containers for RAPIDS, PyTorch, or TensorFlow into a locally running Jupyter notebook server as a tab in your Chrome browser. The DSC and the NVIDIA Data Science Stack (DSS) run the same software you run in a VM in the cloud, giving you confidence that Python source code developed on your NVIDIA GPU workstation or mobile workstation will run everywhere with predictable results.
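For a sense of what the DSC automates, launching a RAPIDS NGC container with Jupyter manually looks roughly like the command below. The image tag and port are illustrative; check the NGC catalog for current tags.

```shell
# Pull and run a RAPIDS container from NGC with GPU access,
# exposing the container's Jupyter server on localhost:8888
# (image tag is illustrative; see the NGC catalog for current releases)
docker run --gpus all --rm -it \
  -p 8888:8888 \
  nvcr.io/nvidia/rapidsai/rapidsai:0.19-cuda11.0-runtime-ubuntu20.04
```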
Learn more details about the Data Science Client (DSC) and how to download it.
Windows Subsystem for Linux 2 (WSL2) support
WSL2 support is available now as a Public Preview running on pre-release versions of Windows 10. WSL2 is a technology that allows Windows desktop users to run a Linux shell. NVIDIA has enabled CUDA to run at full performance in the WSL2 shell and is testing RAPIDS and the entire suite of Data Science Stack software with WSL2.
WSL2 means that my data science Python software, including Jupyter notebooks, plus Office productivity tools (Excel, Outlook, PowerPoint, etc.) all run in a single-boot Windows 10 image. There is no longer a need for dual boot.
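For reference, enabling WSL2 on a recent Windows 10 build looks roughly like the following (commands per Microsoft's WSL documentation; the distribution name is illustrative):

```shell
# From an elevated PowerShell prompt on Windows 10
# (a pre-release / Insider build is currently required for CUDA support)
wsl --install -d Ubuntu-20.04

# Ensure new distributions use WSL2 (not WSL1), which is required for GPU access
wsl --set-default-version 2
```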
Data science workstations in action
NVIDIA knows of many data science workloads that run exceptionally well on mobile workstations built on the NVIDIA Data Science Stack. Some of these environments and workloads will be demonstrated in the following GTC21 sessions:
Machine Learning with PyTorch Lightning and Grid.ai from Your GPU Enabled Workstation [S32153]
From Laptops to SuperPODs: Seamless Scale for Model Development [S32160]
Eliminating Reproducibility and Portability Issues for Data Science Workflows, from Laptop to Cloud and Back [S32169]
Collaborative Debugging and Visualizing of Machine Learning Models on NVIDIA Workstations [S32156]
We have also seen many new and innovative deep learning workloads such as Heartex Label Studio that run well on mobile workstations.
As businesses extend the power of AI and data science to every developer, IT needs to deliver seamless, scalable access to supercomputing with cloud-like simplicity and security. At GTC21, we introduced the latest NVIDIA DGX SuperPOD, which gives business, IT and their users a platform for securing and scaling AI across the enterprise …
Applying for a home mortgage can resemble a part-time job. But whether consumers are seeking out a home loan, car loan or credit card, there’s an incredible amount of work going on behind the scenes in a bank’s decision — especially if it has to say no. To comply with an alphabet soup of financial …
The factories of the future will have a soul — a “digital twin” that blends man and machine in stunning new ways. In a demo blending reality and virtual reality, robotics and AI, to manage one of BMW’s automotive factories, NVIDIA CEO Jensen Huang Monday rolled out a stunning vision of the future of manufacturing.
Breakthroughs in 3D model visualization, such as real-time ray-traced rendering and immersive virtual reality, are making architecture and design workflows faster, better and safer. At GTC this week, NVIDIA announced the newest advances for the AEC industry with the latest NVIDIA Ampere architecture-based enterprise desktop RTX GPUs, along with an expanded range of mobile laptop GPUs. AEC professionals will also want to learn more about NVIDIA Omniverse Enterprise, an open platform …
NVIDIA technology has been behind some of the world’s most stunning virtual reality experiences. Each new generation of GPUs has raised the bar for VR environments, producing interactive experiences with photorealistic details to bring new levels of productivity, collaboration and fun. And with each GTC, we’ve introduced new technologies and software development kits that help …
Quantum computing has the potential to offer giant leaps in computational capabilities. Until it becomes a reality, scientists, developers, and researchers are simulating quantum circuits on classical computers.
NVIDIA cuQuantum is an SDK of optimized libraries and tools for accelerating quantum computing workflows. Developers can use cuQuantum to speed up quantum circuit simulations based on state vector, density matrix, and tensor network methods by orders of magnitude.
The research community, including academia, laboratories, and private industry, is using simulators to help design and verify algorithms to run on quantum computers. These simulators capture the properties of superposition and entanglement, and are built on quantum circuit simulation frameworks such as Qiskit, Cirq, ProjectQ, and Q#.
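To make the state-vector method concrete, here is a minimal sketch in plain NumPy (not the cuQuantum API) of simulating a two-qubit circuit, a Hadamard followed by a CNOT, by multiplying gate matrices into a 4-amplitude state vector:

```python
import numpy as np

# State vector of 2 qubits: 4 complex amplitudes, initialized to |00>
state = np.zeros(4, dtype=complex)
state[0] = 1.0

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)   # Hadamard gate
I = np.eye(2)
CNOT = np.array([[1, 0, 0, 0],                 # CNOT: flips qubit 1 when qubit 0 is |1>
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=complex)

# Apply H to qubit 0 (tensored with identity on qubit 1), then CNOT
state = np.kron(H, I) @ state
state = CNOT @ state

# Result is the entangled Bell state (|00> + |11>) / sqrt(2)
print(np.round(state.real, 3))  # [0.707 0.    0.    0.707]
```

The cost of this approach doubles with every added qubit (the state vector has 2^n amplitudes), which is exactly why GPU acceleration and tensor-network contractions, as in cuQuantum, matter for larger circuits.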
We showcase accelerated quantum circuit simulation results based on industry estimations, extrapolations, and benchmarks on real-world computers like ORNL’s Summit and NVIDIA’s Selene, and reference collaborations with numerous industry partners.
“Using the Cotengra/Quimb packages, NVIDIA’s new cuQuantum SDK, and the Selene supercomputer, we’ve generated a sample of the Sycamore quantum circuit at depth=20 in record time (less than 10 minutes). This sets the benchmark for quantum circuit simulation performance and will help advance the field of quantum computing by improving our ability to verify the behavior of quantum circuits.”
Johnnie Gray, Research Scientist, Caltech; Garnet Chan, Bren Professor of Chemistry, Caltech
Learn more about cuQuantum, review our latest benchmark results, and apply for early interest today.
How do I replace the TensorFlow 1.14 code snippet feature_columns = [tf.contrib.layers.real_valued_column("", dimension=98)] in TensorFlow 2.4.1?
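A sketch of the usual migration: tf.contrib was removed entirely in TensorFlow 2.x, and the closest replacement for real_valued_column is tf.feature_column.numeric_column. The "x" key below is an illustrative placeholder, since the original snippet used an empty string as the feature name.

```python
import numpy as np
import tensorflow as tf

# TF 1.x (tf.contrib no longer exists in TF 2.x):
#   feature_columns = [tf.contrib.layers.real_valued_column("", dimension=98)]
#
# TF 2.4 equivalent; dimension=98 becomes shape=(98,),
# and "x" is a placeholder feature key:
feature_columns = [tf.feature_column.numeric_column("x", shape=(98,))]

# In TF 2.x, feature columns are consumed via a DenseFeatures Keras layer
layer = tf.keras.layers.DenseFeatures(feature_columns)
out = layer({"x": np.zeros((2, 98), dtype="float32")})
print(out.shape)  # (2, 98)
```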