Categories
Misc

July NVIDIA Studio Driver Improves Performance for Chaos V-Ray 6 for 3ds Max

Creativity heats up In the NVIDIA Studio as the July NVIDIA Studio Driver, available now, accelerates the recent Chaos V-Ray 6 for 3ds Max release. Plus, this week’s In the NVIDIA Studio 3D artist, Brian Lai, showcases his development process for Afternoon Coffee and Waffle, a piece that went from concept to completion faster with NVIDIA RTX acceleration in Chaos V-Ray rendering software.

The post July NVIDIA Studio Driver Improves Performance for Chaos V-Ray 6 for 3ds Max appeared first on NVIDIA Blog.

Categories
Misc

A Guide to Understanding Essential Speech AI Terms

NVIDIA Riva streamlines the process of creating ASR services, providing the tools and methodologies to take you all the way from raw data to a ready-to-use service.

Speech AI is the ability of intelligent systems to communicate with users using a voice-based interface, which has become ubiquitous in everyday life. People regularly interact with smart home devices, in-car assistants, and phones through speech. Speech interfaces have improved by leaps and bounds in recent years, making them a much more pleasant, practical, and natural experience than just a decade ago.

Components of intelligent systems with a speech AI interface include the following:

  • Automatic speech recognition (ASR): Converts audio signals into text.
  • A fulfillment engine: Analyzes the text, identifies the user’s intent, and fulfills it.
  • Text-to-speech (TTS): Converts the textual elements of the response into high-quality and natural speech.
Figure 1. An intelligent system with a speech AI interface 

ASR is the first component of any speech AI system and plays a critical role. Any error made early in the ASR phase is then compounded by issues in the subsequent intent analysis and fulfillment phase.

There are over 6500 spoken languages in use today, and most of them don’t have commercial ASR products. ASR service providers cover a few dozen at most. NVIDIA Riva currently covers five languages (English, Spanish, German, Mandarin, and Russian), with more scheduled for the upcoming releases. 

While this set is still small, Riva provides a ready-to-use workflow, tools, and guidance for bringing up an ASR service for a new language quickly, systematically, and easily. In this post, we present the workflow, tools, and best practices that the NVIDIA engineering team employed to build new world-class Riva ASR services. Start the journey!

The anatomy of a Riva ASR pipeline

Take a deeper look into the inner workings of a Riva ASR pipeline, which includes the following main components:

  • Feature extractor: Raw temporal audio signals first pass through a feature extraction block, which segments the data into fixed-length blocks (for example, of 80 ms each), then converts the blocks from the temporal domain to the frequency domain (spectrogram). 
  • Acoustic model: Spectrogram data is then fed into an acoustic model, which outputs probabilities over characters (or more generally, text tokens) at each time step. 
  • Decoder and language model: A decoder converts this matrix of probabilities into a sequence of characters (or text tokens), which form words and sentences. A language model can give a sentence score indicating the likelihood of a sentence appearing in its training corpus. An advanced decoder can inspect multiple hypotheses (sentences) while combining the acoustic model score and the language model score and searching for the hypothesis with the highest combined score.
  • Punctuation and capitalization (P&C): The text produced by the decoder comes without punctuation and capitalization, which is the job of the Punctuation and Capitalization model. 
  • Inverse text normalization (ITN): Finally, ITN rules are applied to transform the text in verbal format into a desired written format, for example, “ten o’clock” to “10:00”, or “ten dollars” to “$10”.
Figure 2. Anatomy of a Riva ASR pipeline 

Riva ASR workflow for a new language

Like solving other AI and machine learning problems, creating a new ASR service from scratch is a capital-intensive task involving data, computation, and expertise. Riva significantly reduces these barriers. 

With Riva, making an ASR service for a new language requires, at a minimum, collecting data and training a new acoustic model. The feature extractor and the decoder are readily provided.

The language model is optional, but it often improves the accuracy of the pipeline by up to a few percent and is usually well worth the effort. P&C and ITN further improve text readability for human consumption or downstream processing tasks.

The Riva new language workflow is divided into the following major phases:

  • Data collection
  • Data preparation
  • Training and validation
  • Riva deployment
Figure 3. Riva new language workflow

In the next sections, we discuss the details of each stage.

Phase 1: Data collection

When adapting Riva to a whole new language, a large amount of high-quality transcribed audio data is critical for training high-quality acoustic models. Where applicable, there are several significant sources of public datasets that you can readily leverage, such as Mozilla Common Voice (MCV), Multilingual LibriSpeech (MLS), and VoxPopuli.

To train world-class Riva models, we also acquired proprietary datasets. The amount of data used for Riva production models ranges from roughly 1,700 to 16,700 hours!

Phase 2: Data preparation

The data preparation phase carries out the steps required to convert the diverse raw audio datasets into a uniform format that can be efficiently digested by the NVIDIA NeMo toolkit, which is used for training. The main steps are as follows:

  • Data preprocessing
  • Data cleaning/filtering
  • Binning
  • Train and test splitting
  • Tarring

Data preprocessing

Data preprocessing is required to convert the audio or text data input into a readable format for your machine learning algorithms.

Audio data

Audio data acquired from various sources is inherently heterogeneous (file format, sample rate, bit depth, number of audio channels, and so on). As a preprocessing step, you build a separate data ingestion pipeline for each source and convert the audio data to a common format: 

  • WAV file format
  • Bit depth of 16 bits 
  • Sample rate of 16 kHz
  • Single audio channel 

Dataset ingestion scripts are used to convert the various datasets into the standard manifest format. 
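As an illustration of such an ingestion step, here is a minimal Python sketch that resamples one file with ffmpeg (assumed to be installed, along with the soundfile package) and appends a NeMo-style manifest entry; the file paths and transcript are hypothetical.

```python
import json
import subprocess
import soundfile as sf  # pip install soundfile

def convert_and_register(src_path, dst_path, transcript, manifest_file):
    """Convert one audio file to 16 kHz, 16-bit, mono WAV and append a manifest entry."""
    # Resample and downmix with ffmpeg (assumed to be installed).
    subprocess.run(
        ["ffmpeg", "-y", "-i", src_path, "-ac", "1", "-ar", "16000",
         "-acodec", "pcm_s16le", dst_path],
        check=True,
    )
    # Read the converted file back to record its duration.
    audio, sample_rate = sf.read(dst_path)
    entry = {
        "audio_filepath": dst_path,
        "duration": len(audio) / sample_rate,
        "text": transcript,
    }
    manifest_file.write(json.dumps(entry, ensure_ascii=False) + "\n")

# Hypothetical usage with made-up paths and transcript.
with open("train_manifest.json", "w", encoding="utf-8") as manifest:
    convert_and_register("raw/clip_0001.mp3", "wav/clip_0001.wav",
                         "guten morgen", manifest)
```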

Text data

Text cleaning removes characters that are not part of the language alphabet. For example, we observed and removed some Chinese characters in the public dataset for German collected from MCV, MLS, and Voxpopuli.

Text normalization converts text from written form into its verbalized form. It is used as a preprocessing step for ASR training transcripts.

Next, you build a text tokenizer. There are two popular encoding choices for acoustic models: character encoding and subword encoding. The models are otherwise almost identical; the primary difference is that a subword encoding model accepts a subword-tokenized text corpus and emits subword tokens in its decoding step. Research and practice have shown that subword encoding helps improve the accuracy of acoustic models.
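As an example, a subword tokenizer can be trained with the SentencePiece library on the normalized transcript corpus. This is only a sketch, with an assumed corpus file and an arbitrary vocabulary size rather than the exact settings used for Riva models.

```python
import sentencepiece as spm

# Train a unigram subword tokenizer on the normalized transcripts
# (one sentence per line in corpus.txt; vocab size chosen arbitrarily here).
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="asr_tokenizer",
    vocab_size=1024,
    model_type="unigram",
)

# Load it and tokenize a sample sentence into subword tokens.
tokenizer = spm.SentencePieceProcessor(model_file="asr_tokenizer.model")
print(tokenizer.encode("speech recognition is fun", out_type=str))
```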

Data cleaning and filtering

This step is carried out to filter out some outlying samples in the datasets. As the simplest procedure, samples that are too long, too short, or empty are filtered out. 

You can also filter out samples that are considered ‘noisy’: those with a high word error rate (WER) or character error rate (CER) with respect to a previously trained ASR model.

A manual inspection of these noisy samples can also reveal problematic issues with some samples, such as the transcription not matching the audio.
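A minimal sketch of such a filter over a NeMo-style manifest follows. The duration limits and WER threshold are illustrative, and the pred_text field (hypothesis text from a previously trained model) is a hypothetical placeholder.

```python
import json

def word_error_rate(ref, hyp):
    """Word-level edit distance divided by reference length (a simple WER)."""
    r, h = ref.split(), hyp.split()
    d = list(range(len(h) + 1))
    for i in range(1, len(r) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(h) + 1):
            cur = min(d[j] + 1, d[j - 1] + 1, prev + (r[i - 1] != h[j - 1]))
            prev, d[j] = d[j], cur
    return d[len(h)] / max(len(r), 1)

def filter_manifest(in_path, out_path, min_dur=0.1, max_dur=20.0, max_wer=0.75):
    """Drop samples that are empty, too short, too long, or too noisy."""
    kept = dropped = 0
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            sample = json.loads(line)
            text = sample["text"].strip()
            # 'pred_text' is a hypothetical field holding a previously trained
            # model's transcript, used here to estimate how noisy the sample is.
            hyp = sample.get("pred_text", text)
            ok = (text != ""
                  and min_dur <= sample["duration"] <= max_dur
                  and word_error_rate(text, hyp) <= max_wer)
            if ok:
                fout.write(line)
                kept += 1
            else:
                dropped += 1
    print(f"kept {kept} samples, dropped {dropped}")

filter_manifest("train_manifest.json", "train_manifest_filtered.json")
```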

Binning

For training ASR models, audio with different lengths may be grouped into a batch, with padding to make them all the same length. The extra padding is a significant source of computation waste. 

Splitting the training samples into buckets with different lengths and sampling from the same bucket for each batch increases the computation efficiency. It may result in a training speedup of more than 2x.
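The sketch below illustrates the idea of duration-based bucketing over a manifest; the number of buckets is arbitrary, and the exact bucketing strategy used for Riva training may differ.

```python
import json

def split_into_buckets(manifest_path, num_buckets=4):
    """Group samples of similar duration so that batches need little padding."""
    with open(manifest_path, encoding="utf-8") as f:
        samples = [json.loads(line) for line in f]
    samples.sort(key=lambda s: s["duration"])
    bucket_size = (len(samples) + num_buckets - 1) // num_buckets
    return [samples[i:i + bucket_size] for i in range(0, len(samples), bucket_size)]

# Each bucket can then be written to its own manifest and sampled from per batch.
buckets = split_into_buckets("train_manifest_filtered.json")
```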

Train and test splitting

This step is a staple of any deep learning and machine learning development pipeline, to ensure that the model is learning to generalize without overfitting the training data. For the test set, use additionally curated data that isn’t from the same source as the training datasets, such as YouTube and TED talks.

Tarring

If the experiments run on a cluster with datasets stored on a distributed file system, you will likely want to avoid constantly reading many small files and will tar the audio files instead.
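NeMo ships tooling for building tarred datasets; purely to illustrate the idea, the following sketch shards the audio files referenced by a manifest into a handful of tar archives using only the Python standard library (shard count and paths are arbitrary).

```python
import json
import tarfile

def tar_audio_shards(manifest_path, num_shards=8, prefix="audio_shard"):
    """Pack the audio files referenced by a manifest into a few large tar files."""
    with open(manifest_path, encoding="utf-8") as f:
        samples = [json.loads(line) for line in f]
    shards = [tarfile.open(f"{prefix}_{i}.tar", "w") for i in range(num_shards)]
    for i, sample in enumerate(samples):
        shards[i % num_shards].add(sample["audio_filepath"])
    for shard in shards:
        shard.close()

tar_audio_shards("train_manifest_filtered.json")
```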

Phase 3: Training and validation

An ASR pipeline includes the following models:

  • Acoustic model:  Maps raw audio input to probabilities over text tokens at each time step. This matrix of probabilities is fed into a decoder that converts probabilities into a sequence of text tokens.
  • (Optional) Language model:  Used in the decoding phase of the acoustic model output.
  • (Optional) P&C model:  Formats the raw transcript, augmenting with punctuation and capitalization.
  • (Optional) ITN model:  Produces a desired written format from a spoken format.

Acoustic model

The acoustic models are the most important part of an ASR service. They are the most resource-intensive models, requiring a large amount of data to train on powerful GPU servers or clusters. They also have the largest impact on the overall ASR quality.

Some acoustic models supported by Riva are QuartzNet, CitriNet, Jasper, and Conformer.

Cross-language transfer learning is especially helpful when training new models for low-resource languages. But even when a substantial amount of data is available, cross-language transfer learning can help boost the performance further. It is based on the idea that phoneme representation can be shared across different languages.

When carrying out transfer learning, you must use a lower learning rate compared to training from scratch. When training models such as Conformer and CitriNet, we have also found that using large batch sizes in the [256, 2048] range helps stabilize the training loss. 

All Riva ASR models in production other than English were trained with cross-language transfer learning from an English base model that was trained with the most audio hours.

Figure 4. Cross-language transfer learning from English in Riva
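As a rough sketch of how this transfer learning recipe could look with NeMo, the snippet below loads a pretrained English Conformer-CTC checkpoint, swaps in the new language’s tokenizer, and fine-tunes with a reduced learning rate. The checkpoint name, configuration fields, and hyperparameters are assumptions and will need adjusting for a real run.

```python
import pytorch_lightning as pl
from omegaconf import OmegaConf
import nemo.collections.asr as nemo_asr

# Start from the English Conformer-CTC checkpoint published on NGC.
model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained(
    model_name="stt_en_conformer_ctc_large"
)

# Swap the English tokenizer for the new language's subword tokenizer;
# this re-initializes the decoder outputs for the new vocabulary.
model.change_vocabulary(new_tokenizer_dir="asr_tokenizer_dir",
                        new_tokenizer_type="bpe")

# Point the model at the new-language manifest (illustrative settings).
model.setup_training_data(OmegaConf.create({
    "manifest_filepath": "train_manifest_filtered.json",
    "sample_rate": 16000,
    "batch_size": 32,
    "shuffle": True,
}))

# Use a lower learning rate than when training from scratch, then fine-tune.
model.cfg.optim.lr = 1e-4
trainer = pl.Trainer(accelerator="gpu", devices=1, max_epochs=50)
trainer.fit(model)
```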

Language model

A language model can give a score indicating the likelihood of a sentence appearing in its training corpus. For example, a model trained on an English corpus judges “Recognize speech” as more likely than “Wreck a nice peach.” It also judges “Je suis un étudiant” as quite unlikely, as that is a French sentence.

The language model, combined with beam search in the decoding phase, can further improve the quality of the ASR pipeline. In our experiments, we generally observe an additional 1–2% WER reduction from using a simple n-gram model.

When coupled with a language model, a decoder would be able to correct what it “hears” (for example, “I’ve got rose beef for lunch”) to what makes more sense (“I’ve got roast beef for lunch”). The model gives a higher score for the latter sentence than the former.

Create a training set by combining all the transcript text in the ASR set, then normalizing, cleaning, and tokenizing it with the same tokenizer used for the ASR transcript preprocessing mentioned earlier. The language models supported by Riva are n-gram models, which can be trained with the KenLM toolkit.
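As an example, an n-gram model can be built with the KenLM command-line tools on the prepared corpus. The sketch below assembles the corpus from a manifest and then invokes lmplz and build_binary; paths and the n-gram order are illustrative.

```python
import json
import subprocess

# Collect the transcripts (reusing the same normalization and tokenization as
# the ASR training transcripts) and write one sentence per line.
with open("train_manifest_filtered.json", encoding="utf-8") as fin, \
     open("lm_corpus.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(json.loads(line)["text"] + "\n")

# Train a 4-gram language model with KenLM and binarize it for fast loading.
with open("lm_corpus.txt") as corpus, open("lm.arpa", "w") as arpa:
    subprocess.run(["lmplz", "-o", "4"], stdin=corpus, stdout=arpa, check=True)
subprocess.run(["build_binary", "lm.arpa", "lm.bin"], check=True)
```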

P&C model

The P&C model consists of a pretrained Bidirectional Encoder Representations from Transformers (BERT) model followed by two token classification heads. One classification head is responsible for the punctuation task, and the other handles the capitalization task.
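As a usage illustration, NeMo’s pretrained punctuation and capitalization model can be applied to raw ASR output roughly as follows; the checkpoint name and method are based on the public NeMo collections and should be treated as assumptions that may vary across versions.

```python
from nemo.collections.nlp.models import PunctuationCapitalizationModel

# Load a pretrained BERT-based punctuation and capitalization model.
pc_model = PunctuationCapitalizationModel.from_pretrained("punctuation_en_bert")

# Restore punctuation and capitalization on raw ASR transcripts.
raw = ["how are you doing today"]
print(pc_model.add_punctuation_capitalization(raw))
```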

ITN model

Use the NeMo text inverse normalization module for the task. NeMo ITN is based on weighted finite-state transducer (WFST) grammars. The tool uses Pynini to construct WFSTs, and the created grammars can be exported and integrated into Sparrowhawk (an open-source version of the Kestrel TTS text normalization system) for production.
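A minimal usage sketch of the module is shown below; the import path and constructor arguments are assumptions that may differ between nemo_text_processing releases.

```python
from nemo_text_processing.inverse_text_normalization.inverse_normalize import InverseNormalizer

# Build the WFST-based inverse normalizer for English.
normalizer = InverseNormalizer(lang="en")

# "ten o'clock" -> "10:00", "ten dollars" -> "$10"
for phrase in ["ten o'clock", "ten dollars"]:
    print(normalizer.inverse_normalize(phrase, verbose=False))
```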

Phase 4: Riva deployment

When all the models have been trained, it’s time to deploy them to Riva for serving.

Bring your own models

Given the final .nemo models that you have trained so far, here are the steps and tools that are required to deploy on Riva:

  • The Riva Quickstart scripts provide the nemo2riva conversion tool, along with scripts (riva_init.sh, riva_start.sh, and riva_start_client.sh) to download the servicemaker, riva-speech-server, and riva-speech-client Docker images.
  • Build .riva assets using the nemo2riva command in the servicemaker container.
  • Build RMIR assets using the riva-build tool in the servicemaker container.
  • Deploy the model in .rmir format with riva-deploy.
  • Start the server with riva-start.sh.

After the server successfully starts up, you can query the service to measure accuracy, latency, and throughput.

Riva pretrained models on NGC

Alternatively, you can make use of Riva pretrained models published on NGC. These models can be deployed as-is or serve as a starting point for fine-tuning and further development.

Case study: German

For German, several significant sources of public datasets are readily accessible: Mozilla Common Voice (MCV), Multilingual LibriSpeech (MLS), and VoxPopuli.

In addition, we acquired proprietary data for a total of 3,500 hours of training data!

We started the training of the final model from a NeMo DE Conformer-CTC large model (trained on 567 hours of MCV 7.0, 1,524 hours of MLS, and 214 hours of VoxPopuli), which itself was trained using an English Conformer model as initialization (Figure 5).

Figure 5. Riva German acoustic model training workflow

All Riva German assets are published on NGC (including .nemo, .riva, .tlt, and .rmir assets). You can use these models as starting points for your development.

Acoustic models

ITN model

NGC provides an OpenFST finite state archive (.far) for use within the open source Sparrowhawk normalization engine and Riva.

Language model

4-gram language models trained with Kneser-Ney smoothing using KenLM are available from NGC. This directory also contains the decoder dictionary used by the Flashlight decoder.

P&C model

NGC provides a Riva P&C model for German.

Case study: Hindi

For Hindi, you can readily access the Hindi-Labelled ULCA-asr-dataset-corpus public dataset:

  • Newsonair (791 hours)
  • Swayamprabha (80 hours)
  • Multiple sources (1,627 hours)

We started the training of the Hindi Conformer-CTC medium model from a NeMo En Conformer-CTC medium model as initialization. The Hindi model’s encoder is initialized with the English model’s encoder weights, and the decoder is initialized from scratch (Figure 6).

Figure 6. Hindi acoustic model training

Getting started and bring your own languages

The NVIDIA Riva speech AI ecosystem (including NVIDIA TAO and NeMo) offers comprehensive workflows and tools for new languages, providing a systematic approach to bringing your own language onboard.

Whether you are fine-tuning an existing language model for a domain-specific application or implementing one for a brand-new dialect with little or lots of data, Riva offers those capabilities. 

For more information about how NVIDIA Riva ASR engineering teams bring up a new language, see the Riva new language tutorial series and apply it to your own project. 

Categories
Misc

Best Practices for Using NVIDIA RTX Ray Tracing (Updated)

Optimize your use of NVIDIA RTX with these in-depth ray tracing tips.

This post gathers best practices based on our experiences so far using NVIDIA RTX ray tracing in games. The practical tips are organized into short, actionable items for developers working on ray tracing today. They aim to provide insight into what kind of solutions lead to good performance in most cases. To find the optimal solution for a specific case, I always recommend profiling and experimenting.

Common abbreviations and short terms used in this post:

  • AABB: Axis-aligned bounding box
  • AS: Acceleration structure
  • BLAS: Bottom-level acceleration structure
  • Geometry: A geometry in a BLAS
  • Instance: An instance of a BLAS in a TLAS
  • TLAS: Top-level acceleration structure

Acceleration structures

This section focuses on the building and management of ray-tracing acceleration structures, which is the starting point for using ray tracing for any purpose. Topics include:

  • General tips
  • Maximizing GPU utilization when building
  • Memory allocations
  • Organizing geometries into BLASes
  • Build preference flags
  • Dynamic BLASes
  • Non-opaque geometries
  • Particles

General tips

Consider async compute for AS building. Especially in hybrid rendering, where G-buffer or shadow maps are rasterized, it’s potentially beneficial to execute AS building on async compute.

Consider worker threads for generating AS building command lists. Generating AS building commands can include a considerable amount of CPU-side work. It can be directly in the AS build calls or in some related task like the culling of the objects. Moving the CPU work to one or more worker threads is potentially beneficial.

Cull instances for TLAS. Typically, including the entire scene in the TLAS is not optimal. Instead, cull instances depending on the situation. For example, consider culling based on an expanded camera frustum. Maximum distance can often be less than the far plane distance in rasterization. You can also consider instance size when culling so that smaller instances are culled at a shorter distance.

Use appropriate Level of Detail (LOD) for instances. Like in rasterization, using the most detailed geometry LOD for everything is typically suboptimal. LODs used for far-away objects can be simpler. In hybrid rendering, using the same LOD for rasterization and ray tracing can be considered. It’s an efficient way to avoid self-intersection artifacts such as surface shadowing itself.

Also consider using lower-detail LODs in ray tracing, especially to reduce the updating cost of dynamic BLASes. If the LODs between rasterization and ray tracing don’t match, enabling back face culling is often needed in ray tracing to prevent the self-intersections. For more information about LODs in ray tracing, and how to implement stochastic LODs, see Implementing Stochastic Levels of Detail with Microsoft DirectX Raytracing.

Flag geometries or instances as opaque whenever possible. Flagging instances or geometries as opaque allows uninterrupted hardware intersection search and prevents invocation of the any-hit shader. Do this whenever possible. Enable the use of any-hit shaders only for those geometries that need it; for example, to do alpha testing.

Use triangle geometries when possible. Hardware excels in performing ray-triangle intersections. Ray-box intersections are accelerated too, but you get the most out of the hardware when tracing against triangle geometries.

Maximizing GPU utilization when building

Batch vertex deformations and BLAS builds. Consecutively execute all vertex deformation calls that produce triangles used as input for BLAS building and all BLAS build calls. Do not place resource barriers between consecutive calls. This allows the driver to parallelize the calls to an extent. All BLAS build calls need unique scratch memory to allow execution without barriers.

The individual UAV barriers for each resource holding BLASes are not needed. Instead, you can have a single global UAV barrier before TLAS build to ensure all BLAS builds are completed, regardless of the resource where they reside.

Consider merging small vertex deformation calls. Often, calls that output deformed vertices for one geometry or instance are lightweight and do not fill the entire GPU even when executed without barriers between consecutive calls. Merging the processing of several geometries or instances to happen in one call can increase GPU utilization and result in better performance.

Memory allocations

Pool small allocations. BLASes can be small, sometimes only a few kilobytes. Using a separate committed resource to store each such small BLAS is not optimal. Instead, pool them with larger resources. Pooling saves memory and often increases performance. One option is to use placed resources in a large resource heap.

Alternatively, many BLASes can be stored in a single buffer by manually suballocating sections from the buffer. This allows even tighter packing of BLASes into memory, as the suballocations only have to follow 256-byte alignment. Regardless of the pooling mechanism, avoid memory fragmentation to keep the benefits achieved by pooling. For more information, see Managing Memory for Acceleration Structures in DirectX Raytracing.

Consider compacting static BLASes. Compacting BLASes saves memory and can increase performance. Reduction in memory consumption depends on the geometries but can be up to about 50%. As the compacted size needs to be read back to the CPU after the BLAS build has been completed on GPU, this is most practical for BLASes that are only built one time. Remember to pool small allocations and avoid memory fragmentation to get the maximum benefit from compaction. For more information, see Tips: Acceleration Structure Compaction.

Organizing geometries into BLASes

Consider splitting a BLAS when there is a lot of empty space in an instance’s world-space AABB. World-space AABBs are used to test whether a ray potentially hits an instance and traversing its associated BLAS is required. A significant amount of empty space can lead to unnecessary traversal through the BLAS.

Geometries that move independently should usually be in their own BLASes. Merging them into a single BLAS can lead to an AABB with lots of empty space, and unnecessary rebuilding of the BLAS instead of simply changing transformations of the independent instances.

Figure 1. Geometries in two BLASes with overlapping AABBs with a lot of empty space (left). After geometries are split into four independent BLASes, the AABBs no longer overlap (right).

Consider merging BLASes when instance world-space AABBs overlap significantly. When world-space AABBs of instances overlap, each ray going through that region must process separately all the overlapping BLAS instances to find potential intersections. Traversing through one merged BLAS would be more efficient.

Tracing performance against a BLAS doesn’t depend on the number of geometries in it. Geometries merged into a single BLAS can still have unique materials.

Figure 2. Independent BLAS instances with overlapping AABBs (left) and one merged BLAS instance (right).

Instantiate BLASes when possible. Instancing BLASes saves memory. It can also increase ray-tracing performance. Instances can have unique materials and transformations. In the case where the AABBs of the instances overlap a lot, replicating and merging them into a single BLAS as multiple geometries can still be a better choice, despite the increased memory consumption.

Avoid elongated triangles in geometries. Long, thin triangles have non-optimal bounding volumes with lots of empty space. They easily overlap with many other bounding volumes. This leads to non-optimal performance when tracing a ray against the geometry.

The driver can mitigate the issues to an extent, depending on the geometry. The first such triangle isn’t likely to cause problems, but too many of them will, so I recommend avoiding them when possible; for example, by splitting them into smaller triangles.

Don’t include sky geometry in TLAS. A skybox or skysphere would have an AABB that overlaps with everything else and all rays would have to be tested against it. It’s more efficient to handle sky shading in the miss shader rather than in the hit shader for the geometry representing the sky.

Build preference flags

For TLAS, consider the PREFER_FAST_TRACE flag and perform only rebuilds. Often, this results in best overall performance. The rationale is that making the TLAS as high quality as possible regardless of the movement occurring in the scene is important and doesn’t cost too much.

For static BLASes, use the PREFER_FAST_TRACE flag. For all BLASes that are built only one time, optimizing for best ray-trace performance is an easy choice.

For dynamic BLASes, choose between using the PREFER_FAST_TRACE or PREFER_FAST_BUILD flags, or neither. For BLASes that are occasionally rebuilt or updated, the optimal build preference flag depends on many factors. How much is built? How expensive are the ray traces? Can the build cost be hidden by executing builds on async compute? To find the optimal solution for a specific case, I recommend trying different options.

Dynamic BLASes

Reuse the old BLAS when possible. Whenever you know that vertices of a BLAS have not moved after the previous update, continue using the old BLAS.

Update the BLAS only for visible objects. When instances are culled from the TLAS, also exclude their culled BLASes from the BLAS update process.

Consider skipping updates based on distance and size. Sometimes it’s not necessary to update a BLAS on every frame, depending on how large it is on the screen. It may be possible to skip some updates without causing noticeable visual errors.

Rebuild BLASes after large deformations. BLAS updates are a good choice after limited deformations, as they are significantly cheaper than rebuilds. However, large deformations after the previous rebuild can lead to non-optimal ray-trace performance. Elongated triangles amplify the issue.

Consider rebuilding updated BLASes periodically. It can be non-trivial to detect when a geometry has been deformed too much and would require a rebuild to restore optimal ray-trace performance. Simply periodically rebuilding all BLASes can be a reasonable approach to avoid significant performance implications, regardless of deformations.

Distribute rebuilds over frames. Because rebuilds are considerably slower than updates, many rebuilds on a single frame can lead to stuttering. To avoid this, it’s good practice to distribute the rebuilds over frames.

Consider using only rebuilds with unpredictable deformations. In some cases, when the geometry deformation is large and rapid enough, it’s beneficial to omit the ALLOW_UPDATE flag when building the BLAS and always just rebuild it. If needed, using the PREFER_FAST_BUILD flag to reduce the cost of rebuilding can be considered. In extreme cases, using the PREFER_FAST_BUILD flag results in better overall ray-tracing performance than using the PREFER_FAST_TRACE flag and updating.

Avoid triangle topology changes in BLAS updates. Topology changes in an update mean that triangles degenerate or revive. That can lead to non-optimal ray-trace performance if the positions of the degenerate triangles do not represent the positions of the revived triangles. Occasional topology changes in “bending” deformations are typically not problematic, but larger topology changes in “breaking” deformations can be.

When possible, prefer having separate BLAS versions or using inactive triangles for different topologies caused by “breaking” deformations. A triangle is inactive when its vertex positions are NaN. If those alternatives are not possible, I recommend rebuilding the BLAS instead of updating it after topology changes. Topology changes through index buffer modifications are not allowed in updates.

Non-opaque geometries

Minimize the non-opaque area when possible. Invoking any-hit shader, typically for performing alpha testing, for non-opaque triangles interrupts hardware intersection search. When possible, minimizing the area not marked as opaque is a simple way to increase performance. Using more triangles to define the non-opaque area more accurately is likely a good trade-off.

Consider splitting to opaque and non-opaque geometries. When a well-defined part of geometry triangles can be considered fully opaque, splitting them into a separate geometry and marking it as opaque can be considered. The different geometries can still reside in the same BLAS.

Particles

Consider representing billboard particles as triangle geometries. One option for representing billboard particles in BLASes is to output the billboards as triangles, rotating part of the billboards 90 degrees around the vertical axis to different orientations. This allows utilization of the triangle intersection hardware while providing a reasonable approximation for the visual boundaries of the particles.

Consider alpha testing instead of blending. Depending on particle type, using alpha testing in secondary rays for particles that are blended when rendering primary visibility may offer reasonable visual quality. This approach works best for particles with clear boundaries. For particles representing things like smoke or fog, this is likely not applicable. For more information, see Ray Traced Reflections in ‘Wolfenstein: Youngblood’.

Avoid using degenerate triangles for dead particles. Degenerate triangles in updated BLASes can make the structure non-optimal for ray tracing. For particle systems with a dynamic number of live particles, I recommend considering other solutions like rebuilding the BLAS on each frame with the correct particle count.

Consider representing mesh particles as instances in TLAS. For particles rendered as triangle meshes, having a unique instance for each particle can be a reasonable solution. This is true when the particles get distributed around the scene so that individual rays do not often hit many instances. Instances should share the base mesh BLAS. Also, consider compacting the BLAS.

Hit shading

This section focuses on the shading of ray hits. Even seasoned graphics developers may benefit from fresh ideas when they start developing ray-tracing shaders, as the optimal solutions may differ from those in rasterization. Topics include:

  • General tips
  • Minimizing divergence
  • Any-hit shader
  • Shader resource binding
  • Inline ray tracing
  • Pipeline states

General tips

Keep the ray payload small. Registers are used to hold payload values and they reduce the number of registers otherwise available to hit shaders. I recommend avoiding careless payload usage, though adding complex code to pack values is rarely beneficial.

Use the payload access qualifiers. This feature becomes available in HLSL Shader Model 6.6. It allows specifying which shader stages write or read each field in the payload and makes it possible for the compiler to better optimize register usage, which can lead to higher occupancy and better performance. For maximum potential benefit, define the qualifiers for each field as accurately as possible. For more information, see DirectX-Specs on GitHub.

Consider writing a safe default value to unused payload fields. When a shader doesn’t use all the payload fields required by other shaders, it can still be beneficial to write a safe default value to the unused fields. This allows the compiler to discard the unused input value and use the payload register for other purposes before writing to it.

Terminate rays on the first hit when possible. When resolving the correct closest hit is not required (as for shadow rays), flagging rays with RAY_FLAG_ACCEPT_FIRST_HIT_AND_END_SEARCH or gl_RayFlagsTerminateOnFirstHitEXT is a simple and efficient optimization.

Use face culling only when required for correctness. Unlike in rasterization, enabling back- or front-face culling does not improve performance. Instead, it slightly slows ray traversal. Use them only when it is required to get the correct rendering result.

Minimize live state across ray-trace calls. Variables that are initialized before a TraceRay or traceRayExt call and used after it are live states that must be maintained across the call while invoking hit and miss shaders. The driver has a few different options to do it, but they all have a cost.

I recommend trying to minimize the amount of live state. Identifying such variables is not always trivial. NVIDIA and Microsoft are working together on a compiler feature for the automatic detection of a live state.

Avoid deep recursion. Deep, non-uniform ray recursion can get expensive.

Minimizing divergence

Use a separate hit shader for each material model. Reducing code and data divergence within hit shaders is helpful, especially with incoherent rays. In particular, avoid übershaders that manually switch between material models. Implementing each required material model in a separate hit shader gives the system the best possibilities to manage divergent hit shading.

When the material model allows a unified shader without much divergence, you can consider using a common hit shader for geometries with various materials.

Consider simplified shading. Often, replicating all features used in rendering primary visibility is not necessary when shading specular reflections or indirect diffuse illumination. Leaving out features does not always result in a significant visual difference, or the visual improvement does not justify the rendering cost. The more incoherent the rays, the less accurate the replication of primary visibility features typically needs to be. Also, as the hit distance grows, the shading can sometimes be further simplified.

Avoid direct conversion from vertex and pixel shaders. The approach that leads to optimal performance in hit shading is different from what is optimal for rasterization. In rasterization, having separate shader permutations for even small code differences can be beneficial. In hit shading, both reducing the divergence within individual hit shaders and the number of the separate hit shaders are helpful. Generally, I don’t recommend converting vertex and pixel shaders directly to hit shaders.

Consider moving common code outside of hit and miss shaders. When all hit shaders have a common part, I recommend moving that code away from hit shaders; for example, to the ray generation shader. Sometimes, there can be common code also in hit-and-miss shaders, such as when the approximation for the next bounce in hit shaders is the same as the approximation done for the first bounce in miss shader. Again, I recommend moving that common code outside of hit-and-miss shaders.

Any-hit shader

Prefer unified and simplified any-hit shaders. An any-hit shader is potentially executed a lot during ray traversal, and it interrupts the hardware intersection search. The cost of any-hit shaders can have a significant effect on overall performance. I recommend having a unified and simplified any-hit shader in a ray-tracing pass. Also, the full register capacity of the GPU is not available for any-hit shaders, as part of it is consumed by the driver for storing the ray state.

Optimize access to material data. In any-hit shaders, optimal access to material data is often crucial. A series of dependent memory accesses is a common pattern: load vertex indices, then vertex data, then sample textures. When possible, removing indirections from that path is beneficial.

When blending, remember the undefined order of hits. Hits along a ray are discovered, and the corresponding any-hit shader invocations happen, in an undefined order. This means that the blending technique must be order-independent. It also means that to exclude hits beyond the closest opaque hit, the ray distance must be limited properly. Additionally, you may need to flag the blended geometries with NO_DUPLICATE_ANYHIT_INVOCATION to ensure correct results. For more information, see Chapter 9 in Ray Tracing Gems.

Shader resource binding

Prefer the global root table (DXR) or direct descriptor access (Vulkan) when possible. Often, resources used by ray generation and miss shaders can be conveniently bound just like for compute shaders instead of binding through shader records. Also, hit shader resources that are used regardless of what was hit can typically be bound like that too. Having the same resource bound in all hit records is not optimal.

Consider bindless resources for hit shaders. Resources in unbounded descriptor tables (DXR) or unsized descriptor arrays (Vulkan), indexed by the hit-specific system values such as InstanceIndex or gl_InstanceID or values stored directly in the hit records (root constants in DXR) can be an efficient way to provide resources to hit shaders.

Consider root descriptors for index and vertex buffers. (DXR) As an alternative to unbounded descriptor tables, storing index and vertex buffer addresses directly in the hit records as root descriptors can be efficient. Out-of-bounds checks are not implicitly performed when accessing resources through root descriptors. Root descriptor addresses must follow four-byte alignment; adding a precomputed offset for 16-bit indices to the base address may break that alignment.

Use Root Signature version 1.1 and static descriptors when possible. (DXR) Root Signature 1.1 allows the driver to expect that descriptors are static; that is, they are not modified by the application after command lists have been recorded. This enables some potentially beneficial optimizations in the driver, especially when root descriptors are not used for accessing buffers. As with root descriptors, out-of-bounds checks are not implicitly performed with static descriptors. Additionally, both static and root descriptors must not be null.

Consider constructing shader tables on GPU. When there are many geometries and many ray-tracing passes, hit tables can grow large and uploading them can consume a considerable amount of time. Instead of uploading entire hit tables constructed on CPU, upload only the required new information on each frame, such as material indices for currently visible instances, and then execute a hit table construction pass on the GPU to be more efficient.

A large part of the information needed in the table construction can reside permanently in the GPU memory, such as hit group identifiers, vertex buffer addresses, and offsets for geometries.

Inline ray tracing

Consider thread group size 8×8 or larger. As a rule of thumb for compute shaders doing inline ray tracing, a thread group size of 8×8 can be used. Usually, it is efficient for the number of threads in a group to be a multiple of the GPU wave size. The wave size in NVIDIA GPUs is 32 threads.

However, using thread groups with only one wave limits the thread occupancy due to a limit in the number of groups simultaneously in execution. Having two waves in a group doubles the potential occupancy. The shader register and group shared memory consumption can also set limits to the occupancy. When the other factors allow, maximum thread occupancy can be reached starting from groups of three waves.

A practical choice for group size could then be 16×8 threads. Increasing the size much beyond this is usually not beneficial. Experimenting with different sizes reveals the optimal one for a specific case. The optimal size may be different for different hardware generations.

Avoid divergent shading with inline ray tracing. As hit shaders are not invoked based on hits, all shading happens inline in the shader that casts rays. Having divergent code paths or data accesses in the shader chosen based on hits can slow down the shading, especially with incoherent rays. When multiple different shading models are required, using DispatchRays or vkCmdTraceRaysKHR  is a better choice.

Use the hit-specific system values for bindless resource access with inline ray tracing. As bindings in hit records are not available, geometry-specific bindings must be provided by other means. Accessing resources in unbounded descriptor tables based on the hit-specific system values such as InstanceContributionToHitGroupIndex and GeometryIndex is a good practice.

I recommend avoiding indirections in accessing index, vertex, and material data when possible. For example, reading a resource index from a buffer based on a system value like InstanceID for selecting an index buffer may cause latency that is difficult to hide.

Prefer the compile-time ray flags. Both compile-time and runtime ray flags can be used with inline ray tracing. I recommend preferring the compile-time flags when possible, as they may enable beneficial compile-time optimizations.

Monitor the register consumption of the query objects. After initialization, the query objects must hold state for the ray traversal when the shader is executing code that may continue the traversal. This consumes registers and complex user code may limit occupancy sooner than usual. The situation is similar to executing any-hit shaders in a DispatchRays or vkCmdTraceRaysKHR pass. Variables initialized before using the query object and used after that may consume additional registers.

Consider thread group reordering to improve coherency. When using inline ray tracing from a compute shader, the default row major assignment of the dispatched thread groups to GPU for execution often does not result in optimal performance. Coherency of the memory accesses done by the thread groups simultaneously in execution on GPU can be improved by manually reordering the thread groups. For more information, see Parallel Shader Compilation for Ray Tracing Pipeline States.

Pipeline states

Consider one state object per ray generation shader. I recommend having a separate state object for each DispatchRays or vkCmdTraceRaysKHR call compiled with only the shaders required in that pass. It can help in optimizing the register consumption and allows the optimal setting of pipeline configuration values described later in this post.

Set MaxTraceRecursionDepth, MaxRecursionDepth, MaxPayloadSizeInBytes, and MaxAttributeSizeInBytes as small as possible. Setting these higher than necessary may have an unnecessarily negative performance impact. When using inline ray tracing within a DispatchRays or vkCmdTraceRaysKHR call, those ray-trace calls don’t count towards the maximum recursion depth.

Use SKIP_PROCEDURAL_PRIMITIVES, SKIP_AABBS, and SKIP_TRIANGLES whenever possible. These pipeline state flags allow simple but potentially effective optimizations in state compilation.

Consider shader collections for parallel compilation and sharing. (DXR) When you are managing many shaders, shader collections may allow multi-threaded compilation of state objects and sharing of compiled code between state objects. For more information, see Parallel Shader Compilation for Ray Tracing Pipeline States.

When automatic bind point assignment is needed, consider the compiler options. (DXR) By default, automatic bind point assignment for shader resources is not used when compiling shader libraries. If that is required, there are a couple useful compiler options. First, /auto-binding-space enables automatic bind point assignment in a given register space. Also, all functions not marked with the keyword static are considered library exports by default.

When using /auto-binding-space, resources accessed by any exported function consume bind points regardless of whether they are used in the final state object. To limit the bind point consumption to only the functions really needed, /exports can be used to limit the library exports.

Consider AddToStateObject for incremental building. It allows the incremental building of state objects based on existing objects, which can be useful when managing dynamic content with many shaders.

Manually manage the stack if applicable. Use the API’s query functions to determine the stack size required per shader and apply app-side knowledge about the call graph to reduce memory consumption and increase performance.

A good example is expensive reflection shaders shooting secondary shadow rays, which are known by the app to only use trivial hit shaders with low stack requirements. The driver can’t know this call graph in advance, so the default conservative stack size computation over-allocates memory.

Tools

Consider implementing a heatmap. To discover performance issues related to specific BLASes, or shading of specific geometries, NVIDIA offers a convenient API for implementing a heatmap for visualizing the processing cost of each pixel. This can be useful for improving the performance of your ray-tracing passes. For more information, see Profiling DXR Shaders with Timer Instrumentation.

Use NVIDIA Nsight Graphics for profiling and debugging. Learn more about inspecting acceleration structures, shader tables, and profiling ray-tracing passes.  

For more information about how to use Nsight Graphics most efficiently, see the related posts on the NVIDIA Developer Blog.

Consider updating to the latest version of the Microsoft Shader Compiler. (DXR) For new features and optimization, it’s often worthwhile to update to the latest available version of the Microsoft Shader Compiler.  

Categories
Misc

Digital Sculptor Does Heavy Lifting With Lightweight Mobile Workstation

As a professional digital sculptor, Marlon Nuñez is on a mission to make learning 3D art skills easier, smoother and more fun for all. And with the help of an NVIDIA RTX-powered Lenovo mobile workstation, he takes his 3D projects to the next level, wherever he goes. Nuñez is the art director and co-founder of…

The post Digital Sculptor Does Heavy Lifting With Lightweight Mobile Workstation appeared first on NVIDIA Blog.

Categories
Misc

Faster Text Classification with Naive Bayes and GPUs

Accelerating classification on sparse data with GPUs and a splash of Bayes.

Naive Bayes (NB) is a simple but powerful probabilistic classification technique that parallelizes well and can scale to datasets of massive size. 

If you have been working on text processing tasks in data science, you know that machine learning models can take a long time to train. Using GPU-accelerated computing on those models has often resulted in significant gains in training time, and NB classifiers are no exception.

By using CUDA-accelerated operations, we achieved a performance boost of 5–20x depending on the NB model used. A smart utilization of sparse data led to a 120x speedup for one of the models.

In this post, we present recent upgrades to the NB implementation in RAPIDS cuML and compare it to Scikit-learn’s implementation on the CPU. We provide benchmarks to demonstrate the performance benefits and walk through simple examples of each supported variant of the algorithm to help you determine which is best for your use case.

What is naive Bayes?

NB uses Bayes’ theorem (Figure 1) to model the conditional probability distribution shown below to predict a label or category (y) given some input features (x). In its simplest form, Bayes’ theorem computes the conditional probability using the joint probability of the features and possible labels together with the marginal probability of the features occurring across all possible labels.

Figure 1. Bayes’ theorem represents the probability of a label (y) resulting from a set of features (x) as a conditional probability. It is computed using the joint probability of each label occurring with the set of features and the marginal probability of the features occurring across all possible labels
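In symbols, the relationship shown in Figure 1 is the standard form of Bayes’ theorem, where the numerator is the joint probability of the features and a label and the denominator is the marginal probability of the features across all labels:

```latex
P(y \mid x) = \frac{P(x \mid y)\,P(y)}{P(x)} = \frac{P(x, y)}{\sum_{y'} P(x, y')}
```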

NB algorithms have been shown to work well on text classification use cases. They are often applied to tasks such as filtering spam emails; predicting categories and sentiment for tweets, web pages, blog posts, user ratings, and forum posts; or ranking documents and web pages. 

The NB algorithms simplify the conditional probability distribution by making the naive assumption that each feature (for example, each column in an input vector x) is statistically independent of all the other features. This naive assumption is what makes the algorithm so easy to parallelize. Also, the general approach of computing simple co-occurrence probabilities between features and class labels enables the model to be trained incrementally, supporting datasets that don’t fit into memory.

NB comes in several variants, which make certain assumptions about the joint distribution or the features co-occurring with respect to various class labels.

Naive Bayes assumptions

To predict classes for unseen sets of input features, different assumptions about the joint distribution enable several different variants of the algorithm, which model the distribution of features by learning parameters for different probability distributions.

Table 1 models a simple document/term matrix that could come from a collection of text documents. The terms along the columns represent a vocabulary. A simple vocabulary might break a document into the set of unique words that occur in total across all the documents. 

            I   love   dogs   hate   and   knitting   is   my   hobby   session
Doc 1       1   1      1
Doc 2       1          1      1      1     1
Doc 3                                1     1          1    2    1       1
Table 1. A document/term matrix containing documents along the rows and the vocabulary terms that occur in each document along the columns

In Table 1, each element could be a count, such as what is shown here, a 0 or 1 to denote the existence of a feature, or some other value such as a ratio, spread, or measure of dispersion for each term occurring across the entire set of documents.

In practice, a sliding window is often run across either the entire document or the terms, dividing them further into small chunks of word sequences, known as n-grams. For the first document in Table 1, the 2-grams (or bigrams) would be “I love” and “love dogs”. It’s common for the vocabularies in these types of datasets to grow significantly large and become sparse. Preprocessing steps are often executed on the vocabulary to filter noise; for example, by removing common terms that appear in most documents.

The process of converting a document into a document-term matrix is known as vectorization. There are tools to accelerate this process, such as the CountVectorizer, TfidfVectorizer, or HashingVectorizer estimator objects in RAPIDS cuML.
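A minimal sketch of GPU vectorization with cuML is shown below, assuming the text already lives in a cuDF string column; the toy corpus echoes Table 1 and the parameters are illustrative.

```python
import cudf
from cuml.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus; in the benchmarks this would be the headline column of the dataset.
docs = cudf.Series([
    "I love dogs",
    "I hate dogs and knitting",
    "knitting is my hobby and my session",
])

# Term-count document/term matrix (as in Table 1).
counts = CountVectorizer().fit_transform(docs)

# TF-IDF weighted matrix, here with unigrams through 3-grams.
tfidf = TfidfVectorizer(ngram_range=(1, 3)).fit_transform(docs)

print(counts.shape, tfidf.shape)  # sparse matrices living in GPU memory
```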

Multinomial and Bernoulli distributions

Table 1 represents a set of documents, which have been vectorized into term counts such that each element in the resulting matrix represents the number of times a particular word appears in its corresponding document. This simple representation can be effective for classification tasks.

Because the features represent a frequency distribution, the multinomial naive Bayes variant can effectively model the joint distribution of the features and their associated classes with a multinomial distribution. 

The frequency distributions for each term can be enhanced by incorporating a measure of dispersion, such as term frequency–inverse document frequency (TF-IDF), which takes into account the number of documents each term occurs in. This can significantly improve performance by giving more weight to the terms that appear in fewer documents, thus improving their discriminative ability.

While the multinomial distribution works great when used directly with term frequencies, it has also been shown to have great performance on fractional values, like TF-IDF values. The multinomial naive Bayes variant covers a great number of use cases and so tends to be the most widely used. A similar variant is Bernoulli naive Bayes, which models the simple occurrence of each term rather than their frequency, resulting in a matrix of 0s and 1s (a Bernoulli distribution).

Unequal class distributions

It’s common to find imbalanced datasets in the real world. For example, you might have limited data samples for spam and malicious activity but an abundance of normal and benign samples.

The complement naive Bayes variant helps reduce the effects of unequal class distributions by using the complement of the joint distribution for each class during training, for example, the number of times a feature occurred in samples from all other classes.

Categorical distributions

You could also create bins for each of your features, maybe by quantizing some frequencies into a number of buckets such that frequencies of 0-5 go into bucket 0, frequencies of 6-10 go into bucket 1, and so on.

Another option could be to merge several terms together into a single feature, maybe by creating buckets for “animals” and “holidays,” where “animals” might have three buckets, zero for feline, one for canine, and two for rodents. “Holidays” might have two buckets, zero for personal holidays such as a birthday or wedding anniversary, and one for federal holidays.

The categorical naive Bayes variant assumes that the features follow a categorical distribution. The naive assumption works well for this case because it allows each feature to have a different set of categories, and it models the joint distribution using, you guessed it, a categorical distribution.

Continuous distributions

Finally, the Gaussian naive Bayes variant works great when features are continuous and it can be assumed that the distribution of features in each class can be modeled with Gaussian distributions, that is, with a simple mean and variance.

While this variant might demonstrate good performance on some datasets after TF-IDF normalization, it can also be useful on general machine learning datasets.

Algorithm       Multinomial           Bernoulli              Complement                          Categorical                 Gaussian
Type of input   Frequencies, counts   Boolean occurrence     Counts                              Categorical                 Continuous
Advantage       Supports count data   Supports binary data   Reduces impact of imbalanced data   Supports categorical data   Supports general continuous data
Table 2. Comparison of the different NB algorithms

Real-world end-to-end examples

To demonstrate the benefits of each algorithm variant, as outlined in Table 2, we step through example notebooks of each algorithm variant. For a comprehensive end-to-end notebook that includes all the examples, see news_aggregator_a100.ipynb.

We used the News Aggregator dataset to demonstrate the performance of the NB variants. The dataset is available publicly from Kaggle and consists of 422K news headlines taken from multiple news sources. Each headline is labeled with one of four possible labels: business, science and technology, entertainment, and health. The data is loaded directly onto the GPU using RAPIDS cuDF and continues through preprocessing steps specific to each NB variant.

Gaussian naive Bayes

Starting with Gaussian naive Bayes, we ran a TF-IDF vectorizer to transform the text data into real-valued vectors that can be used for training.

By specifying ngram_range=(1,3), we indicated that we would learn on single words as well as 2- and 3-grams. This significantly increases the number of terms or features to learn, from 15K words to 1.8M combinations. Because most terms do not occur in most headlines, the resulting matrix is sparse, with many values equal to zero. cuML supports special structures to represent data like this.

One additional benefit of NB classifiers is that they can be trained incrementally using a partial_fit method on the Estimator object. This technique is suited for massive datasets that might not fit into memory all at once or which must be distributed across multiple GPUs.

Our first example demonstrates incremental training using Gaussian naive Bayes by splitting the data into multiple chunks after preprocessing into continuous features with TF-IDF. The cuML version of Gaussian naive Bayes is 21x faster than Scikit-learn for training and 72x faster for inference.
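The pattern looks roughly like the following condensed sketch; the data here is random stand-in data rather than the actual TF-IDF features, the chunk size is arbitrary, and the classes argument follows the Scikit-learn convention that cuML’s estimators mirror.

```python
import cupy as cp
from cuml.naive_bayes import GaussianNB

# Toy stand-ins for the features and the four category labels;
# in the notebook these come from the vectorization step above.
X = cp.random.rand(8000, 64, dtype=cp.float32)
y = cp.random.randint(0, 4, size=8000)

model = GaussianNB()
classes = cp.asarray([0, 1, 2, 3])

# Feed the data in chunks, as if streaming a dataset too large for memory.
chunk = 2000
for start in range(0, X.shape[0], chunk):
    end = start + chunk
    model.partial_fit(X[start:end], y[start:end], classes=classes)

print(model.predict(X[:5]))
```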

Bernoulli naive Bayes

The next example demonstrates Bernoulli naive Bayes, without incremental training, using binary features that represent the presence or absence of each term. The CountVectorizer object can be used to accomplish this with the setting binary=True. We found a 14x speedup over Scikit-learn in this example.
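
A hedged sketch of that setup on toy data, assuming cuML's CountVectorizer and BernoulliNB follow the scikit-learn API (the binary=True flag is the one mentioned above):

import cudf
import cupy as cp
from cuml.feature_extraction.text import CountVectorizer
from cuml.naive_bayes import BernoulliNB

headlines = cudf.Series([
    "stocks rally as markets rebound",
    "researchers publish new cancer study",
    "celebrity couple announces engagement",
    "tech giant unveils latest smartphone",
])
y = cp.array([0, 3, 2, 1])  # toy labels: business, health, entertainment, sci/tech

# binary=True keeps only the presence or absence of each term, the kind of
# 0/1 feature that Bernoulli naive Bayes models directly.
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(headlines)

model = BernoulliNB()
model.fit(X, y)
print(model.predict(X))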

Multinomial naive Bayes

Multinomial naive Bayes is the most versatile and widely used variant, as demonstrated in the following example. We used the TF-IDF vectorizer instead of CountVectorizer to achieve a 5x speedup over Scikit-learn.
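
A compact sketch of the same pattern (not the notebook itself), reusing the toy headlines and labels from the Bernoulli sketch above and assuming the cuML estimators mirror scikit-learn's:

from cuml.feature_extraction.text import TfidfVectorizer
from cuml.naive_bayes import MultinomialNB

# headlines and y are the toy cudf Series and cupy labels defined earlier.
X = TfidfVectorizer().fit_transform(headlines)
model = MultinomialNB()
model.fit(X, y)
print(model.predict(X))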

Complement naive Bayes

We demonstrated the power of complement naive Bayes using CountVectorizer and showed that it yielded a better classification score than both the Bernoulli and multinomial NB variants on our imbalanced dataset.
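
A hedged sketch of the complement variant on the same toy data, assuming cuml.naive_bayes.ComplementNB mirrors scikit-learn's ComplementNB:

from cuml.feature_extraction.text import CountVectorizer
from cuml.naive_bayes import ComplementNB

# Raw term counts, which complement NB reweights using statistics from all
# other classes, reducing the effect of imbalanced class sizes.
X = CountVectorizer().fit_transform(headlines)  # toy headlines from above
model = ComplementNB()
model.fit(X, y)
print(model.predict(X))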

Categorical naive Bayes

Last but definitely not least is an example of categorical naive Bayes. We vectorized the headlines using k-means, along with a model previously trained on another NB variant, to group similar terms into the same categories based on their contribution to the resulting classes.

We found a 126x speedup over Scikit-learn to train a model with 315K news headlines and 23x speedup to perform inference and compute the model’s accuracy.

Benchmarks

The charts in Figure 2 compare the performance of NB training and inference between RAPIDS cuML and Scikit-learn for all of the variants outlined in this post.

The benchmarks were performed on an a2-highgpu-8g Google Cloud Platform (GCP) instance provisioned with an NVIDIA Tesla A100 GPU and 96 Intel Cascade Lake vCPUs at 2.2 GHz.

Charts containing performance comparisons between RAPIDS cuML and Scikit-learn for the Naive Bayes variants outlined in this post. cuML is substantially faster than Scikit-learn during both training and testing phases.
Figure 2. Performance comparison between Scikit-learn (blue) and cuML (green)

GPU-accelerated naive Bayes

We were able to implement all the NB variants right in Python with CuPy, which is a GPU-accelerated near-drop-in replacement for NumPy and SciPy. CuPy also provides you with the capability to write custom CUDA kernels in Python. It uses the just-in-time (JIT) compilation abilities of NVRTC to compile and execute them on the GPU while the Python application is running.

At the core of all the NB variants lie two simple primitives, written using CuPy's JIT, to sum and count the features for each class.
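
To illustrate the flavor of such primitives (this is not the cuML kernel itself, just a minimal example of compiling and launching a custom CUDA kernel from Python with CuPy), the snippet below counts how many samples fall into each class with a tiny runtime-compiled kernel:

import cupy as cp

count_kernel = cp.RawKernel(r'''
extern "C" __global__
void count_classes(const int* labels, int* counts, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        atomicAdd(&counts[labels[i]], 1);   // one increment per sample, per class
    }
}
''', 'count_classes')

labels = cp.random.randint(0, 4, size=100_000, dtype=cp.int32)  # 4 toy classes
counts = cp.zeros(4, dtype=cp.int32)

threads = 256
blocks = (labels.size + threads - 1) // threads
count_kernel((blocks,), (threads,), (labels, counts, cp.int32(labels.size)))
print(counts)  # number of samples observed for each class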

When a single document-term matrix grows too large to process on a single GPU, the Dask library can make use of the incremental training feature to spread the processing over multiple GPUs and multiple nodes. Currently, the multinomial variant can be distributed with Dask in cuML.
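
A heavily hedged sketch of what the distributed setup can look like; the module path cuml.dask.naive_bayes and the constructor usage are assumptions based on the cuml.dask convention, and X_dask and y_dask stand in for a distributed document-term matrix and labels produced by an upstream step that is not shown here.

from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from cuml.dask.naive_bayes import MultinomialNB  # assumed module path

cluster = LocalCUDACluster()  # one Dask worker per local GPU
client = Client(cluster)

# X_dask: distributed sparse term counts; y_dask: distributed labels.
model = MultinomialNB()
model.fit(X_dask, y_dask)
preds = model.predict(X_dask)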

Conclusion

NB algorithms should be in every data scientist's toolkit. With RAPIDS cuML, you can accelerate your implementations of NB on the GPU without dramatically changing your code. These powerful and fundamental algorithms, combined with the speedup of cuML, provide everything you need to perform classification on extremely large or sparse datasets.

If you think that RAPIDS cuML can help accelerate your data science and machine learning workflows or is already doing so, then leave a comment because we’d love to hear about it.

As always, visit the rapidsai GitHub repo and let us know how we can help you. You can also follow us on Twitter at @rapidsai.

If you are new to RAPIDS, be sure to check out the Getting Started resources to get up and running quickly.

Categories
Misc

Accelerating Cloud-Native Applications at China Mobile Bigcloud


Cloud computing is designed to be agile and resilient to deliver additional value for businesses. China Mobile (CMCC), one of China’s largest telecom operators and cloud services providers, offers precisely this with its Bigcloud public cloud offering.

Bigcloud provides PaaS and SaaS services tailored to the needs of enterprise cloud and hybrid-cloud solutions for mission-critical applications. CMCC understands that businesses rely on their networking and communication infrastructure to stay competitive in an increasingly always-on, digital world.

When they started experiencing enormous demand for their cloud-native services, CMCC turned to network abstraction and virtualization through Open vSwitch (OVS) to automate their network and gain dynamic control over it, helping to handle the growing demand.

However, maintaining network performance due to the added east-west network traffic became a serious challenge.

Virtual sprawl produced an explosion of east-west traffic that created increased network congestion.
Figure 1. Bigcloud networking solution

Identifying the network challenges

With the massive adoption of cloud services, CMCC experienced enormous growth in its virtualization environment. This virtual sprawl produced an explosion of east-west traffic between servers within their data centers.

Due to the rise in network traffic, they also saw an increase in network congestion, causing higher jitter and latency and hindering overall network throughput and application performance. This resulted in insufficient effective bandwidth, leaving them unable to keep up with the large number of network flows during peak business times.

As CMCC investigated the cause of these challenges, they determined that the root of these problems stemmed from four main issues with the Open vSwitch:

  • Inefficient vSwitch capacity for VXLAN encapsulation and decapsulation rule handling due to the server CPUs being tasked with both application and networking requests.
  • Poor performance of kernel-based vSwitch forwarding caused by frequent context switching between user space, kernel space, and memory, which created data copying overhead.
  • DPDK-based vSwitch forwarding created competition for server CPU resources, which were already severely limited.
  • Limited vSwitch flow rule capacity, which lowered throughput due to excessive packet loss, jitter, and latency.

These challenges became a bottleneck and prevented applications from receiving the high network traffic throughput they required at the lowest possible latency.

While OVS allows packets and flow rules to be forwarded between hosts as well as to the outside world, it is CPU-intensive: it consumes CPU cores that should be used for customer applications, prevents full utilization of available bandwidth, and affects overall system performance.

CMCC wanted to ensure network application response times stayed low, that delivered bandwidth was consistent, and that they were able to meet peak demands.

CMCC used OVS and OVS DPDK to support a highly efficient SDN network.
Figure 2. CMCC faced challenges in their desire to support both OVS and OVS DPDK for their Bigcloud vSwitch Forwarding

CMCC turned to two experts in this area, NVIDIA and Nokia, who jointly provided a highly efficient, software-defined networking (SDN) solution. The solution combines the offloads, performance, and efficiency of NVIDIA ConnectX SmartNIC and the NVIDIA BlueField data processing unit (DPU) technology with the agility, elasticity, and automation of the Nuage Networks Virtualized Services Platform (VSP).

Together, NVIDIA and Nuage offload the computationally intensive packet processing operations associated with OVS and free costly compute resources so they can run applications instead of SDN tasks.

SmartNIC- and DPU-powered accelerated networking

The NVIDIA ConnectX series of SmartNICs and BlueField series of DPUs offer NVIDIA Accelerated Switching and Packet Processing (ASAP2) technology, which runs the OVS data plane within the NIC hardware while leaving the OVS control plane intact and completely transparent to applications.

ASAP2 has two modes. In the first mode, the hardware data plane is built on top of SR-IOV virtual functions (VFs) so that each network VF is connected directly to its corresponding VM.

An alternate approach that is also supported is VirtIO acceleration through virtual data path acceleration (vDPA). VirtIO gives virtual machines native access to hardware devices such as network adapters, while vDPA establishes the connection to the VM with the OVS data plane built between the network device and the standard VirtIO driver through device queues called Virtqueues. This enables seamless integration between VMs and accelerated networking: the control plane is managed on the host, while the VirtIO data plane is accelerated by SmartNIC hardware.

BlueField DPUs provide hardware offload and acceleration to reduce network congestion
Figure 3. vDPA uses SmartNIC hardware to offload and accelerate traffic for each VM.

Seamless integration of Nuage Networks SDN with NVIDIA vDPA technology

Nuage Networks' contribution to the solution is its Virtualized Services Platform (VSP). VSP performs the virtual routing and switching and is the distributed forwarding module based on Open vSwitch, serving as a virtual endpoint for network services. VSP immediately recognizes any changes in the compute environment, triggering instantaneous policy-based responses in network connectivity and configuration to ensure application performance.

As an overlay SDN solution, Nuage Networks' VSP uses tunneling protocols such as VXLAN to encapsulate the original payload.

Because standard NICs don’t recognize new packet header formats, traditionally all packet manipulation operations must be performed by the CPU, potentially over-taxing the CPU and causing significant network I/O performance degradation, especially as server I/O speeds increase.

For this reason, overlay network processing needs to be offloaded to an I/O-specific hardware adapter that can handle VXLAN, like ConnectX or BlueField, to reduce CPU strain.

Performance advantages of vDPA

ASAP2 uses hardware acceleration to increase performance compared to OVS DPDK.
Figure 4. Performance comparison of OVS DPDK in software versus ASAP2 vDPA hardware acceleration.

China Mobile decided to go with the VirtIO solution for maximum compatibility, and they wanted the ability to choose either straight OVS or OVS DPDK, depending on the use case. Working together, Nuage Networks and NVIDIA delivered an SDN solution for China Mobile's public cloud that is agile, scalable, and hardware-accelerated and that supports both types of network virtualization.

The joint solution using Nuage Networks VSP with NVIDIA hardware-accelerated vDPA delivered significantly faster performance. The network throughput increased by 1.5x, the packet forwarding rate was 3x faster, and the Apache benchmark supported 7x more requests per second, compared to running OVS-DPDK in software alone.

Learn more

For more information about the differentiation between OVS offload technologies, why CMCC decided to use the VirtIO/vDPA solution, and how NVIDIA can help you improve efficiencies in cloud-native technologies, see the Turbocharge Cloud-Native Application with Virtual Data Plane Accelerated Networking joint GTC session between CMCC, Nuage Networks, and NVIDIA.

Categories
Misc

What’s New in NVIDIA AI Enterprise 2.1


Today, NVIDIA announced general availability of NVIDIA AI Enterprise 2.1. This latest version of the end-to-end AI and data analytics software suite is optimized, certified, and supported for enterprises to deploy and scale AI applications across bare metal, virtual, container, and cloud environments. 

Release highlights: New containers, public cloud support

The NVIDIA AI Enterprise 2.1 release offers advanced data science with the latest NVIDIA RAPIDS and low-code AI model development using the most recent release of the NVIDIA TAO Toolkit.

Making enterprise AI even more accessible across hybrid or multi-cloud environments, AI Enterprise 2.1 adds support for Red Hat OpenShift running in the public cloud and for the new Microsoft Azure NVads A10 v5 series, the first NVIDIA virtual GPU instances offered in the public cloud, enabling affordable GPU sharing.

Support for the latest AI frameworks

NVIDIA AI Enterprise enables you to stay current with the latest AI tools for development and deployment, along with enterprise support and regular updates from NVIDIA. Support will continue for those relying on earlier versions of NVIDIA AI frameworks, ensuring the flexibility to manage infrastructure updates.

NVIDIA TAO Toolkit 22.05

The NVIDIA TAO Toolkit is a low-code solution of NVIDIA TAO, a framework that enables developers to create custom, production-ready models to power speech and vision AI applications.

The latest version of the TAO Toolkit is now supported through NVIDIA AI Enterprise, with new key features including REST APIs integration, pre-trained weights import, TensorBoard integration, and new pre-trained models.

NVIDIA RAPIDS 22.04

The RAPIDS 22.04 release provides more support for data workflows through the addition of new models, techniques, and data processing capabilities across all the NVIDIA data science libraries. 

Red Hat OpenShift support in the public cloud 

Red Hat OpenShift, the industry’s leading enterprise Kubernetes platform with integrated DevOps capabilities, is now certified and supported for the public cloud with NVIDIA AI Enterprise, in addition to bare metal and VMware vSphere-based deployments. This enables a standardized AI workflow in a Kubernetes environment to scale across a hybrid-cloud environment.

Azure NVads A10 v5 support 

The Azure NVads A10 v5 series, powered by NVIDIA A10 Tensor Core GPUs, offers unprecedented GPU scalability and affordability with fractional GPU sharing for flexible GPU sizes ranging from one-sixth of an A10 GPU to two full A10 GPUs.

As part of the supported platforms, the NVads A10 v5 instances are certified with NVIDIA AI Enterprise to deliver optimized performance for deep learning inferencing, maximizing the utility and cost efficiency of at-scale deployments in the cloud.

Domino Data Lab Enterprise MLOps Platform Certification

NVIDIA AI Accelerated partner Domino Data Lab’s enterprise MLOps platform is now certified for NVIDIA AI Enterprise. This level of certification mitigates deployment risks and ensures reliable, high-performance integration with the NVIDIA AI platform.

This partnership pairs the Enterprise MLOps benefits of workload orchestration, self-serve infrastructure, and collaboration with cost-effective scale from virtualization on mainstream accelerated servers.

Try NVIDIA AI Enterprise 

NVIDIA LaunchPad provides organizations around the world with immediate, short-term access to the NVIDIA AI Enterprise software suite in a private accelerated computing environment that includes hands-on labs.

Experience the latest NVIDIA AI frameworks and tools, running on NVIDIA AI Enterprise, through new NVIDIA LaunchPad labs. Hosted on NVIDIA-accelerated infrastructure, the labs enable enterprises to speed up the development and deployment of modern, data-driven applications and quickly test and prototype the entire AI workflow on the same complete stack available for deployment.

 Check out these new LaunchPad labs for NVIDIA AI Enterprise 2.1:

  • Multi-Node Training for Image Classification on VMware vSphere with Tanzu
  • Deploy a Fraud Detection XGBoost Model using NVIDIA Triton
  • Develop a Custom Object Detection Model with NVIDIA TAO Toolkit and Deploy with NVIDIA DeepStream
Categories
Misc

Integrating NVIDIA Reflex: Q&A with Pathea Head of Technology Jingyang Xu


NVIDIA spoke with Chief Wizard of Pathea, Jingyang Xu, about himself, his company, and the process of implementing NVIDIA Reflex in My Time at Sandrock, the studio’s latest release.

For those who may not know you, could you tell us about yourself?

My name is Jingyang Xu. I am the Chief Wizard for Pathea. In other words, I’m the Head of Technology. I’ve had various jobs in the software industry for the last 23 years and recently joined Pathea.

Tell us about Pathea and the success of the company thus far?

Pathea has developed a few games that the readers may have heard of, including Planet Explorers and My Time at Portia. Currently, we are working on a few titles, including My Time at Sandrock, part of the “My Time” series. Like Portia, you take up the role of a builder in a post-apocalyptic world. We’ve gone early access on Steam, WeGame, Epic, and Bilibili.

Picture of a farm with blooming crops.
Figure 1. A farm in My Time at Sandrock

Why did you decide to integrate Reflex?

We are always open to the latest and greatest in gaming technology out there to provide the best experience for the players. So, when NVIDIA kept telling us how great Reflex is, we went ahead with it.

What challenge were you looking to solve with Reflex?

A few spots in the game require quick reflexes to complete the mission with the best result. So, we hoped that Reflex would allow the players to have great fun doing those missions.

How long did it take for you to get Reflex up and running in My Time at Sandrock?

NVIDIA provided us with a plugin for Unity; it only took a couple of hours to get Reflex up and running.

Picture of a girl and a man sitting down, having a conversation.
Figure 2. Gameplay of My Time at Sandrock

How difficult was the Reflex integration process with Unity?

The hard part is locating the issue. Our testing didn't initially find any problems. However, there was an issue with some missing DLLs on certain players' machines, which made those players' experiences less than satisfactory. NVIDIA responded quickly, fixed the issue, and helped us and Sandrock overcome it.

Any surprises or unexpected challenges?

We were surprised by how quickly NVIDIA responded when we had the issues. When we were trying to employ Reflex, we talked with them, and they gave us many solutions and suggestions to help us. The turnaround was super quick! More than that, NVIDIA spent time helping us to test to ensure that it runs well.

How has Reflex affected gameplay?

We have tested the performance and the results are very positive. We saw a 20%-30% increase in input responsiveness with our test. You can easily find the speed changes, which make us feel more confident that the gaming experience will be much better than before.

Any tips or lessons learned for other developers looking into Reflex?

We think it’s worth trying, and don’t worry about any problems, because NVIDIA will help you fix them. Just keep regular communications with NVIDIA. If it fits your gameplay, we believe it shows promise.

Do you plan on integrating NVIDIA Reflex in future titles?

Definitely. We have a couple of games in the pipeline that I think Reflex will be great for in the future. We believe Reflex is an excellent way to reduce game latency. It has been proven in Sandrock, and we believe it can be used in our other games to get the same results. Based on that, Reflex is very useful for making better games.

More resources

For more information, see Pathea.net. Discover the full list of NVIDIA Reflex Compatible Products and, for other NVIDIA resources, see Game Development.

Categories
Offsites

Training Generalist Agents with Multi-Game Decision Transformers

Current deep reinforcement learning (RL) methods can train specialist artificial agents that excel at decision-making on various individual tasks in specific environments, such as Go or StarCraft. However, little progress has been made to extend these results to generalist agents that would not only be capable of performing many different tasks, but also of operating in a variety of environments with potentially distinct embodiments.

Looking across recent progress in the fields of natural language processing, vision, and generative models (such as PaLM, Imagen, and Flamingo), we see that breakthroughs in making general-purpose models are often achieved by scaling up Transformer-based models and training them on large and semantically diverse datasets. It is natural to wonder, can a similar strategy be used in building generalist agents for sequential decision making? Can such models also enable fast adaptation to new tasks, similar to PaLM and Flamingo?

As an initial step to answer these questions, in our recent paper “Multi-Game Decision Transformers” we explore how to build a generalist agent to play many video games simultaneously. Our model trains an agent that can play 41 Atari games simultaneously at close-to-human performance and that can also be quickly adapted to new games via fine-tuning. This approach significantly improves upon the few existing alternatives to learning multi-game agents, such as temporal difference (TD) learning or behavioral cloning (BC).

A Multi-Game Decision Transformer (MGDT) can play multiple games at desired level of competency from training on a range of trajectories spanning all levels of expertise.

Don’t Optimize for Return, Just Ask for Optimality
In reinforcement learning, reward refers to the incentive signals that are relevant to completing a task, and return refers to cumulative rewards in a course of interactions between an agent and its surrounding environment. Traditional deep reinforcement learning agents (DQN, SimPLe, Dreamer, etc.) are trained to optimize decisions to achieve the optimal return. At every time step, an agent observes the environment (some also consider the interactions that happened in the past) and decides what action to take to help itself achieve a higher return magnitude in future interactions.

In this work, we use Decision Transformers as our backbone approach to training an RL agent. A Decision Transformer is a sequence model that predicts future actions by considering past interactions between an agent and the surrounding environment, and (most importantly) a desired return to be achieved in future interactions. Instead of learning a policy to achieve high return magnitude as in traditional reinforcement learning, Decision Transformers map diverse experiences, ranging from expert-level to beginner-level, to their corresponding return magnitude during training. The idea is that training an agent on a range of experiences (from beginner to expert level) exposes the model to a wider range of variations in gameplay, which in turn helps it extract useful rules of gameplay that allow it to succeed under any circumstance. So during inference, the Decision Transformer can achieve any return value in the range it has seen during training, including the optimal return.

But, how do you know if a return is both optimal and stably achievable in a given environment? Previous applications of Decision Transformers relied on customized definitions of the desired return for each individual task, which required manually defining a plausible and informative range of scalar values that are appropriately interpretable signals for each specific game — a task that is non-trivial and rather unscalable. To address this issue, we instead model a distribution of return magnitudes based on past interactions with the environment during training. At inference time, we simply add an optimality bias that increases the probability of generating actions that are associated with higher returns.
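
As a toy illustration of the optimality bias (not the paper's exact formulation), one can reweight the model's predicted distribution over discretized returns toward higher values before sampling a target return:

import numpy as np

def sample_target_return(return_probs, return_bins, kappa=10.0, rng=None):
    """Up-weight higher-return bins by exp(kappa * normalized return) and sample."""
    rng = rng or np.random.default_rng()
    biased = return_probs * np.exp(kappa * return_bins / return_bins.max())
    biased /= biased.sum()
    return rng.choice(return_bins, p=biased)

# Example: the model mostly predicts mediocre returns; the bias shifts sampling upward.
bins = np.array([0.0, 25.0, 50.0, 75.0, 100.0])
probs = np.array([0.10, 0.40, 0.30, 0.15, 0.05])
print(sample_target_return(probs, bins))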

To more comprehensively capture spatial-temporal patterns of agent-environment interactions, we also modified the Decision Transformer architecture to consider image patches instead of a global image representation. Patches allow the model to focus on local dynamics, which helps model game specific information in further detail.

These pieces together give us the backbone of Multi-Game Decision Transformers:

Each observation image is divided into a set of M patches of pixels, which are denoted O. Return R, action a, and reward r follow these image patches in each input causal sequence. A Decision Transformer is trained to predict the next input (except for the image patches) to establish causality.
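
A schematic sketch of that interleaved token layout (names and data structures here are illustrative, not the paper's code):

def build_input_sequence(trajectory, num_patches):
    """Flatten a trajectory into [R, O_1..O_M, a, r] tokens per timestep."""
    tokens = []
    for step in trajectory:  # each step: {"return": R, "patches": [...], "action": a, "reward": r}
        tokens.append(("R", step["return"]))
        tokens.extend(("O", patch) for patch in step["patches"][:num_patches])
        tokens.append(("a", step["action"]))
        tokens.append(("r", step["reward"]))
    # The transformer is trained to predict each next token except the patch tokens.
    return tokens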

Training a Multi-Game Decision Transformer to Play 41 Games at Once
We train one Decision Transformer agent on a large (~1B) and broad set of gameplay experiences from 41 Atari games. In our experiments, this agent, which we call the Multi-Game Decision Transformer (MGDT), clearly outperforms existing reinforcement learning and behavioral cloning methods — by almost 2 times — on learning to play 41 games simultaneously and performs near human-level competency (100% in the following figure corresponds to the level of human gameplay). These results hold when comparing across training methods in both settings where a policy must be learned from static datasets (offline) as well as those where new data can be gathered from interacting with the environment (online).

Each bar is a combined score across 41 games, where 100% indicates human-level performance. Each blue bar is from a model trained on 41 games simultaneously, whereas each gray bar is from 41 specialist agents. Multi-Game Decision Transformer achieves human-level performance, significantly better than other multi-game agents, even comparable to specialist agents.

This result indicates that Decision Transformers are well-suited for multi-task, multi-environment, and multi-embodiment agents.

A concurrent work, “A Generalist Agent”, shows a similar result, demonstrating that large transformer-based sequence models can memorize expert behaviors very well across many more environments. In addition, their work and our work have nicely complementary findings: They show it’s possible to train across a wide range of environments beyond Atari games, while we show it’s possible and useful to train across a wide range of experiences.

In addition to the performance shown above, empirically we found that MGDT trained on a wide variety of experience is better than MGDT trained only on expert-level demonstrations or simply cloning demonstration behaviors.

Scaling Up Multi-Game Model Size to Achieve Better Performance
Arguably, scale has become the main driving force in many recent machine learning breakthroughs, and it is usually achieved by increasing the number of parameters in a transformer-based model. Our observation on Multi-Game Decision Transformers is similar: the performance increases predictably with larger model size. In particular, its performance appears to have not yet hit a ceiling, and compared to other learning systems, performance gains are more significant with increases in model size.

Performance of Multi-Game Decision Transformer (shown by the blue line) increases predictably with larger model size, whereas other models do not.

Pre-trained Multi-Game Decision Transformers Are Fast Learners
Another benefit of MGDTs is that they can learn how to play a new game from very few gameplay demonstrations (which don’t need to all be expert-level). In that sense, MGDTs can be considered pre-trained models capable of being fine-tuned rapidly on small new gameplay data. Compared with other popular pre-training methods, it clearly shows consistent advantages in obtaining higher scores.

Multi-Game Decision Transformer pre-training (DT pre-training, shown in light blue) demonstrates consistent advantages over other popular models in adaptation to new tasks.

Where Is the Agent Looking?
In addition to the quantitative evaluation, it’s insightful (and fun) to visualize the agent’s behavior. By probing the attention heads, we find that the MGDT model consistently places weight in its field of view to areas of the observed images that contain meaningful game entities. We visualize the model’s attention when predicting the next action for various games and find it consistently attends to entities such as the agent’s on screen avatar, agent’s free movement space, non-agent objects, and key environment features. For example, in an interactive setting, having an accurate world model requires knowing how and when to focus on known objects (e.g., currently present obstacles) as well as expecting and/or planning over future unknowns (e.g., negative space). This diverse allocation of attention to many key components of each environment ultimately improves performance.

Here we can see the amount of weight the model places on each key asset of the game scene. Brighter red indicates more emphasis on that patch of pixels.

The Future of Large-Scale Generalist Agents
This work is an important step in demonstrating the possibility of training general-purpose agents across many environments, embodiments, and behavior styles. We have shown the benefit of increased scale on performance and the potential with further scaling. These findings seem to point to a generalization narrative similar to other domains like vision and language — we look forward to exploring the great potential of scaling data and learning from diverse experiences.

We look forward to future research towards developing performant agents for multi-environment and multi-embodiment settings. Our code and model checkpoints can soon be accessed here.

Acknowledgements
We’d like to thank all remaining authors of the paper, including Igor Mordatch, Ofir Nachum, Mengjiao Yang, Lisa Lee, Daniel Freeman, Sergio Guadarrama, Ian Fischer, Eric Jang, and Henryk Michalewski.

Categories
Misc

Performance Boosts and Enhanced Features in New Nsight Graphics, Nsight Aftermath Releases


Nsight Graphics 2022.3 and Nsight Aftermath 2022.2 have just been released and are now available to download. 

Nsight Graphics 2022.3

The Nsight Graphics 2022.3 release focuses on performance gains, bug fixes, and Vulkan improvements.

Performance for the Ray Tracing Acceleration Structure Viewer has improved by up to 20x in some complex scenes, thanks to better occlusion culling. Additionally, the viewer received improved handling of large instance counts to increase performance and reduce memory usage in scenes with duplicate geometry.

With the new VK_KHR_graphics_pipeline_library extension, your Vulkan application can now precompile shaders and link them at runtime at a substantially reduced cost. This is important because large 3D graphics applications such as games utilize complex algorithms that result in a large number of shaders. 

These algorithms often require different permutations of the shaders to account for different effects or lighting environments. The end result is thousands or even hundreds of thousands of shaders that, in many cases, are compiled at runtime. This can result in mid-frame stuttering, which negatively impacts the user experience.

Download Nsight Graphics 2022.3 >>

Nsight Aftermath 2022.2

In addition to the great improvements with structure viewer and shaders in Nsight Graphics, the Nsight Aftermath 2022.2 release enhances your ability to find the root cause of GPU crashes on a user’s system. 

GPU shaders make frequent accesses to memory, which all go through a dedicated hardware unit called the MMU. Nsight Aftermath 2022.2 adds enhanced MMU fault correlation which provides the line of shader source code that initiated the memory request from the shader units.

In the case where the fault is caused by a memory write with no outstanding dependencies, the shader unit would have retired the warp, leaving no contextual data to help in the debugging process. A new (debugging-only) setting in the API addresses this, preventing the shader units from retiring a warp while there is an outstanding instruction with the potential for an MMU fault. 

Nsight Aftermath helps you locate GPU crashes so that you can ship fast and stable 3D graphics applications. Look for even better correlation of GPU crashes in future releases, so you can find exactly where a crash occurred in your code.

Download Nsight Aftermath 2022.2 >>

Additional resources

Want to help us build better tools for you? Share your thoughts with this Nsight Graphics survey that takes less than one minute to complete.