Categories
Offsites

Scaling Vision with Sparse Mixture of Experts

Advances in deep learning over the last few decades have been driven by a few key elements. With a small number of simple but flexible mechanisms (i.e., inductive biases such as convolutions or sequence attention), increasingly large datasets, and more specialized hardware, neural networks can now achieve impressive results on a wide range of tasks, such as image classification, machine translation, and protein folding prediction.

However, the use of large models and datasets comes at the expense of significant computational requirements. Yet, recent works suggest that large model sizes might be necessary for strong generalization and robustness, so training large models while limiting resource requirements is becoming increasingly important. One promising approach involves the use of conditional computation: rather than activating the whole network for every single input, different parts of the model are activated for different inputs. This paradigm has been featured in the Pathways vision and recent work on large language models, but it has not been well explored in the context of computer vision.

In “Scaling Vision with Sparse Mixture of Experts”, we present V-MoE, a new vision architecture based on a sparse mixture of experts, which we then use to train the largest vision model to date. When transferred to ImageNet, V-MoE matches state-of-the-art accuracy while using about 50% fewer resources than models of comparable performance. We have also open-sourced the code to train sparse models and provided several pre-trained models.

Vision Mixture of Experts (V-MoEs)
Vision Transformers (ViT) have emerged as one of the best architectures for vision tasks. ViT first partitions an image into equally-sized square patches. These are called tokens, a term inherited from language models. Still, compared to the largest language models, ViT models are several orders of magnitude smaller in terms of number of parameters and compute.

To massively scale vision models, we replace some dense feedforward layers (FFN) in the ViT architecture with a sparse mixture of independent FFNs (which we call experts). A learnable router layer selects which experts are chosen (and how they are weighted) for every individual token. That is, different tokens from the same image may be routed to different experts. Each token is only routed to at most K (typically 1 or 2) experts, among a total of E experts (in our experiments, E is typically 32). This allows scaling the model’s size while keeping its computation per token roughly constant. The figure below shows the structure of the encoder blocks in more detail.
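To make the routing concrete, the following is a minimal NumPy sketch of top-K token-to-expert routing. It is illustrative only, not the actual V-MoE implementation: the names and shapes are assumptions, and the real model learns the router jointly with the experts and uses batched, sparse operations rather than a Python loop.

import numpy as np

def top_k_routing(tokens, router_w, experts, k=2):
    # tokens:   (num_tokens, d) token representations
    # router_w: (d, num_experts) router weights (learned in the real model)
    # experts:  list of callables, each mapping (n, d) -> (n, d)
    logits = tokens @ router_w
    logits -= logits.max(axis=-1, keepdims=True)        # numerically stable softmax
    gates = np.exp(logits)
    gates /= gates.sum(axis=-1, keepdims=True)          # routing weights per token
    top_k = np.argsort(-gates, axis=-1)[:, :k]          # k best experts per token

    output = np.zeros_like(tokens)
    for e, expert in enumerate(experts):
        for slot in range(k):
            mask = top_k[:, slot] == e                  # tokens whose slot-th choice is expert e
            if mask.any():
                output[mask] += gates[mask, e:e + 1] * expert(tokens[mask])
    return output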

V-MoE Transformer Encoder block.

Experimental Results
We first pre-train the model once on JFT-300M, a large dataset of images. The left plot below shows our pre-training results for models of all sizes: from the small S/32 to the huge H/14.

We then transfer the model to new downstream tasks (such as ImageNet), by using a new head (the last layer in a model). We explore two transfer setups: either fine-tuning the entire model on all available examples of the new task, or freezing the pre-trained network and tuning only the new head using a few examples (known as few-shot transfer). The right plot in the figure below summarizes our transfer results to ImageNet, training on only 5 images per class (called 5-shot transfer).

JFT-300M Precision@1 and ImageNet 5-shot accuracy. Colors represent different ViT variants and markers represent either standard ViT (●), or V-MoEs (▸) with expert layers on the last n even blocks. We set n=2 for all models, except V-MoE-H where n=5. Higher indicates better performance, with more efficient models being to the left.

In both cases, the sparse model strongly outperforms its dense counterpart at a given amount of training compute (shown by the V-MoE line being above the ViT line), or achieves similar performance much faster (shown by the V-MoE line being to the left of the ViT line).

To explore the limits of vision models, we trained a 15-billion parameter model with 24 MoE layers (out of 48 blocks) on an extended version of JFT-300M. This massive model — the largest to date in vision as far as we know — achieved 90.35% test accuracy on ImageNet after fine-tuning, near the current state-of-the-art.

Priority Routing
In practice, due to hardware constraints, it is not efficient to use buffers with a dynamic size, so models typically use a pre-defined buffer capacity for each expert. Assigned tokens beyond this capacity are dropped and not processed once the expert becomes “full”. As a consequence, higher capacities yield higher accuracy, but they are also more computationally expensive.

We leverage this implementation constraint to make V-MoEs faster at inference time. By decreasing the total combined buffer capacity below the number of tokens to be processed, the network is forced to skip processing some tokens in the expert layers. Instead of choosing the tokens to skip in some arbitrary fashion (as previous works did), the model learns to sort tokens according to an importance score. This maintains high quality predictions while saving a lot of compute. We refer to this approach as Batch Priority Routing (BPR), illustrated below.

Under high capacity, both vanilla and priority routing work well as all patches are processed. However, when the buffer size is reduced to save compute, vanilla routing selects arbitrary patches to process, often leading to poor predictions. BPR smartly prioritizes important patches resulting in better predictions at lower computational costs.

Dropping the right tokens turns out to be essential to deliver high-quality and more efficient inference predictions. When the expert capacity decreases, performance quickly decreases with the vanilla routing mechanism. Conversely, BPR is much more robust to low capacities.

Performance versus inference capacity buffer size (or ratio) C for a V-MoE-H/14 model with K=2. Even for large C’s, BPR improves performance; at low C the difference is quite significant. BPR is competitive with dense models (ViT-H/14) by processing only 15-30% of the tokens.
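The core idea behind BPR can be sketched in a few lines (again illustrative, not the paper's implementation): compute an importance score per token from the routing weights, sort tokens by that score, and fill each expert's fixed-size buffer in that order, dropping whatever no longer fits.

import numpy as np

def batch_priority_assign(gates, capacity):
    # gates:    (num_tokens, num_experts) routing weights after the softmax
    # capacity: maximum number of tokens each expert may process
    # Returns the assigned expert id per token, with -1 marking dropped tokens.
    num_tokens, num_experts = gates.shape
    priority = gates.max(axis=-1)                  # importance score per token
    order = np.argsort(-priority)                  # most important tokens first

    slots_used = np.zeros(num_experts, dtype=int)
    assignment = np.full(num_tokens, -1, dtype=int)
    for t in order:
        e = int(gates[t].argmax())                 # top-1 expert for this token
        if slots_used[e] < capacity:               # buffer still has room
            assignment[t] = e
            slots_used[e] += 1
        # otherwise the token is dropped and skipped by the expert layer
    return assignment

Lowering the capacity directly trades accuracy for compute; with priority ordering, the tokens dropped first are the ones the router cares about least.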

Overall, we observed that V-MoEs are highly flexible at inference time: for instance, one can decrease the number of selected experts per token to save time and compute, without any further training of the model weights.

Exploring V-MoEs
Because much is yet to be discovered about the internal workings of sparse networks, we also explored the routing patterns of the V-MoE.

One hypothesis is that routers would learn to discriminate and assign tokens to experts on some semantic grounds (the “car” expert, the “animal” expert, and so on). To test this, below we show plots for two different MoE layers (one very early in the network, and one closer to the head). The x-axis corresponds to each of the 32 experts, and the y-axis shows the IDs of the image classes (from 1 to 1000). Each entry in the plot shows how often an expert was selected for tokens corresponding to a specific image class, with darker colors indicating higher frequency. While in the early layers there is little correlation, later in the network each expert receives and processes tokens from only a handful of classes. Therefore, we can conclude that some semantic clustering of the patches emerges in the deeper layers of the network.

Deeper routing decisions correlate with image classes. We show two MoE layers of a V-MoE-H/14. The x-axis corresponds to the 32 experts in a layer. The y-axis shows the 1000 ImageNet classes; orderings of both axes differ across plots (to highlight correlations). For each pair (expert e, class c) we show the average routing weight for the tokens corresponding to all images with class c for that particular expert e.
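For readers who want to run a similar analysis on their own sparse models, here is a small sketch of how such an expert-versus-class frequency matrix could be built, assuming per-token top-1 expert choices and per-image labels are available as arrays (all names are illustrative):

import numpy as np

def expert_class_histogram(expert_choice, image_labels, tokens_per_image,
                           num_experts=32, num_classes=1000):
    # expert_choice:    (num_images * tokens_per_image,) top-1 expert id per token
    # image_labels:     (num_images,) class id per image
    # tokens_per_image: number of tokens produced from each image
    # Returns a (num_classes, num_experts) matrix, row-normalized per class.
    token_labels = np.repeat(image_labels, tokens_per_image)
    hist = np.zeros((num_classes, num_experts))
    np.add.at(hist, (token_labels, expert_choice), 1)    # scatter-add the counts
    return hist / hist.sum(axis=1, keepdims=True).clip(min=1)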

Final Thoughts
We train very large vision models using conditional computation, delivering significant improvements in representation and transfer learning for relatively little training cost. Alongside V-MoE, we introduced BPR, which requires the model to process only the most useful tokens in the expert layers.

We believe this is just the beginning of conditional computation at scale for computer vision; extensions include multi-modal and multi-task models, scaling up the expert count, and improving transfer of the representations produced by sparse models. Heterogeneous expert architectures and conditional variable-length routes are also promising directions. Sparse models can especially help in data rich domains such as large-scale video modeling. We hope our open-source code and models help attract and engage researchers new to this field.

Acknowledgments
We thank our co-authors: Basil Mustafa, Maxim Neumann, Rodolphe Jenatton, André Susano Pinto, Daniel Keysers, and Neil Houlsby. We thank Alex Kolesnikov, Lucas Beyer, and Xiaohua Zhai for providing continuous help and details about scaling ViT models. We are also grateful to Josip Djolonga, Ilya Tolstikhin, Liam Fedus, and Barret Zoph for feedback on the paper; James Bradbury, Roy Frostig, Blake Hechtman, Dmitry Lepikhin, Anselm Levskaya, and Parker Schuh for invaluable support helping us run our JAX models efficiently on TPUs; and many others from the Brain team for their support. Finally, we would also like to thank and acknowledge Tom Small for the awesome animated figure used in this post.

Categories
Misc

How Retailers Meet Tough Challenges Using NVIDIA AI

At the National Retail Federation’s annual trade show, conversations tend to touch on recurring themes: “Will we be able to stock must-have products for next Christmas?,” “What incentives can I offer to loyal workers?” and “What happens to my margins if Susie Consumer purchases three of the same dresses online and returns two?” The $26


Categories
Misc

AI Startup to Take a Bite Out of Fast-Food Labor Crunch

Addressing a growing labor crisis among quick-service restaurants, startup Vistry is harnessing AI to automate the process of taking orders. The company will share its story at the NRF Big Show, the annual industry gathering of the National Retail Federation in New York, starting Jan. 16. “They’re closing restaurants because there is not enough labor,”


Categories
Misc

GFN Thursday: ‘Fortnite’ Comes to iOS Safari and Android Through NVIDIA GeForce NOW via Closed Beta

Starting next week, Fortnite on GeForce NOW will launch in a limited-time closed beta for mobile, all streamed through the Safari web browser on iOS and the GeForce NOW Android app. The beta is open for registration for all GeForce NOW members, and will help test our server capacity, graphics delivery and new touch controls


Categories
Misc

World Record-Setting DNA Sequencing Technique Helps Clinicians Rapidly Diagnose Critical Care Patients

Cutting down the time needed to sequence and analyze a patient’s whole genome from days to hours isn’t just about clinical efficiency — it can save lives. By accelerating every step of this process — from collecting a blood sample to sequencing the whole genome to identifying variants linked to diseases — a research team


Categories
Misc

Custom model.predict() function

Hi!

I have a prediction routine that involves doing some postprocessing of the output of the model.predict(x) function. The postprocessing involves a comparison of the output to the mean output of all training data. The process has worked well until now, but I would like to combine it all, mean training vector included, into a TF SavedModel. That is, I’m trying to get the final output (postprocessing included) when calling model.predict(x).

Is there any way to customize the functionality of the model.predict(x) function?

What my current pipeline looks like:

mean_training_output = ...  # an array consisting of the mean output vector from the training data
predicted = model.predict(x)

# Compare distance of new and mean training output
normalized_distance = np.zeros(len(predicted))
for i in range(len(predicted)):
    normalized_distance[i] = np.linalg.norm(feature_vectors_flattned[i] - mean_training_output)

# What I actually want model.predict() to output
normalized_distance

So in the above snippet I would actually want model.predict() to output normalized_distance.
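One possible approach (a sketch, assuming TF 2.x, that the mean output fits in a tf.constant, and that model, x, and mean_training_output are as in the snippet above; the class and file names are just placeholders) is to wrap the trained model and the postprocessing in a small tf.keras.Model whose call() returns the normalized distance, and export that wrapper as the SavedModel:

import tensorflow as tf

class DistanceModel(tf.keras.Model):
    # Wraps a trained model and returns the distance of its output
    # to a fixed mean vector computed from the training data.
    def __init__(self, base_model, mean_training_output):
        super().__init__()
        self.base_model = base_model
        self.mean_vec = tf.constant(mean_training_output, dtype=tf.float32)

    def call(self, inputs):
        features = self.base_model(inputs)
        # Euclidean distance per example, replacing the NumPy postprocessing
        return tf.norm(features - self.mean_vec, axis=-1)

wrapped = DistanceModel(model, mean_training_output)
_ = wrapped.predict(x)                # call once so the subclassed model gets built
wrapped.save("distance_savedmodel")   # the SavedModel now carries the postprocessing

After this, wrapped.predict(x) (or the loaded SavedModel) returns the normalized distances directly.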

submitted by /u/Maltmax

Categories
Misc

Slice of 20 elements from rank 1 tensor then reshaping throws "Input to reshape is tensor with 10272 values, but requested shape requires multiple of 20"

I posted this question to stack exchange here:

https://stackoverflow.com/questions/70686521/slice-of-20-elements-of-rank1-tensor-then-reshaping-throws-input-to-reshape-is

My input tensor Data = Input(shape=(856,)) is a vector of float32 values concatenated from many different devices. I am trying to apply different TensorFlow functions to different subslices of each input chunk. Some of these functions include a 1D convolution, which requires a reshape.

slice = Data[:20]

reshape = tf.reshape(slice, (-1, 20, 1))

Doing this crashes after trying to fit my model. It throws the following errors:

tensorflow.python.framework.errors_impl.InvalidArgumentError: Input to reshape is a tensor with 10272 values, but the requested shape requires a multiple of 20
    [[node model/tf.reshape_1/Reshape
    (defined at /home/.local/lib/python3.8/site-packages/keras/layers/core/tf_op_layer.py:261)
    ]] [Op:__inference_train_function_1858]

Errors may have originated from an input operation.
Input Source operations connected to node model/tf.reshape_1/Reshape:
In[0] model/tf.__operators__.getitem_1/strided_slice:
In[1] model/tf.reshape_1/Reshape/shape:

I am not sure how slicing 20 elements from a tensor of 856 could result in a tensor of 10272 values.
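(One likely explanation, offered as a sketch: Input(shape=(856,)) produces a symbolic tensor of shape (None, 856), so Data[:20] slices the batch axis rather than the feature axis. A batch of 12 examples then yields 12 × 856 = 10272 values, which is not a multiple of 20. Slicing the second axis instead keeps the 20-element feature slice:)

import tensorflow as tf
from tensorflow.keras import Input

Data = Input(shape=(856,))                    # symbolic shape: (None, 856)
first_20 = Data[:, :20]                       # slice the feature axis, keep the batch axis
reshaped = tf.reshape(first_20, (-1, 20, 1))  # (batch, 20, 1), ready for a 1D convolution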

I have also tried using the tf.slice function a couple of different ways; both fail. Referencing the docs: https://www.tensorflow.org/guide/tensor_slicing

slice = tf.slice(Data, begin=[0], size=[20])

And fails, stating:

Shape must be rank 1 but is rank 2 for ‘{{node tf.slice/Slice}} = Slice[Index=DT_INT32, T=DT_FLOAT](Placeholder, tf.slice/Slice/begin, tf.slice/Slice/size)’ with input shapes: [?,856], [1], [1].

For reference, here is what some of the values look like in the input data

array([-9.55784683e+01, -1.70557899e+01,  2.95967350e+01,  7.81378937e+00,
        9.02729130e+00,  5.49621725e+00,  4.19811630e+00,  5.84186697e+00,
        4.90438080e+00,  3.73845983e+00,  5.12300587e+00,  2.61530232e+00,
        2.67061424e+00,  3.91038632e+00,  2.31110978e+00,  4.20644665e+00,
        4.50000000e+00,  9.87345278e-01,  1.59740388e+00,  6.30727148e+00,

submitted by /u/EightEqualsEqualsDe

Categories
Misc

Developing Accelerated Code with Standard Language Parallelism

Learn how standard language parallelism can be used for programming accelerated computing applications on NVIDIA GPUs with ISO C++, ISO Fortran, or Python.

The NVIDIA platform is the most mature and complete platform for accelerated computing. In this post, I address the simplest, most productive, and most portable approach to accelerated computing. There are three approaches that you can take for programming GPUs (Figure 1).

Three boxes showing three approaches to GPU development. First, C++, Fortran, and Python parallel features. Second, compiler directives like OpenACC and OpenMP. And third, platform languages like CUDA. These approaches are layered on top of several available accelerated libraries.
Figure 1. Three approaches to programming the NVIDIA platform

CUDA C++ and Fortran are the innovation ground where NVIDIA can expose new hardware and software innovations, and where you can tune your applications to achieve the best possible performance on NVIDIA GPUs. Many developers assume that this is how NVIDIA expects everyone to program for GPUs.

Instead, we expect that developers coming to the NVIDIA platform for the first time will use standard, parallel programming languages, such as ISO C++, ISO Fortran, and Python. In this post, I highlight some successes in using this approach to parallel programming to demonstrate the most productive path to entering the NVIDIA CUDA ecosystem.

The foundation of the NVIDIA strategy is providing a rich, mature set of SDKs and libraries on which applications can be built. NVIDIA already provides highly tuned math libraries, such as cuBLAS, cuSolver, and cuFFT; core libraries, such as Thrust and libcu++; and communication libraries, such as NCCL and NVSHMEM, as well as other packages and frameworks on which you can build your applications.

On top of this, NVIDIA layers the three different programming approaches:

  • Standard language parallelism, which is the subject of this post
  • Languages for platform specialization, such as CUDA C++ and CUDA Fortran for obtaining the best possible performance on the NVIDIA platform
  • Compiler directives, bridging the gap between these two approaches by enabling incremental performance optimization

Each of these approaches makes tradeoffs in terms of performance, productivity, and code portability. As they can all interoperate, you don’t have to use a particular model but can mix any or all as desired.

If you start writing code using parallelism in standard programming languages, then you can come to the NVIDIA platform or any other platform with baseline code that is already capable of running in parallel. This is why we have invested more than a decade collaborating in the standard language committees on the adoption of features to enable parallel programming without the need for additional extensions or APIs. Standard language parallelism is a rising tide that raises all boats.

ISO C++

The C++ programming language is consistently among the top programming languages in recent studies of programming trends. It has seen a significant increase in usage in scientific computing. The richness of its Standard Template Library makes it a highly productive language for new code development and, since the release of C++17, it has supported several important features for parallel programming.

I’ve seen several applications get refactored away from traditional for loops in favor of these C++ parallel algorithms. Here are the results from a few of them.

Lulesh

Lulesh is a hydrodynamics mini-app from Lawrence Livermore National Laboratory (LLNL), written in C++. The mini-app has several versions for evaluating different programming approaches, both in terms of the quality of the code and performance. We worked with the developers to rewrite their existing OpenMP-based code to use C++ Parallel Algorithms. Figure 2 shows an example of just one of the application’s important functions.

Shows code for OpenMP and ISO C++ versions, where the ISO C++ version is significantly more concise and easier to understand. It is noted that the code is also ISO standard and can be built with multiple compilers.
Figure 2. Refactoring Lulesh from OpenMP to ISO C++ parallelism results in code that is simpler, easier to read, ISO standard, and portable to all compilers that support ISO C++

The code on the left uses OpenMP to parallelize the loops in the code across CPU threads. To maintain both a serial and parallel version of the code, the developers used #ifdef macros and compiler pragmas. The result is repeated code and the introduction of an additional API, OpenMP, into the source.

The code on the right is the same routine, but rewritten using the C++ transform_reduce algorithm. The resulting code is much more compact, making it less error prone, easier to read, and more maintainable. It also removes the dependency on OpenMP, relying instead on the C++ standard template library, while maintaining a single source code for all platforms. This code is fully ISO C++ compliant, capable of being built by any C++ compiler that supports C++17. As it turns out, it is faster too!

A bar chart comparing performance of OpenMP code with g++ and nvc++ compilers, which are roughly equal, to ISO C++ code running on AMD EPYC CPUs and NVIDIA GPUs. The ISO C++ code is up to 13.5X faster and runs on CPUs and GPUs without modification.
Figure 3. The ISO C++ version of Lulesh is faster than the original OpenMP code and portable to multiple compilers and between the CPU and GPU

As a performance baseline, we used the OpenMP code running on all cores of an AMD EPYC 7742 processor and built with GCC. Rebuilding this baseline code using NVIDIA nvc++ compiler achieves essentially the same performance on the CPU.

If you instead build the ISO C++ code with the same version of GCC and run it on the same CPU, performance improves by roughly 50%, thanks to reduced overheads and better opportunities for the compiler to optimize the code.

This turns into a 2X performance improvement when building this code using nvc++ and running on the same CPU. This is already an exciting achievement but to top that off, you can build this same code, changing only a compiler option to target an NVIDIA GPU instead of a multicore CPU. Now that same code runs more than 13X faster by running on an NVIDIA A100 GPU. There’s a 13.5X performance improvement from the original code, running in parallel both on the CPU and GPU, using strictly ISO C++ code.

STLBM

Another example of an application using C++ Standard Parallelism is STLBM, a Lattice-Boltzmann solver from the University of Geneva. Professor Jonas Latt discussed this application in several GTC sessions, showing how code written in ISO C++ without any external SDK dependencies can run with multiple compilers and on multiple hardware platforms, including NVIDIA GPUs. For more information, see Fluid Dynamics on GPUs with C++ Parallel Algorithms: State-of-the-Art Performance through a Hardware-Agnostic Approach and Porting a Scientific Application to GPU Using C++ Standard Parallelism.

His application achieves more than a 12X performance improvement using GPUs. What is notable is that his baseline for comparison is a source code that is parallel by default, using the parallel algorithms in the C++17 standard template library to express the parallelism inherent in the application.

He categorized the experience of using ISO C++ to program for GPUs as “a paradigm shift in cross-platform CPU/GPU programming.” Rather than writing an application that is serial by default and then adding parallelism later, his team has written an application that is ready for any parallel platform on which they wish to run.

A slide showing a performance graph with the same C++ code running on a 2-Socket Xeon server and on an NVIDIA A100 GPU. The GPU performance is more than 12X better than the full CPU Performance.
Figure 4. STLBM is capable of running the same source code on multicore CPU nodes and on NVIDIA GPUs

NVIDIA is heavily invested in the continued development of parallelism and concurrency in C++ and has coauthored a variety of proposals for the upcoming C++23 specification to further improve your ability to write code that is parallel-first.

ISO Fortran

Fortran remains a language whose primary focus is scientific and high-performance computing. Originally the FORmula TRANslator, Fortran provides a variety of advantages to both developers and compilers, and it has a huge existing code base of modeling and simulation codes.

Fortran began adding features to support parallel programming in Fortran 2008, enhanced these capabilities in Fortran 2018, and continues to refine them in the upcoming version, currently referred to as Fortran 202X. Just as with ISO C++, NVIDIA has been working with application developers to use standard language parallelism in Fortran to modernize their applications and make them parallel-first.

Computational chemistry

My colleague Jeff Hammond, in his FortranCon2021: Standard Fortran on GPUs and its utility in quantum chemistry codes session, presented some promising results using Fortran do concurrent loops in kernels taken from the NWChem application and also GAMESS, another computational chemistry application.

For NWChem, he isolated several performance-critical loops that perform tensor contractions and wrote them using several programming models. On multicore CPUs, these tensor contractions use OpenMP for threading across CPU cores. For GPUs, there are versions available using OpenACC, OpenMP target offloading, and now Fortran do concurrent loops.

Figure 5 shows that the do concurrent loops perform at the same level as both OpenACC and OpenMP target offloading on NVIDIA GPUs but without the need to include these additional APIs in the application. This is all standard Fortran.

A point graph showing performance of several NWChem tensor contraction kernels. For each kernel, the OpenMP CPU performance is similar in terms of GF per second and the GPU performance of the pure Fortran version to the directive-based versions is similar and much higher than the CPU performance.
Figure 5. Performance of a range of NWChem application kernels using several programming models

High-performance flux transport

At the recent Workshop for Accelerator Programming Using Directives (WACCPD), co-located with the SC21 conference, a team of developers from Predictive Science Inc. presented the results of refactoring one of their production codes, which previously used OpenACC to run on NVIDIA GPUs, to use do concurrent loops instead.

They compared the results of building this purely ISO Fortran application using NVIDIA nvfortran, gfortran, and ifort. They concluded that, for their application when using the nvfortran compiler, pure Fortran gave the performance that they required without the need for any directives. Furthermore, this code could run in parallel on GPUs and multicore CPUs without modification.

A screenshot of a presentation slide showing compiler options used to build the application code with nvfortran and performance results. The performance results show very good performance using do concurrent when compared to OpenACC on both the CPU and GPU.
Figure 6. Performance results for HPFT benchmark using nvfortran compiler

This paper received the award for best paper at the workshop, even though it required no directives at all for accelerator programming. When asked whether they would continue the standard language parallelism approach in their other applications, the presenter replied that they already have plans to adopt this approach in other important applications for their company.

Python with Legate and cuNumeric

The Python language has had a meteoric rise in popularity over the past decade. It is now commonly used in machine learning, data science, and even traditional modeling and simulation applications. Although Python is not an ISO programming language, like C++ and Fortran, we are implementing the spirit of standard language parallelism in the Python language as well.

In his keynote address at GTC’21 Fall, NVIDIA CEO Jensen Huang introduced the alpha release of cuNumeric, a library modeled after NumPy that enables features similar to those I have discussed for ISO C++ and Fortran. The NumPy package is so prevalent in Python development that it is a near certainty that any HPC application written in Python uses it.

The cuNumeric package, written on top of a package called Legate, enables NumPy applications to automatically scale their work not only onto GPUs but across GPUs in a large cluster. For several example applications, I’ve seen that simply replacing references to NumPy with references to cuNumeric let me weakly scale the application to the full size of the NVIDIA internal cluster, Selene, which is among the 10 fastest supercomputers in the world.
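As a rough illustration of what that change looks like (a sketch, assuming the alpha package is importable as cunumeric; the stencil itself is a generic example rather than one of the applications mentioned above):

import cunumeric as np   # the only change from "import numpy as np"

grid = np.zeros((10_000, 10_000))
grid[0, :] = 1.0                     # boundary condition

# Jacobi-style stencil update; cuNumeric/Legate partitions the arrays and
# runs the array operations across the available GPUs (or across a cluster).
for _ in range(100):
    grid[1:-1, 1:-1] = 0.25 * (grid[:-2, 1:-1] + grid[2:, 1:-1]
                               + grid[1:-1, :-2] + grid[1:-1, 2:])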

For more information about cuNumeric, see NVIDIA Announces Availability for cuNumeric Public Alpha and watch the GTC On-Demand session, Legate: Scaling the Python Ecosystem.

Conclusion

I hope this post has inspired you to see that GPU programming is not as difficult as you may have heard. If you use standard language parallelism, it may even be possible without any code changes at all.

NVIDIA encourages you to write applications parallel-first, so that there is never a need to “port” applications to new platforms. Standard language parallelism is the best approach to doing this, as it requires nothing more than the ISO standard languages. This is why we continue to invest in the ISO programming languages and in bringing even more features for parallelism and concurrency to these languages.

In summary, using standard language parallelism has the following benefits:

  • Full ISO language compliance, resulting in more portable code
  • Code that is more compact, easier to read, and less error prone
  • Code that is parallel by default, so it can run without modification on more platforms

Several talks from GTC’21, along with the resources linked throughout this post, provide even more detail about this approach to parallel programming.

Categories
Misc

Elevated Entertainment: SHIELD Experience 9.0 Upgrade Rolling Out Now

SHIELD Software Experience Upgrade 9.0 is rolling out to all NVIDIA SHIELD TVs, delivering the Android 11 operating system and more. An updated Gboard — the Google Keyboard — allows people to use their voices and the Google Assistant to discover content in all search boxes. Additional permissions let users customize privacy across apps, including


Categories
Misc

NVIDIA Named America’s Best Place to Work on Latest Glassdoor List

NVIDIA is America’s best place to work, according to Glassdoor’s just-issued list of best employers for 2022. Amid a global pandemic that has affected every workplace, NVIDIA was ranked No. 1 on Glassdoor’s 14th annual Best Places to Work list for large US companies. The award is based on anonymous employee feedback covering thousands of
