General and Scalable Parallelization for Neural Networks

Scaling neural networks, whether it be the amount of training data used, the model size or the computation being utilized, has been critical for improving model quality in many real-world machine learning applications, such as computer vision, language understanding and neural machine translation. This, in turn, has motivated recent studies to scrutinize the factors that play a critical role in the success of scaling a neural model. Although increasing model capacity can be a sound approach to improve model quality, doing so presents a number of systems and software engineering challenges that must be overcome. For instance, in order to train large models that exceed the memory capacity of an accelerator, it becomes necessary to partition the weights and the computation of the model across multiple accelerators. This process of parallelization increases the network communication overhead and can result in device under-utilization. Moreover, a given algorithm for parallelization, which typically requires a significant amount of engineering effort, may not work with different model architectures.

To address these scaling challenges, we present “GSPMD: General and Scalable Parallelization for ML Computation Graphs”, in which we describe an open-source automatic parallelization system based on the XLA compiler. GSPMD is capable of scaling most deep learning network architectures and has already been applied to many deep learning models, such as GShard-M4, LaMDA, BigSSL, ViT, and MetNet-2, leading to state-of-the-art results across several domains. GSPMD has also been integrated into multiple ML frameworks, including TensorFlow and JAX, which use XLA as a shared compiler.

GSPMD separates the task of programming an ML model from the challenge of parallelization. It allows model developers to write programs as if they were run on a single device with very high memory and computation capacity — the user simply needs to add a few lines of annotation code to a subset of critical tensors in the model code to indicate how to partition the tensors. For example, to train a large model-parallel Transformer, one may only need to annotate fewer than 10 tensors (less than 1% of all tensors in the entire computation graph), one line of additional code per tensor. Then GSPMD runs a compiler pass that determines the entire graph’s parallelization plan, and transforms it into a mathematically equivalent, parallelized computation that can be executed on each device. This allows users to focus on model building instead of parallelization implementation, and enables easy porting of existing single-device programs to run at a much larger scale.

The separation of model programming and parallelism also allows developers to minimize code duplication. With GSPMD, developers may employ different parallelism algorithms for different use cases without the need to reimplement the model. For example, the model code that powered the GShard-M4 and LaMDA models can apply a variety of parallelization strategies appropriate for different models and cluster sizes with the same model implementation. Similarly, by applying GSPMD, the BigSSL large speech models can share the same implementation with previous smaller models.

Generality and Flexibility
Because different model architectures may be better suited to different parallelization strategies, GSPMD is designed to support a large variety of parallelism algorithms appropriate for different use cases. For example, with smaller models that fit within the memory of a single accelerator, data parallelism is preferred, in which devices train the same model using different input data. In contrast, models that are larger than a single accelerator’s memory capacity are better suited for a pipelining algorithm (like that employed by GPipe) that partitions the model into multiple, sequential stages, or operator-level parallelism (e.g., Mesh-TensorFlow), in which individual computation operators in the model are split into smaller, parallel operators.

GSPMD supports all the above parallelization algorithms with a uniform abstraction and implementation. Moreover, GSPMD supports nested patterns of parallelism. For example, it can be used to partition models into individual pipeline stages, each of which can be further partitioned using operator-level parallelism.

GSPMD also facilitates innovation on parallelism algorithms by allowing performance experts to focus on algorithms that best utilize the hardware, instead of the implementation that involves lots of cross-device communications. For example, for large Transformer models, we found a novel operator-level parallelism algorithm that partitions multiple dimensions of tensors on a 2D mesh of devices. It reduces peak accelerator memory usage linearly with the number of training devices, while maintaining a high utilization of accelerator compute due to its balanced data distribution over multiple dimensions.

To illustrate this, consider a simplified feedforward layer in a Transformer model that has been annotated in the above way. To execute the first matrix multiply on fully partitioned input data, GSPMD applies an MPI-style AllGather communication operator to partially merge with partitioned data from another device. It then executes the matrix multiply locally and produces a partitioned result. Before the second matrix multiply, GSPMD adds another AllGather on the right-hand side input, and executes the matrix multiply locally, yielding intermediate results that will then need to be combined and partitioned. For this, GSPMD adds an MPI-style ReduceScatter communication operator that accumulates and partitions these intermediate results. While the tensors generated with the AllGather operator at each stage are larger than the original partition size, they are short-lived and the corresponding memory buffers will be freed after use, which does not affect peak memory usage in training.
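The communication pattern can be imitated on a single machine. The following is a minimal NumPy sketch, not GSPMD itself: it uses a simplified one-dimensional partitioning instead of the 2D mesh, omits the nonlinearity between the two matrix multiplies, and the helper names (all_gather, reduce_scatter, sharded_feedforward) are hypothetical. It only shows that gathering, multiplying locally, and reduce-scattering yields the same result as the unpartitioned computation.

```python
import numpy as np

def all_gather(shards, axis):
    """Every simulated device ends up with the concatenation of all shards."""
    full = np.concatenate(shards, axis=axis)
    return [full.copy() for _ in shards]

def reduce_scatter(partials, axis):
    """Sum the per-device partial results, then hand each device one slice."""
    total = np.sum(partials, axis=0)
    return np.array_split(total, len(partials), axis=axis)

def sharded_feedforward(x, w1, w2, num_devices):
    # Initial layout: x is split by rows (batch); w1 by columns and w2 by rows,
    # both along the hidden dimension. Each device holds one shard of each.
    x_shards = np.array_split(x, num_devices, axis=0)
    w1_shards = np.array_split(w1, num_devices, axis=1)
    w2_shards = np.array_split(w2, num_devices, axis=0)

    # AllGather: each device temporarily materializes the full input.
    x_full = all_gather(x_shards, axis=0)

    # Local matmuls: device d computes its slice of the hidden activations,
    # then a full-sized output that is only a partial sum.
    partials = [x_full[d] @ w1_shards[d] @ w2_shards[d] for d in range(num_devices)]

    # ReduceScatter: accumulate the partial sums and re-partition by rows.
    return reduce_scatter(partials, axis=0)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
w1 = rng.normal(size=(4, 16))
w2 = rng.normal(size=(16, 4))

out_shards = sharded_feedforward(x, w1, w2, num_devices=4)
reference = x @ w1 @ w2  # single-device result
print(np.allclose(np.concatenate(out_shards, axis=0), reference))
```

The gathered copy of the input is the short-lived, larger-than-a-shard buffer mentioned above; only the scattered output shards persist.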

Left: A simplified feedforward layer of a Transformer model. Blue rectangles represent tensors with dashed red & blue lines overlaid representing the desired partitioning across a 2×2 mesh of devices. Right: A single partition, after GSPMD has been applied.

A Transformer Example with Nested Parallelism
As a shared, robust mechanism for different parallelism modes, GSPMD allows users to conveniently switch between modes in different parts of a model. This is particularly valuable for models that may have different components with distinct performance characteristics, for example, multimodal models that handle both images and audio. Consider a model with the Transformer encoder-decoder architecture, which has an embedding layer, an encoder stack with Mixture-of-Expert layers, a decoder stack with dense feedforward layers, and a final softmax layer. In GSPMD, a complex combination of several parallelism modes that treats each layer separately can be achieved with simple configurations.

In the figure below, we show a partitioning strategy over 16 devices organized as a logical 4×4 mesh. Blue represents partitioning along the first mesh dimension X, and yellow represents partitioning along the second mesh dimension Y. X and Y are repurposed for different model components to achieve different parallelism modes. For example, the X dimension is used for data parallelism in the embedding and softmax layers, but used for pipeline parallelism in the encoder and decoder. The Y dimension is also used in different ways to partition the vocabulary, batch or model expert dimensions.
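To make the mesh terminology concrete, here is a small pure-Python sketch of a logical 4×4 mesh. The row-major device numbering and the helper names are assumptions made for illustration; the source does not specify how GSPMD assigns physical devices to mesh coordinates.

```python
MESH_X, MESH_Y = 4, 4  # logical mesh shape from the text: 16 devices as 4x4

def mesh_coords(device_id):
    """Map a flat device id to (x, y) coordinates (assumed row-major layout)."""
    return divmod(device_id, MESH_Y)

def groups_along(axis):
    """Groups of devices that differ only in their coordinate on `axis`.

    When a tensor dimension is partitioned along a mesh axis, the shards of
    that dimension live on the devices of one such group.
    """
    groups = {}
    for device_id in range(MESH_X * MESH_Y):
        x, y = mesh_coords(device_id)
        fixed = y if axis == "X" else x  # the coordinate that stays constant
        groups.setdefault(fixed, []).append(device_id)
    return list(groups.values())

x_groups = groups_along("X")  # e.g. batch shards or pipeline stages
y_groups = groups_along("Y")  # e.g. vocabulary or expert shards
print(x_groups[0], y_groups[0])
```

Reusing the same two axes for different purposes in different layers only changes which tensor dimension is mapped onto each group; the groups themselves stay fixed.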

Computation Efficiency
GSPMD provides industry-leading performance in large model training. Parallel models require extra communication to coordinate multiple devices to do the computation. So parallel model efficiency can be estimated by examining the fraction of time spent on communication overhead — the higher percentage utilization and the less time spent on communication, the better. In the recent MLPerf set of performance benchmarks, a BERT-like encoder-only model with ~500 billion parameters to which we applied GSPMD for parallelization over 2048 TPU-V4 chips yielded highly competitive results (see table below), utilizing up to 63% of the peak FLOPS that the TPU-V4s offer. We also provide efficiency benchmarks for some representative large models in the table below. These example model configs are open sourced in the Lingvo framework along with instructions to run them on Google Cloud. More benchmark results can be found in the experiment section of our paper.

Model Family | Parameter Count | % of model activated* | No. of Experts** | No. of Layers | No. of TPU | FLOPS utilization
Dense Decoder (LaMDA) | 137B | 100% | 1 | 64 | 1024 TPUv3 | 56.5%
Dense Encoder (MLPerf-Bert) | 480B | 100% | 1 | 64 | 2048 TPUv4 | 63%
Sparsely Activated Encoder-Decoder (GShard-M4) | 577B | 0.25% | 2048 | 32 | 1024 TPUv3 | 46.8%
Sparsely Activated Decoder | 1.2T | 8% | 64 | 64 | 1024 TPUv3 | 53.8%
*The fraction of the model activated during inference, which is a measure of model sparsity.
**Number of experts included in the Mixture of Experts layer. A value of 1 corresponds to a standard Transformer, without a Mixture of Experts layer.

The ongoing development and success of many useful machine learning applications, such as NLP, speech recognition, machine translation, and autonomous driving, depend on achieving the highest accuracy possible. As this often requires building larger and even more complex models, we are pleased to share the GSPMD paper and the corresponding open-source library to the broader research community, and we hope it is useful for efficient training of large-scale deep neural networks.

We wish to thank Claire Cui, Zhifeng Chen, Yonghui Wu, Naveen Kumar, Macduff Hughes, Zoubin Ghahramani and Jeff Dean for their support and invaluable input. Special thanks to our collaborators Dmitry Lepikhin, HyoukJoong Lee, Dehao Chen, Orhan Firat, Maxim Krikun, Blake Hechtman, Rahul Joshi, Andy Li, Tao Wang, Marcello Maggioni, David Majnemer, Noam Shazeer, Ankur Bapna, Sneha Kudugunta, Quoc Le, Mia Chen, Shibo Wang, Jinliang Wei, Ruoming Pang, Zongwei Zhou, David So, Yanqi Zhou, Ben Lee, Jonathan Shen, James Qin, Yu Zhang, Wei Han, Anmol Gulati, Laurent El Shafey, Andrew Dai, Kun Zhang, Nan Du, James Bradbury, Matthew Johnson, Anselm Levskaya, Skye Wanderman-Milne‎, and Qiao Zhang for helpful discussions and inspirations.


Artisan Baking: How Creators Worldwide Cooked Up GTC Keynote’s Virtual Kitchen

With their marbled counters, neoclassical oven alcove and iconic bouquets of spatulas, the “kitchen keynotes” delivered by NVIDIA founder and CEO Jensen Huang during pandemic-era GTCs have been a memorable setting for the highly anticipated events. The keynotes were initially delivered from his real kitchen, in response to workplace closures. But last spring, the kitchen …

The post Artisan Baking: How Creators Worldwide Cooked Up GTC Keynote’s Virtual Kitchen appeared first on The Official NVIDIA Blog.


Accelerating Financial Services With AI

AI is enabling brighter financial futures for consumers and businesses. From traditional banks to new fintechs, the financial services industry is powering use cases with AI such as preventing payments fraud, automating insurance claims, and accelerating trading strategies. The latest episode in the I AM AI video series brings these technology stories to life by …

The post Accelerating Financial Services With AI appeared first on The Official NVIDIA Blog.


Deep Learning vs Machine Learning Challenger Models for Default Risk with Explainability

This post details credit default risk prediction with deep learning and machine learning models.

Data Scientists and Machine Learning Engineers often face the dilemma of “machine learning compared to deep learning” classifier usage for their business problems. Depending upon the nature of the dataset, some data scientists prefer classical machine-learning approaches. Others apply the latest deep learning models, while still others pursue an “ensemble” model hoping to get the best of both worlds – explainability and performance. 

Machine learning, from decision trees up to the more advanced XGBoost models, matured earlier than deep learning and has some well-established methods. Deep learning excels in the non-tabular computer vision, language, and speech recognition domains. Whichever you pick, GPUs are accelerating data science use cases to the point where any data analysis on large datasets simply requires them for everyday convenience, rapid iteration, and results.

RAPIDS makes leveraging GPUs easier for data scientists through interfaces similar to favorites like scikit-learn and pandas. Here, we are working with a tabular dataset. The classic extract-transform-load (ETL) process is a core starting point in any data science project.

For GPU-accelerated use cases, the NVIDIA MERLIN application framework for deep recommender systems uses NVTabular – an accelerated feature engineering, preprocessing, and data loading library, which can also be used in other domains such as financial services.

In this article, we demonstrate how to examine competing models, known as challenger models, and use GPU-acceleration to succeed with easy, cost effective, and understandable application of model explainability. When GPU-acceleration is utilized many times during model development, the modeler’s time is used much more effectively by amortizing the training time and cost reductions over dozens of model iterations.

We do this in the context of predicting mortgage delinquencies using the public Fannie Mae mortgage dataset. We also show the simple speedups obtained by an NVTabular data loader for model training. This same example could be extended to credit underwriting, credit card delinquency, or a host of other important class-imbalance binary classification problems.

A common theme in all financial credit risk modeling is the concern for the expected loss. Whether the transaction is a trading agreement between two counterparties where one party owes the other party some amount, or a loan agreement in which the borrower owes the lender monthly repayment amounts, we can look at the expected loss, EL, in the following way:

EL = PD × LGD × EAD

  • PD: the probability of default, taking into account all loans in the population
  • LGD: the loss given default; a value between 0 and 1, which measures the percentage of unpaid loan
  • EAD: the exposure at default, which is the outstanding balance remaining

The PD and EL are attached to a time period, which often can be set to annually or monthly depending on the choice of the firm issuing the loan. 
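As a worked example, the sketch below evaluates the expected loss for one loan, assuming the standard product form EL = PD × LGD × EAD. The loan figures are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def expected_loss(pd_rate, lgd, ead):
    """EL = PD * LGD * EAD for one loan over one period."""
    if not 0.0 <= pd_rate <= 1.0 or not 0.0 <= lgd <= 1.0:
        raise ValueError("PD and LGD must lie in [0, 1]")
    return pd_rate * lgd * ead

# Hypothetical loan: 4% annual PD, 40% LGD, $200,000 outstanding balance.
el = expected_loss(pd_rate=0.04, lgd=0.40, ead=200_000)
print(f"Expected annual loss: ${el:,.2f}")
```

Because PD is tied to a time period, the same loan evaluated with a monthly PD would give a monthly rather than an annual expected loss.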

In our case here, the goal is to predict specific individual loans which are most likely to become delinquent, based on their characteristics or features. Thus, we are mostly concentrating on the loans that affect the PD rate, that is, separating those with an expected loss from those with no expected loss.

Machine Learning and Deep Learning approaches

Machine learning (ML) and deep learning (DL) have evolved into cooperative and competing approaches for analytical prediction. It is becoming best practice to consider both approaches and weigh the outcomes of each individual model, or to ensemble multiple methods to get the best of both worlds for a given application. Either method can extract deep, complex insights out of data to help make decisions.

In many cases, using more advanced ML models delivers real business value over traditional regression models. However, explaining what drove a particular decision with a more advanced model can be difficult, time consuming, and expensive using traditional infrastructure. The run time of the explanation steps used to interpret predictions matters as much as the run time of the model itself.

In order to be confident in the results, we want to address the new demand for explainability. Existing techniques can be slow, are computationally expensive, and are ideal candidates for GPU acceleration. By moving to GPU accelerated modeling and explainability, teams can improve the processing, accuracy, explainability, and provide results when the business wants them.

Predicting risk in mortgage loans

Default risk can affect us personally when viewed from the consumer side, and it affects the issuer side as well. Today, in many countries, there are a number of loans being issued for infrastructure improvement projects. One large highway bridge, for example, can require over a billion U.S. dollars of debt funding. Obviously, financing huge billion-dollar projects comes with default risk.

Measuring the probability of default is important as the citizens of the governing body certainly do not want to see a default on that issuance. A paper by the Bank of England entitled Machine learning explainability in finance: an application to default risk analysis served as inspiration for the current body of work, which focuses on housing mortgage loans.

The benefits of having this explanation transparency are easy to see in the Bank of England paper. The authors call their approach Quantitative Input Influence (QII) and QII is used for linear logistic and gradient boosted tree machine-learning prediction models. The question arises: what factors contribute most to the defaults?

The authors illustrate the intuitive power of the explanations. They also make observations, which financial modeling practitioners should take note of. The paper demonstrates the ability to engineer adequate accuracy, precision, and recall for default prediction with their precision-recall curve results.

Model explainability may be an important component of discussions with thought leaders, management, external auditors, and regulators. The Shapley values are computed using open source code where the data set consists of six million loans in the U.K. with an approximate 2.5% default rate.

As described in the recent article by NVIDIA authors, default risk is a very common debt use case in the capital markets, banking, and insurance. For example, credit derivatives are a way to speculate on the likelihood of default by tranche. Closer to home, mortgage lenders are deeply interested in whether they will be paid back in a timely fashion.

Ultimately, insurance contracts are ways to cover the risks associated with flooding, theft, or death from the customer viewpoint. From the lender or insurance underwriting viewpoint, predicting these events is critical to the profitability of their business. With the well-known U.S. Fannie Mae public mortgage loan dataset, we are able to examine approaches to risk and the out-of-sample precision and recall, using GPU-accelerated training of ML and DL models.

Please see the original article introducing the GPU-accelerated explainability for credit risk and the extension for Shap value clustering and see also this related article for additional interesting explainability and acceleration results for simulating equity instruments.

In this article, the focus is on the nuances of the ML and DL models and the approaches for explainability. When a loan becomes 90 days past due, the lending company becomes concerned. The probability of default is of concern due to the replacement costs. A key result of this article is the reported 29-fold speedup in computing Shap values when GPU-acceleration is applied with an algorithm as outlined in this GPUTreeShap paper.

Our Python program predicting defaults for the U.S. Fannie Mae mortgage dataset will use the GPU-accelerated framework known as RAPIDS. RAPIDS provides an application program interface (API) similar to Python pandas for DataFrame operations. The mortgage loan dataset provided in this handy RAPIDS Mortgage Data link has almost two decades of loan performance data with the actual interest rates and borrower characteristics and lender names on record. We have a classic imbalanced class prediction problem with our mortgage loan tabular dataset since only about 4% of all loans are delinquent.

Factorizing for categorical columns

Factors are an important concept in programming languages. Those readers who are familiar with the R language for statistical computing know about factors, created with the factor() function, as a way to categorize columns into a discrete set of values. R, in fact, historically defaulted (before version 4.0) to factorizing string columns upon input, so much so that the user often had to override this in read.csv() with the wordy parameter stringsAsFactors=FALSE. Python users will be happy to know that the Python pandas and RAPIDS packages include a very similar factorize() method as mentioned in this article. For our mortgage dataset, zip codes are a classic column to factorize.

df['Zip'], Zip = df['Zip'].factorize()

A series of these transformational statements is an alternative for one-hot encoded columns, reducing the sparsity and memory required in the transformed data. The advantage of factorizing as opposed to using one-hot encodings is that the data frame does not need to grow wider as the number of column values increases yet we still have the advantages of categorical column variables.
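To show what factorizing does to a column, here is a dependency-free sketch that mirrors the default behavior of pandas' factorize(), which assigns integer codes in order of first appearance. The function below is a stand-in written for illustration, and the zip prefixes are hypothetical.

```python
def factorize(values):
    """Return (codes, uniques): integer codes in order of first appearance."""
    uniques = []
    index = {}
    codes = []
    for value in values:
        if value not in index:
            index[value] = len(uniques)
            uniques.append(value)
        codes.append(index[value])
    return codes, uniques

zips = ["941", "100", "941", "606", "100"]  # hypothetical 3-digit zip prefixes
codes, uniques = factorize(zips)
print(codes, uniques)

# One-hot encoding needs one column per distinct value; factorizing needs one.
one_hot = [[int(code == k) for k in range(len(uniques))] for code in codes]
print(len(one_hot[0]), "one-hot columns versus 1 factorized column")
```

With hundreds of distinct zip codes, the one-hot version would add hundreds of mostly-zero columns, while the factorized column stays a single integer column.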

XGBoost classifier tuning

When using decision trees, one obtains the benefit of feature importance. Feature importance is reported to help explain the features used most to make decisions. A feature importance report such as Figure 1 is one of the artifacts that propelled decision trees into a popular classification approach. Decision tree nodes correspond to a set of training dataset rows. Initially we begin with a single node to represent all training rows. Node purity refers to the dataset rows being similar. Node impurity is much more common when we start the process for decision tree training and purity becomes more common as we expand the tree while scanning the dataset. Feature importance is listed in decreasing node impurity, weighted by the chance of reaching that node. The most effective nodes are those that cause the best reduction in impurity and also represent the largest number of samples in the data population.

With tree-based models such as the XGBoost classifier, each tree expands through splits from the initial single node to hundreds of nodes, and each split is chosen to reduce node impurity and thereby gain accuracy. We will discuss more about explainability soon.
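The idea of impurity reduction weighted by the chance of reaching a node can be sketched in a few lines. The example uses Gini impurity, one common impurity measure for classical decision trees; it is an illustrative assumption, since XGBoost reports importance based on its own loss-derived gain rather than on Gini. The node contents are hypothetical.

```python
def gini(labels):
    """Gini impurity of a node holding binary labels (1 = delinquent)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def weighted_impurity_decrease(parent, left, right, n_total):
    """Impurity reduction of a split, weighted by the share of rows reaching the node."""
    n = len(parent)
    children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return (n / n_total) * (gini(parent) - children)

parent = [0, 0, 0, 0, 0, 0, 1, 1]           # hypothetical node: 8 loans, 2 delinquent
left, right = [0, 0, 0, 0, 0, 0], [1, 1]    # a split that separates them perfectly
gain = weighted_impurity_decrease(parent, left, right, n_total=8)
print(gain)
```

Summing this quantity over every node that splits on a given feature gives an impurity-based importance for that feature.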

A horizontal bar graph shows the feature names in descending order of importance: OrUnpaidPrinc, Zip, CreditScore, DTIRat, OrCLTV, OrInterestRate, CoCreditScore, PropertyState, SellerName, LoanPurpose.
Figure 1: Feature importance as reported by the XGBoost classifier. The column names of the feature are listed preceding the plot.

An XGBoost classifier was tuned as part of a Python Jupyter notebook to examine the predictability of loan delinquencies. The inspiration for this work was the article by DeGrave. We focused on the XGBoost classifier in the earlier article and are able to report an improvement in the precision and recall here with factorization. Given a dataset with stimulus variables and the Default output variable, there is a limit to the predictability in the rows of data. Our results in Figure 2 are from a run of 11.2 million individual mortgages from 2007 to 2012 with a subset of 1.1 million loans residing in the test set. Using a customized threshold on the emitted probability of default helped to balance the precision and recall more evenly than using the standard value of 0.5. We show the code sequence below with our best parameters. The code for the XGBoost and PyTorch classifiers explanations is available at along with instructions on how to download the mortgage loan dataset.

import xgboost as xgb

# scaling = num_negative_samples / num_positive_samples on the training labels
params = {
    'num_rounds':        100,
    'max_depth':         12,
    'max_leaves':        0,
    'alpha':             3,
    'lambda':            1,
    'eta':               0.17,
    'subsample':         1,
    'sampling_method':   'gradient_based',
    'scale_pos_weight':  scaling,
    'max_delta_step':    1,
    'max_bin':           2048,
    'tree_method':       'gpu_hist',
    'grow_policy':       'lossguide',
    'n_gpus':            1,
    'objective':         'binary:logistic',
    'eval_metric':       'aucpr',
    'predictor':         'gpu_predictor',
    'num_parallel_tree': 1,
    'min_child_weight':  2,
    'verbose':           True
}
if use_cpu:
    # Fall back to the CPU histogram algorithm and CPU-compatible options
    print('training XGBoost model on cpu')
    params['tree_method'] = 'hist'
    params['sampling_method'] = 'uniform'
    params['predictor'] = 'cpu_predictor'
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)
# Report aucpr on both sets after each boosting round
evals = [(dtest, 'test'), (dtrain, 'train')]
model = xgb.train(params, dtrain, params['num_rounds'], evals=evals)

As the preceding code shows, the objective for the XGBoost training step is binary:logistic and the evaluation metric is the area under the precision-recall curve, known as aucpr. After the model was trained, the threshold corresponding to the maximum F1 score was calculated on the training set. This threshold was applied to the predictions on the test set with the results shown in Figure 2 and is indicated as the red dot in the precision-recall curve in Figure 3.
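Choosing the threshold that maximizes F1 on the training set can be sketched without any libraries. This is a minimal illustration of the idea rather than the notebook's actual code; the function names and the ten labeled probabilities are hypothetical.

```python
def f1_at_threshold(y_true, y_prob, threshold):
    """F1 score when predicting positive for probabilities >= threshold."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    fn = sum(1 for y, p in zip(y_true, y_prob) if p < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_f1_threshold(y_true, y_prob):
    """Scan every distinct predicted probability and keep the best-scoring cut."""
    candidates = sorted(set(y_prob))
    return max(candidates, key=lambda t: f1_at_threshold(y_true, y_prob, t))

# Hypothetical training labels and predicted default probabilities.
y_train = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
p_train = [0.05, 0.10, 0.12, 0.20, 0.22, 0.30, 0.35, 0.40, 0.60, 0.80]

threshold = best_f1_threshold(y_train, p_train)
best_f1 = f1_at_threshold(y_train, p_train, threshold)
print(threshold, round(best_f1, 3))
```

On an imbalanced dataset the selected cut is usually well below 0.5, as it is here, because lowering the threshold recovers more of the rare positive class.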

The sklearn package’s classification report is displayed with positive case precision of 0.36 and recall of 0.48. The test case two by two confusion matrix is displayed.
Figure 2: Reporting precision and recall on the test set for the 11.2 million total mortgage records in the training and test sets. 1.1 million loans are included in the test set.
On the left plot, the Recall compared to Precision is reported with an approximate diagonal line from (.01,.06) to (1.0,0.0). On the right, the False Positive Rate versus True Positive rate curve is present, which curls from the (0,0) point upward to the (.20,.88) then rightward to the (1,1) point.
Figure 3: The XGBoost Machine Learning Precision-Recall curve, left, and the Receiver Operating Characteristic curve, right, reflect the imbalanced nature of the dataset. The Precision-Recall curve has an area of 0.36, which compares favorably with the Bank of England 816 Paper Precision-Recall curve. A red point indicates the threshold used to obtain the maximum F1 score for this model.

Speed up PyTorch Deep Learning training with NVTabular

The NVIDIA NVTabular Python package is a feature engineering and preprocessing library for tabular data that is designed to quickly and easily manipulate terabyte scale datasets and train deep learning (DL) based recommender systems. It can be installed using Anaconda or Docker or using pip with the nvtabular keyword. In our case, we simply use it to feed data into our PyTorch classifier during training. We also compared the run time of a plain PyTorch Dataloader with that of NVTabular’s Asynchronous PyTorch Dataloader. We found for our mortgage dataset that NVTabular delivers a 6-fold speedup, with both runs completed on the same GPU. See Figure 4 and this article for more details.

Two bars in a bar plot with one showing the time of 2567.3 seconds on PyTorch on GPU and a reduced time of 382.2 seconds with PyTorch using the NVTabular API on GPU.
Figure 4: NVTabular 6 X acceleration. Both PyTorch training loops are run on a GPU.

For simplicity, we opted for a 5-layer Multi-Layer Perceptron (MLP) neural network in which each layer combines a 512-neuron Linear layer, PReLU, Batch Normalization, and Dropout. The simple MLP is able to match the XGBoost model’s performance on the test set. A more sophisticated model may exceed this performance. The same method was applied to find the threshold yielding the maximum F1 score on the train set before applying this threshold to the test set. The classification report and confusion matrix are described below and a similar PR curve and ROC curves are presented in Figure 5.

On the left plot, the Recall compared to Precision is reported with an approximate diagonal line from (.01,.06) to (1.0,0.0). On the right, the False Positive Rate versus True Positive rate curve is present, which curls from the (0,0) point upward to the (.20,.88) then rightward to the (1,1) point.
Figure 5: The PyTorch Deep Learning Precision-Recall curve, left, and the Receiver Operating Characteristics curve, right, reflect the imbalanced nature of the dataset. The Precision Recall curve has an area of 0.35, which compares favorably with the Bank of England 816 Paper Precision-Recall curve.

Explainability for Machine Learning and Deep Learning

Now that we are confident of our model for predictions, it is important to understand more about how and why it works. Shapley values can be computed with SHAP and Captum for both ML and DL models. For the SHAP package, it is easy to retrieve Shapley values for explaining our XGBoost ML model per the code snippet below:

 expl = shap.TreeExplainer(model)
 shap_values = expl.shap_values(X_test)
 shap.summary_plot(shap_values, X_test.to_pandas(), sort=False, show=False)

The PyTorch DL Shapley values were calculated using the Captum GradientShap method and plotted using the code below, passing the Shapley values into the SHAP summary_plot() method. We separated out positive and negative categorical and continuous variables to enable visualization of one distinct class only or for both classes, which are depicted in Figure 6.

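import cupy as cp
import shap
import torch
from captum.attr import GradientShap
from cudf import DataFrame  # assumption: the DataFrame used here is RAPIDS cuDF

gradshap = GradientShap(model)
# Attribute both classes at once: stack positive and negative rows for each input
attr_gs, delta = gradshap.attribute(
    (torch.cat([pos_cats, neg_cats], dim=0),
     torch.cat([pos_conts, neg_conts], dim=0)),
    baselines=(torch.zeros_like(neg_cats, device=device),
               torch.zeros_like(neg_conts, device=device)),
    n_samples=200,
    return_convergence_delta=True)  # assumption: needed for the (attr, delta) unpacking
# Feature values in the same row and column order as the attributions
df = DataFrame(cp.asarray(torch.cat([torch.cat([pos_cats, pos_conts], dim=1),
                                     torch.cat([neg_cats, neg_conts], dim=1)], dim=0)))
svals = cp.asnumpy(torch.cat(attr_gs, dim=1))
# Assumption: leading arguments mirror the XGBoost summary_plot call above
shap.summary_plot(svals, df.to_pandas(), sort=False, show=False)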

In general, we would like to explain a single prediction and interpret how the features led to that prediction. The Shapley feature explanations sum up to the prediction for a single row, and we can aggregate across rows to explain model predictions in bulk. For regulatory purposes, this means that a model can provide a human-interpretable explanation for any output, favorable or adverse. Anyone who holds a mortgage or works in the debt instrument domain can recognize these familiar factors.
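The additivity property can be checked on a toy model by computing exact Shapley values by brute force. This sketch is not how SHAP or Captum compute attributions (they use much faster approximations); the scoring function, the loan, and the baseline are hypothetical. It shows that the per-feature values sum to the difference between the prediction for the row and the prediction for a reference baseline.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, row, baseline):
    """Exact Shapley values for one row; absent features take baseline values."""
    n = len(row)

    def value(subset):
        mixed = [row[i] if i in subset else baseline[i] for i in range(n)]
        return predict(mixed)

    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi += weight * (value(s | {i}) - value(s))
        phis.append(phi)
    return phis

# Hypothetical toy risk score with an interaction term (not the article's model).
def toy_score(features):
    credit_score, interest_rate, dti = features
    return (0.5 - 0.001 * (credit_score - 700)
            + 0.05 * interest_rate + 0.002 * dti * interest_rate)

row = [640, 6.5, 45]        # loan being explained
baseline = [700, 4.0, 30]   # reference loan
phis = shapley_values(toy_score, row, baseline)
print(phis, sum(phis), toy_score(row) - toy_score(baseline))
```

Here the below-baseline credit score receives a positive value, meaning it pushes the score toward delinquency, which matches the reading of the summary plots.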

The SHAP values are x-axis labels on the two x-y plots and feature values are labels on the y-axis. 20 horizontally oriented shapes represent the SHAP values.
Figure 6: Using the Shapley Algorithm to measure the impact and direction of a feature. Red means the feature value is higher and blue means the feature value is lower. A more positive SHAP value indicates a higher contribution to the positive class (loan delinquency) and vice versa. Feature names typically appear on the left-hand side.

Figure 6 depicts both ML and DL Shapley values side by side. We can interpret Figure 6 in the following manner by considering the CreditScore and Interest Rate (OrInterestRate) features as an example. The red portion of the CreditScore feature indicates higher credit scores, as shown in the legend on the right of the plot, where a higher feature value is red and a lower feature value is blue. For CreditScore, the red points cluster on the negative side of the x-axis, corresponding to negative SHAP values that contribute to the negative or non-delinquent class, suggesting that people with high credit scores are less likely to be delinquent. Symmetrically, blue (low) values for CreditScore are on the positive x-axis or positive Shapley values, indicating a contribution to the positive or delinquent class.

A similar yet opposite interpretation can be applied with the OrInterestRate feature: low (blue) interest rates yield negative Shapley values and are associated with lower delinquency rates and this makes intuitive sense as lower rates mean lower mortgage payments. Some features may be less clear and provide an opportunity for the Data Scientist or Machine Learning engineer to improve a model. For example, in our simple MLP model, we concatenated the factorized categorical features with the continuous features before passing into the MLP. An improvement to this model may be to use categorical embeddings, which could both improve model performance and enhance explainability as well. In this way, a Data Scientist or Machine Learning Engineer can try to optimize both model explainability and performance.

GPU-Acceleration results

Figures 7 and 8 compare the time required to read the input datasets, merge the two datasets, and run the DL inference step on an NVIDIA Ampere A100 GPU against a dual Ice Lake 24-core CPU. As Table 1 shows, there are solid speedups in every step.

Three bar charts depicting the values in rows 1 through 3 of Table 1.
Figure 7: The relative run-time latency of the key RAPIDS Python steps, which show a speedup of 6X to 38X for the compute-intensive steps taking more than 1 second.
Five double-bar charts depicting the values in rows 4 through 8 of Table 1.
Figure 8: The relative run-time latency of the key Python steps, which show a speedup of 2X to 29X for the compute-intensive steps.

Table 1 quantifies the speedups illustrated in Figures 7 and 8, and underscores the benefit of GPU-acceleration.

Step                           CPU only       NVIDIA Ampere A100 40GB GPU   Speedup factor
Read CSV files                 587.0 sec      15.3 sec                      38X
merge                          4.6 sec        0.04 sec                      115X
PyTorch inference              27.7 sec       4.3 sec                       6X
XGBoost train                  134.0 sec      10.7 sec                      12.5X
XGBoost shap                   265.0 sec      9.2 sec                       29X
PyTorch train                  13314.6 sec    2567.3 sec                    5X
PyTorch train with NVTabular   NA             382.2 sec                     6X over PyTorch train w/GPU
Captum GradientShap            289.1 sec      166.8 sec                     2X
Table 1: The relative compute latency of various steps of the data transformation or ETL process and the training and inference steps for the 11.2 million loan dataset.
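The speedup factors in Table 1 are the CPU time divided by the GPU time for each step. As a quick check, the short sketch below recomputes them from the timings reported in the table. The dictionary layout and the `speedup` helper are illustrative choices, not part of the original workflow, and the table reports rounded values.

```python
# CPU and GPU timings in seconds, copied from Table 1.
timings = {
    "Read CSV files": (587.0, 15.3),
    "merge": (4.6, 0.04),
    "PyTorch inference": (27.7, 4.3),
    "XGBoost train": (134.0, 10.7),
    "XGBoost shap": (265.0, 9.2),
    "PyTorch train": (13314.6, 2567.3),
    "Captum GradientShap": (289.1, 166.8),
}


def speedup(baseline_seconds, accelerated_seconds):
    """Return how many times faster the accelerated run is than the baseline."""
    return baseline_seconds / accelerated_seconds


speedups = {step: speedup(cpu, gpu) for step, (cpu, gpu) in timings.items()}

for step, factor in speedups.items():
    print(f"{step:<22} {factor:6.1f}X")

# The NVTabular row has no CPU baseline; Table 1 compares it against the
# PyTorch training time on the GPU instead.
nvtabular_speedup = speedup(2567.3, 382.2)
print(f"{'NVTabular vs GPU':<22} {nvtabular_speedup:6.1f}X")
```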

In this post, we expanded on a related earlier post about credit default risk prediction with deep learning and discussed:

  • How to use RAPIDS to GPU-accelerate the complete default analytics workflow
  • How to apply the XGBoost implementation inside of RAPIDS with GPUs
  • How to apply the PyTorch Deep Learning library to tabular data with GPUs
  • How to use the NVIDIA NVTabular package for PyTorch DL on GPU to obtain 6x faster run-time performance simply by changing the Data Loader
  • How to access explainable predictions with the SHAP and Captum packages on GPUs and use these explainable results for further model improvements

We recommend the following steps:

  • Visit the NVIDIA GPU Cloud site for its repository of containers that help build AI solutions.
  • Review or attend the most recent NVIDIA GPU Technology Conference (GTC) to share ideas and technology solutions.
  • For more information, e-mail one of the authors listed.

Get Started on DOCA for DPUs with a Free Introductory Course

Following the announcement of Early Access to the NVIDIA DOCA Software Framework at this year’s GTC, held in November, we launched a self-paced DOCA course to help you start working with this new framework. The NVIDIA Deep Learning Institute (DLI) is offering a free self-paced course titled “Introduction to DOCA for DPUs.” In this 2-hour introductory course, you will learn how DOCA and DPUs enable the development of applications that accelerate data center services. This highly anticipated training covers the essentials of the DOCA platform.

A new pillar of computing

Over the past decade, computing has broken out of the confines of PCs and servers into hyperscale data centers. With this paradigm shift comes Data Processing Units (DPUs), a new class of programmable processors that will join CPUs and GPUs as one of the three pillars of computing. DPUs are designed to offload virtualized data center infrastructure workloads, such as networking, security, and storage, from the CPU. In doing so, they meaningfully reduce overhead on the server CPU so that it can focus on its primary application workload. We’re very encouraged by the prospect and are excited to foster this transformation of data center computing.

What is DOCA?

DOCA is the DPU-enablement platform of software (with libraries, drivers, runtimes, etc.) that abstracts the low-level programming requirements. It is the key to unlocking the potential of DPUs to offload, accelerate, and isolate data center workloads. With DOCA, developers can create applications that address the increasing performance, security, and reliability demands of modern data centers. 

What will you learn?

In this course, you will start by getting an overview of DOCA and its features that will shape how you think about the platform and its capabilities. Through live demonstrations, you will also get familiar with the BlueField DPU hardware and the many ways to interact with it in different DOCA development environments. By the end of the course, you will be exposed to numerous examples of how DPU-acceleration can be realized.

By participating in this course, you will learn:

  • Basic concepts of DOCA as a platform for accelerated data center computing on DPUs
  • The DOCA framework paradigm
  • BlueField DPU specifications and capabilities
  • Sample DOCA applications under different configurations
  • Opportunities to apply DPU accelerated computation

This course contains everything you need to begin working with DOCA. Upon completion, you will also have an opportunity to earn an NVIDIA DLI certificate to demonstrate competency in DOCA. Since one of the main focuses of DOCA is backward compatibility, you can be confident that the investment in development done today will continue to provide performance benefits as DPU hardware generations move forward. It’s never too early to get started!


struggling to install it

I used pip install tensorflow

but got this error

ERROR: Could not find a version that satisfies the requirement tensorflow (from versions: none)
ERROR: No matching distribution found for tensorflow

I am using PyCharm and Python 3.10.1 and I am not sure what to do

submitted by /u/MainDepth


GPU not recognized after it’s been working for a while?

So I was using my GPU for a while and I’m not sure when it started but now when I check I realize my GPU is not being picked up. I’m definitely sure I was on GPU before and I went through the pain of installing it so I know most of my setup should be right…

  • standard tensorflow uninstalled, only tensorflow-gpu installed
  • C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.1\bin is on my path for deps like cusolver64_10.dll
  • C:\tools\cuda\bin is also on my path for cudnn64_7.dll

Any ideas why I might be having issues?

submitted by /u/Yogi_DMT


Using TF timeseries_dataset_from_array with more samples

I have to handle a huge number of samples, where each sample contains a unique time series. The goal is to feed this data into a TensorFlow LSTM model and predict some features. I have used the tf timeseries_dataset_from_array generator function to feed the data to the TF model, but I haven’t figured out how to create a generator function when I have multiple samples. If I use the usual pipeline, timeseries_dataset_from_array produces windows that overlap the time series of two individual samples.

Does anyone have an idea how to effectively pass a time series of multiple samples to the TF model?

E.g. the Human Activity Recognition Dataset is one such dataset where each person has a separate, long time series, and each user’s time series can be further parsed with a sliding/rolling-window function like timeseries_dataset_from_array.

Here is a simpler example:

I want to use timeseries_dataset_from_array to generate samples for the TF model. Example: sample 1 starts where column 0 has 0, sample 2 starts where column 0 has 100.

I want to get 3D data (samples, timesteps, features) without overlap. For example (6,2,7). Like this:

Here is the sample code:

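from tensorflow.keras.preprocessing import timeseries_dataset_from_array
import numpy as np

# One sample: 5 time steps x 7 features; column 0 identifies the sample.
x = np.array([[0, 1, 2, 3, 4, 5, 6],
              [0, 11, 12, 13, 14, 15, 16],
              [0, 21, 22, 23, 24, 25, 26],
              [0, 31, 32, 33, 34, 35, 36],
              [0, 41, 42, 43, 44, 45, 46]])

# Stack two samples along the time axis; the second is offset by 100.
xx = np.concatenate((x, x + 100), axis=0)  # .reshape(2,5,6)

sequence_length = 2
stride = 1
rate = 1

# Windows are taken over the stacked array, so one window spans both samples.
input_dataset = timeseries_dataset_from_array(xx, None, sequence_length, sequence_stride=stride, sampling_rate=rate)
x_test = np.concatenate([batch for batch in input_dataset], axis=0)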
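One possible way to avoid windows that span two samples is to window each sample separately and then stack the results. The following is a minimal NumPy-only sketch under that assumption; it is not a built-in Keras feature, and `windows_per_sample` is a hypothetical helper name. It reuses the toy data from the question.

```python
import numpy as np


def windows_per_sample(samples, sequence_length, stride=1):
    """Slide a window over each sample separately so that no window
    crosses the boundary between two samples."""
    windows = []
    for sample in samples:
        last_start = len(sample) - sequence_length
        for start in range(0, last_start + 1, stride):
            windows.append(sample[start:start + sequence_length])
    # Resulting shape: (num_windows, sequence_length, num_features)
    return np.stack(windows)


# Toy data from the question: 5 time steps x 7 features per sample.
x = np.array([[0, 1, 2, 3, 4, 5, 6],
              [0, 11, 12, 13, 14, 15, 16],
              [0, 21, 22, 23, 24, 25, 26],
              [0, 31, 32, 33, 34, 35, 36],
              [0, 41, 42, 43, 44, 45, 46]])

# Keep the samples as separate arrays instead of concatenating them first.
samples = [x, x + 100]

x_windows = windows_per_sample(samples, sequence_length=2, stride=1)
print(x_windows.shape)
```

The resulting array could then be passed to the model directly or wrapped with tf.data.Dataset.from_tensor_slices. For data too large to hold in memory, the same per-sample idea could be applied by building one windowed dataset per sample and chaining them, though that variant is not shown here.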

submitted by /u/korosig


Silicon Express Lanes: AI, GPUs Pave Fast Routes for Chip Designers

AI can design chips no human could, said Bill Dally in a virtual keynote today at the Design Automation Conference (DAC), one of the world’s largest gatherings of semiconductor engineers. The chief scientist of NVIDIA discussed research in accelerated computing and machine learning that’s making chips smaller, faster and better. “Our work shows you can Read article >

The post Silicon Express Lanes: AI, GPUs Pave Fast Routes for Chip Designers appeared first on The Official NVIDIA Blog.


GPU Operator 1.9 Adds Support for DGX A100 with DGX OS

GPU Operator 1.9 includes support for NVIDIA DGX A100 systems with DGX OS and streamlined installation processes.

NVIDIA GPU Operator allows organizations to easily scale NVIDIA GPUs on Kubernetes. 

By simplifying the deployment and management of GPUs with Kubernetes, the GPU Operator enables infrastructure teams to scale GPU applications error-free, within minutes, automatically. 

GPU Operator 1.9 is now available and includes several key features, among other updates, that allow users to get started faster and maintain uninterrupted service. 

GPU Operator 1.9 includes:

  • Support for NVIDIA DGX A100 systems with DGX OS
  • Streamlined installation process

Support for DGX A100 with DGX OS

With 1.9, the GPU Operator automatically deploys the software required for initializing the fabric on NVIDIA NVSwitch systems, including the DGX A100 when used with DGX OS. Once initialized, all GPUs can communicate with one another at full NVLink bandwidth to create an end-to-end scalable computing platform. 

The DGX A100 features the world’s most advanced accelerator, enabling enterprises to consolidate training, inference, and analytics into a unified, easy-to-deploy AI infrastructure. And now, with GPU Operator support, organizations can take their applications from training to scale with the world’s most advanced systems.  

Streamlined installation process

With previous versions of GPU Operator, organizations using GPU Operator with OpenShift needed to apply additional entitlements from Red Hat in order to successfully use the GPU Operator. As entitlement keys expired, users would need to re-apply them to ensure that their workflow was not interrupted. 

GPU Operator 1.9 now supports entitlement-free driver containers for OpenShift. This is done by leveraging Driver-Toolkit images provided by Red Hat with the necessary kernel packages preinstalled for building NVIDIA kernel modules. Users no longer need to ensure that valid certificates with an RHEL subscription are always applied for running GPU Operator. More importantly, for disconnected clusters, it eliminates dependencies on private package repositories.

Version 1.9 also includes support for preinstalled drivers with the MIG Manager, support for preinstalled MOFED to use GPUDirect RDMA, automatic detection of the container runtime, and automatic disabling of Nouveau, all designed to make it easier for users to get started with and continue running GPU-accelerated Kubernetes.

Additionally, GPU Operator 1.9 automatically detects the container runtime installed on the worker node. There is no need to specify the container runtime at install time.

GPU Operator 1.9:

helm install --wait --generate-name nvidia/gpu-operator 

GPU Operator 1.8 and earlier:

helm install --wait --generate-name nvidia/gpu-operator --set operator.defaultRuntime=containerd
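For automation scripts that have to handle both cases, the difference between the two commands above can be captured in a small helper. This is a minimal sketch only: `helm_install_args` is a hypothetical function, the version is assumed to be passed as a (major, minor) tuple, and the only flags used are the ones shown in the two commands above.

```python
def helm_install_args(operator_version, runtime="containerd"):
    """Build the `helm install` argument list for a GPU Operator version.

    operator_version is a (major, minor) tuple, for example (1, 9).
    Versions before 1.9 need the container runtime set explicitly;
    1.9 detects it automatically.
    """
    args = ["helm", "install", "--wait", "--generate-name", "nvidia/gpu-operator"]
    if tuple(operator_version) < (1, 9):
        args += ["--set", f"operator.defaultRuntime={runtime}"]
    return args


print(" ".join(helm_install_args((1, 9))))
print(" ".join(helm_install_args((1, 8))))
```

The returned list is in the form expected by subprocess.run, so no shell quoting is needed if the command is executed from Python.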

GPU Operator requires Nouveau to be disabled. With previous GPU Operator versions, the K8s admin had to disable Nouveau as documented here. GPU Operator 1.9 automatically detects if Nouveau is enabled and disables it for you.

GPU Operator Resources

The following resources are available for using NVIDIA GPU Operator: 

The NVIDIA GPU Operator is a key component to many edge computing solutions. Learn more about NVIDIA solutions for edge computing.