Categories
Misc

NVIDIA CUDA Toolkit Symbol Server


NVIDIA has already made available a GPU driver binary symbols server for Windows. Now, NVIDIA is making available a repository of CUDA Toolkit symbols for Linux.

What are we providing?

NVIDIA is introducing CUDA Toolkit symbols for Linux to enhance application development. During application development, you can now download obfuscated symbols for NVIDIA libraries that are being debugged or profiled in your application. This is shipping initially for the CUDA Driver (libcuda.so) and the CUDA Runtime (libcudart.so), with more libraries to be added.

For instance, when an issue appears to relate to a CUDA API, it may not always be possible to provide NVIDIA with a reproducing example, core dump, or unsymbolized stack traces with all library load information. Providing a symbolized call stack can help speed up the debug process.

We are only hosting symbol files, so debug data will not be distributed. The symbol files contain obfuscated symbol names.

Quickstart guide

There are two recommended ways to use the obfuscated symbols for each library:

  • By unstripping the library
  • By deploying the .sym file as a symbol file for the library
# Determine the symbol file to fetch and obtain it
$ readelf -n /usr/local/cuda/lib64/libcudart.so

# ... Build ID: 70f26eb93e24216ffc0e93ccd8da31612d277030
# Browse to https://cudatoolkit-symbols.nvidia.com/libcudart.so/70f26eb93e24216ffc0e93ccd8da31612d277030/index.html to determine filename to download
$ wget https://cudatoolkit-symbols.nvidia.com/libcudart.so/70f26eb93e24216ffc0e93ccd8da31612d277030/libcudart.so.12.2.128.sym

# Then with appropriate permissions, either unstrip,
$ eu-unstrip /usr/local/cuda-12.2/targets/x86_64-linux/lib/libcudart.so.12.2.128 libcudart.so.12.2.128.sym -o /usr/local/cuda-12.2/targets/x86_64-linux/lib/libcudart.so.12.2.128

# Or, with appropriate permissions, deploy as symbol file
# By splitting the Build ID into first two characters as directory, then remaining with ".debug" extension
$ cp libcudart.so.12.2.128.sym /usr/lib/debug/.build-id/70/f26eb93e24216ffc0e93ccd8da31612d277030.debug
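
The same flow can be scripted. The following is a minimal Python sketch (not an official NVIDIA tool) that reads the Build ID with readelf, downloads the matching .sym file, and deploys it under /usr/lib/debug/.build-id/. The library name, .sym filename, and paths are assumptions; confirm the exact filename against the index.html listing for your Build ID.

import re
import subprocess
import urllib.request
from pathlib import Path

def get_build_id(lib_path: str) -> str:
    # Extract the GNU Build ID, as shown with readelf -n in the quickstart
    out = subprocess.run(["readelf", "-n", lib_path],
                         capture_output=True, text=True, check=True)
    match = re.search(r"Build ID:\s*([0-9a-f]+)", out.stdout)
    if not match:
        raise RuntimeError(f"No Build ID found in {lib_path}")
    return match.group(1)

def deploy_symbols(lib_name: str, sym_filename: str, lib_path: str) -> None:
    build_id = get_build_id(lib_path)
    url = f"https://cudatoolkit-symbols.nvidia.com/{lib_name}/{build_id}/{sym_filename}"
    # Deploy as a symbol file: the first two hex characters form the directory name
    dest = Path("/usr/lib/debug/.build-id") / build_id[:2] / f"{build_id[2:]}.debug"
    dest.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, str(dest))

# Example (filename taken from the listing above; requires write access to /usr/lib/debug):
# deploy_symbols("libcudart.so", "libcudart.so.12.2.128.sym",
#                "/usr/local/cuda/lib64/libcudart.so.12.2.128")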

Example: Symbolizing

Here is a simplified example to show the uses of symbolizing. The sample application test_shared has a data corruption that leads to an invalid handle being passed to the CUDA Runtime API cudaStreamDestroy. With a default install of CUDA Toolkit and no obfuscated symbols, the output in gdb might look like the following:

Thread 1 "test_shared" received signal SIGSEGV, Segmentation fault.
0x00007ffff65f9468 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
(gdb) bt
#0  0x00007ffff65f9468 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#1  0x00007ffff6657e1f in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#2  0x00007ffff6013845 in ?? () from /usr/local/cuda/lib64/libcudart.so.12
#3  0x00007ffff604e698 in cudaStreamDestroy () from /usr/local/cuda/lib64/libcudart.so.12
#4  0x00005555555554e3 in main ()

After applying the obfuscated symbols in one of the ways described earlier, it would give a stack trace like the following example:

Thread 1 "test_shared" received signal SIGSEGV, Segmentation fault.
0x00007ffff65f9468 in libcuda_8e2eae48ba8eb68460582f76460557784d48a71a () from /lib/x86_64-linux-gnu/libcuda.so.1
(gdb) bt
#0  0x00007ffff65f9468 in libcuda_8e2eae48ba8eb68460582f76460557784d48a71a () from /lib/x86_64-linux-gnu/libcuda.so.1
#1  0x00007ffff6657e1f in libcuda_10c0735c5053f532d0a8bdb0959e754c2e7a4e3d () from /lib/x86_64-linux-gnu/libcuda.so.1
#2  0x00007ffff6013845 in libcudart_43d9a0d553511aed66b6c644856e24b360d81d0c () from /usr/local/cuda/lib64/libcudart.so.12
#3  0x00007ffff604e698 in cudaStreamDestroy () from /usr/local/cuda/lib64/libcudart.so.12
#4  0x00005555555554e3 in main ()

The symbolized call stack can then be documented as part of the bug description provided to NVIDIA for analysis.

Conclusion

When you profile and debug applications using CUDA and want to share a call stack with NVIDIA for analysis, use the CUDA Toolkit symbol server. Profiling and debugging will be faster and easier.

For questions or issues, visit the Developer Tools forum.

Categories
Misc

Unlocking Multi-GPU Model Training with Dask XGBoost


As data scientists, we often face the challenging task of training large models on huge datasets. One commonly used tool, XGBoost, is a robust and efficient gradient-boosting framework that’s been widely adopted due to its speed and performance for large tabular data. 

Using multiple GPUs should theoretically provide a significant boost in computational power, resulting in faster model training. Yet, many users have found it challenging to leverage this power through Dask XGBoost. Dask is a flexible open-source Python library for parallel computing, and XGBoost provides Dask APIs to train on CPU or GPU Dask DataFrames.

A common hurdle in training Dask XGBoost is handling out-of-memory (OOM) errors at different stages, including:

  • Loading the training data
  • Converting the DataFrame into XGBoost’s DMatrix format
  • During the actual model training

Addressing these memory issues can be challenging, but it is very rewarding because the potential benefits of multi-GPU training are significant.

Top takeaways

This post explores how you can optimize Dask XGBoost on multiple GPUs and manage memory errors. Training XGBoost on large datasets presents a variety of challenges. I use the Otto Group Product Classification Challenge dataset to demonstrate the OOM problem and how to fix it. The dataset has 180 million rows and 152 columns, totaling 110 GB when loaded into memory.

The key issues we tackle include: 

  • Installation using the latest version of RAPIDS and the correct version of XGBoost.
  • Setting environment variables.
  • Dealing with OOM errors.
  • Utilizing UCX-py for more speedup.

Be sure to follow along with the accompanying Notebooks for each section.

Prerequisites

An initial step in leveraging the power of RAPIDS for multi-GPU training is the correct installation of RAPIDS libraries. It’s critical to note that there are several ways to install these libraries—pip, conda, docker, and building from source, each compatible with Linux and Windows Subsystem for Linux. 

Each method has unique considerations. For this guide, I recommend using Mamba while adhering to the conda install instructions. Mamba provides similar functionality to conda but is much faster, especially for dependency resolution. Specifically, I opted for a fresh installation of Mamba.

Install the latest RAPIDS version

As a best practice, always install the latest RAPIDS libraries available to use the latest features. You can find up-to-date install instructions in the RAPIDS Installation Guide.

This post uses version 23.04, which can be installed with the following command:

mamba create -n rapids-23.04 -c rapidsai -c conda-forge -c nvidia \
    rapids=23.04 python=3.10 cudatoolkit=11.8

This command installs all the required libraries, including Dask, Dask-cuDF, XGBoost, and more. In particular, you'll want to check the XGBoost build that was installed using the command:

mamba list xgboost

The output is listed in Table 1:

Name     Version                 Build            Channel
XGBoost  1.7.1dev.rapidsai23.04  cuda_11_py310_3  rapidsai-nightly

Table 1. The correct XGBoost install, whose channel should be rapidsai or rapidsai-nightly

Avoid manual updates for XGBoost

Some users might notice that the version of XGBoost is not the latest, which is 1.7.5. Manually updating or installing XGBoost using pip or conda-forge is problematic when training XGBoost together with UCX.

The error message will read something like the following:
Exception: "XGBoostError('[14:14:27] /opt/conda/conda-bld/work/rabit/include/rabit/internal/utils.h:86: Allreduce failed')"

Instead, use the XGBoost installed with RAPIDS. A quick way to verify that the correct XGBoost is installed is to run mamba list xgboost and check the channel of the xgboost package, which should be rapidsai or rapidsai-nightly.

XGBoost in the rapidsai channel is built with the RMM plug-in enabled and delivers the best performance for multi-GPU training.

Multi-GPU training walkthrough 

First, I’ll walk through a multi-GPU training notebook for the Otto dataset and cover the steps to make it work. Later on, we will talk about some advanced optimizations including UCX and spilling. 

You can also find the XGB-186-CLICKS-DASK Notebook on GitHub. Alternatively, we provide a Python script with full command-line configurability.

The main libraries we are going to use are xgboost, dask, dask_cuda, and dask-cudf. 

import os
import dask
import dask_cudf
import xgboost as xgb
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

Environment set up

First, let’s set our environment variables to make sure our GPUs are visible. This example uses eight GPUs, each with 32 GB of memory, which is the minimum requirement to run this notebook without OOM complications. In the Enable memory spilling section below, we discuss techniques to lower this requirement to four GPUs.

GPUs = ','.join([str(i) for i in range(0,8)])
os.environ['CUDA_VISIBLE_DEVICES'] = GPUs

Next, define a helper function to create a local GPU cluster for a single multi-GPU node.

def get_cluster():
    cluster = LocalCUDACluster()
    client = Client(cluster)
    return client

Then, create a Dask client for your computations.

client = get_cluster()

Loading data

Now, let’s load the Otto dataset. Use the dask_cudf read_parquet function, which uses multiple GPUs to read the parquet files into a dask_cudf.DataFrame.

users = dask_cudf.read_parquet('/raid/otto/Otto-Comp/pqs/train_v152_*.pq').persist()

The dataset consists of 152 columns that represent engineered features, providing information about the frequency with which specific product pairs are viewed or purchased together. The objective is to predict which product the user will click next based on their browsing history. The details of this dataset can be found in this writeup.

Even at this early stage, out of memory errors can occur. This issue often arises due to excessively large row groups in parquet files. To resolve this, we recommend rewriting the parquet files with smaller row groups. For a more in-depth explanation, refer to the Parquet Large Row Group Demo Notebook.
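
If you need to rewrite the files yourself, the following is a hedged sketch using pyarrow (the original notebooks may use a different approach); the input filename and the row-group size are assumptions to adapt to your data.

import pyarrow.parquet as pq

# Read one oversized file and rewrite it with smaller row groups
# (hypothetical filename based on the train_v152_*.pq pattern above)
table = pq.read_table("train_v152_0.pq")
pq.write_table(table, "train_v152_0_small_rg.pq",
               row_group_size=1_000_000)  # cap each row group at ~1M rows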

After loading the data, we can check its shape and memory usage.

users.shape[0].compute()
users.memory_usage().sum().compute()/2**30

The ‘clicks’ column is our target; it indicates whether the recommended item was clicked by the user. We ignore the ID columns and use the remaining columns as features.

FEATURES = users.columns[2:]
TARS = ['clicks']
FEATURES = [f for f in FEATURES if f not in TARS]

Next, we create a DaskQuantileDMatrix, which is the input data format for training XGBoost models. DaskQuantileDMatrix is a drop-in replacement for DaskDMatrix when the histogram tree method is used, and it helps reduce overall memory usage.

This step is critical to avoid OOM errors. If we use DaskDMatrix, OOM errors occur even with 16 GPUs. In contrast, DaskQuantileDMatrix enables training XGBoost with eight GPUs or fewer without OOM errors.

dtrain = xgb.dask.DaskQuantileDMatrix(client, users[FEATURES], users['clicks'])

XGBoost model training

We then set our XGBoost model parameters and start the training process. Given the target column ‘clicks’ is binary, we use the binary classification objective. 

xgb_parms = {
    'max_depth': 4,
    'learning_rate': 0.1,
    'subsample': 0.7,
    'colsample_bytree': 0.5,
    'eval_metric': 'map',
    'objective': 'binary:logistic',
    'scale_pos_weight': 8,
    'tree_method': 'gpu_hist',
    'random_state': 42
}

Now, you’re ready to train the XGBoost model using all eight GPUs.  
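
The training cell itself lives in the notebook; a minimal sketch is shown below, assuming 100 boosting rounds and an evaluation on the training set (consistent with the output that follows).

# Minimal sketch of the distributed training call (assumed; see the notebook for the exact cell).
# xgb.dask.train shards the DaskQuantileDMatrix across the Dask workers and trains on all visible GPUs.
output = xgb.dask.train(
    client,
    xgb_parms,
    dtrain,
    num_boost_round=100,
    evals=[(dtrain, 'train')],
)
booster = output['booster']  # trained model; output['history'] holds the evaluation metrics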

Output:

[99] train-map:0.20168
CPU times: user 7.45 s, sys: 1.93 s, total: 9.38 s
Wall time: 1min 10s

That’s it! You’re done with training the XGBoost model using multiple GPUs.

Enable memory spilling

In the previous XGB-186-CLICKS-DASK Notebook, training the XGBoost model on the Otto dataset required a minimum of eight GPUs. Given that this dataset occupies 110GB in memory, and each V100 GPU offers 32GB, the data-to-GPU-memory ratio amounts to a mere 43% (calculated as 110/(32*8)). 

Optimally, we’d halve this by using just four GPUs. Yet, a straightforward reduction of GPUs in our previous setup invariably leads to OOM errors. This issue arises from the creation of temporary variables needed to generate the DaskQuantileDMatrix from the Dask cuDF dataframe and in other steps of training XGBoost. These variables themselves consume a substantial share of the GPU memory. 

Optimize the same GPU resources to train larger datasets

In the XGB-186-CLICKS-DASK-SPILL Notebook, I introduce minor tweaks to the previous setup. By enabling spilling, you can now train on the same dataset using just four GPUs. This technique allows you to train much larger data with the same GPU resources.

Spilling is a technique that automatically moves data out of GPU memory when an operation that would otherwise succeed runs out of memory because other dataframes or series are taking up needed space. It enables out-of-core computations on datasets that don’t fit into memory. RAPIDS cuDF and Dask-cuDF now support spilling from GPU to CPU memory.

Enabling spilling is surprisingly easy: we just need to reconfigure the cluster with two new parameters, device_memory_limit and jit_unspill:

def get_cluster():
    ip = get_ip()
    cluster = LocalCUDACluster(ip=ip,
                               device_memory_limit='10GB',
                               jit_unspill=True)
    client = Client(cluster)
    return client

device_memory_limit='10GB' sets a limit on the amount of GPU memory that can be used by each GPU before spilling is triggered. Our configuration intentionally assigns a device_memory_limit of 10GB, substantially less than the total 32GB of the GPU. This is a deliberate strategy designed to preempt OOM errors during XGBoost training.

It’s also important to understand that memory usage by XGBoost isn’t managed directly by Dask-CUDA or Dask-cuDF. Therefore, to prevent memory overflow, Dask-CUDA and Dask-cuDF need to initiate the spilling process before the memory limit is reached by XGBoost operations.

jit_unspill enables just-in-time unspilling, which means that the cluster will automatically spill data from GPU memory to main memory when GPU memory is running low, and unspill it back just in time for a computation.

And that’s it! The rest of the notebook is identical to the previous notebook. Now it can train with just four GPUs, saving 50% of computing resources.

Refer to the XGB-186-CLICKS-DASK-SPILL Notebook for details.

Use Unified Communication X (UCX) for optimal data transfer

UCX-Py is the Python interface for UCX, a high-performance communication framework that provides optimized data transfer capabilities, which is particularly useful for GPU-to-GPU communication.

To use UCX effectively, we need to set another environment variable, RAPIDS_NO_INITIALIZE:

os.environ["RAPIDS_NO_INITIALIZE"] = "1"

This stops cuDF from running various diagnostics on import, which require the creation of an NVIDIA CUDA context. When running distributed and using UCX, we have to bring up the networking stack before a CUDA context is created. By setting this environment variable, any child processes that import cuDF do not create a CUDA context before UCX has a chance to do so.

Reconfigure the cluster:

def get_cluster():
    ip = get_ip()
    cluster = LocalCUDACluster(ip=ip,
                               device_memory_limit='10GB',
                               jit_unspill=True,
                               protocol="ucx",
                               rmm_pool_size="29GB")
    client = Client(cluster)
    return client

The protocol="ucx" parameter specifies UCX as the communication protocol used for transferring data between the workers in the cluster.

The rmm_pool_size="29GB" parameter sets the size of the RAPIDS Memory Manager (RMM) pool for each worker. RMM allows for efficient use of GPU memory. In this case, the pool size is set to 29GB, which is less than the total GPU memory size of 32GB. This adjustment is crucial because XGBoost creates certain intermediate variables that exist outside the control of the RMM pool.

By simply enabling UCX, we experienced a substantial acceleration in our training times—a significant speed boost of 20% with spilling, and an impressive 40.7% speedup when spilling was not needed. Refer to the XGB-186-CLICKS-DASK-UCX-SPILL Notebook for details.

Configure local_directory

There are times when warning messages emerge, such as "UserWarning: Creating scratch directories is taking a surprisingly long time." This signals that disk performance is becoming a bottleneck.

To circumvent this issue, set the local_directory parameter of Dask-CUDA, which specifies the path on the local machine where temporary files are stored. These temporary files are used during Dask’s spill-to-disk operations.

A recommended practice is to set the local_directory to a location on a fast storage device. For instance, we could set local_directory to /raid/dask_dir if it is on a high-speed local SSD. Making this simple change can significantly reduce the time it takes for scratch directory operations, optimizing your overall workflow. 

The final cluster configuration is as follows:

def get_cluster():
    ip = get_ip()
    cluster = LocalCUDACluster(ip=ip,
                               local_directory='/raid/dask_dir',
                               device_memory_limit='10GB',
                               jit_unspill=True,
                               protocol="ucx",
                               rmm_pool_size="29GB")
    client = Client(cluster)
    return client

Results

As shown in Table 2, the two main optimization techniques are UCX and spilling. We managed to train XGBoost with just four GPUs and 128 GB of total GPU memory. The performance also scales nicely to more GPUs.

          Spilling off               Spilling on
UCX off   135s / 8 GPUs / 256 GB     270s / 4 GPUs / 128 GB
UCX on    80s / 8 GPUs / 256 GB      217s / 4 GPUs / 128 GB

Table 2. Overview of the four combinations of optimizations

In each cell, the numbers represent end-to-end execution time, the minimum number of GPUs required, and the total GPU memory available. All four demos accomplish the same task of loading and training 110 GB of Otto data.

Summary

In conclusion, leveraging Dask and XGBoost with multiple GPUs can be an exciting adventure, despite the occasional bumps like out of memory errors.

You can mitigate these memory challenges and tap into the potential of multi-GPU model training by:

  • Carefully configuring parameters such as row group size in the input parquet files
  • Ensuring the correct installation of RAPIDS and XGBoost
  • Utilizing Dask Quantile DMatrix
  • Enabling spilling

Furthermore, by applying advanced features such as UCX-Py, you can significantly speed up training times.

Categories
Misc

Ask Me Anything: Winning Formula for the Best Multilingual Recommender Systems


On Sept. 13, connect with the winning multilingual recommender systems Kaggle Grandmaster team of KDD’23.

Categories
Misc

Attention, Please: Focus Entertainment Brings Game Pass Titles to GeForce NOW

GeForce NOW brings expanded support for PC Game Pass to members this week. Members can stream eight more games from Microsoft’s subscription service, including four titles from hit publisher Focus Entertainment. Play A Plague Tale: Requiem, Atomic Heart and more from the GeForce NOW library at up to 4K resolution and 120 frames per second.

Categories
Offsites

TSMixer: An all-MLP architecture for time series forecasting

Time series forecasting is critical to various real-world applications, from demand forecasting to pandemic spread prediction. In multivariate time series forecasting (forecasting multiple variates at the same time), existing methods can be split into two categories: univariate models and multivariate models. Univariate models focus on intra-series interactions, that is, temporal patterns that encompass trends and seasonal patterns within a time series with a single variable. Examples of such trends and seasonal patterns might be the way mortgage rates increase due to inflation, and how traffic peaks during rush hour. In addition to intra-series patterns, multivariate models process inter-series features, known as cross-variate information, which is especially useful when one series is an advanced indicator of another series. For example, a rise in body weight may cause an increase in blood pressure, and increasing the price of a product may lead to a decrease in sales. Multivariate models have recently become popular solutions for multivariate forecasting as practitioners believe their capability of handling cross-variate information may lead to better performance.

In recent years, deep learning Transformer-based architectures have become a popular choice for multivariate forecasting models due to their superior performance on sequence tasks. However, advanced multivariate models perform surprisingly worse than simple univariate linear models on commonly-used long-term forecasting benchmarks, such as Electricity Transformer Temperature (ETT), Electricity, Traffic, and Weather. These results raise two questions:

  • Does cross-variate information benefit time series forecasting?
  • When cross-variate information is not beneficial, can multivariate models still perform as well as univariate models?

In “TSMixer: An All-MLP Architecture for Time Series Forecasting”, we analyze the advantages of univariate linear models and reveal their effectiveness. Insights from this analysis lead us to develop Time-Series Mixer (TSMixer), an advanced multivariate model that leverages linear model characteristics and performs well on long-term forecasting benchmarks. To the best of our knowledge, TSMixer is the first multivariate model that performs as well as state-of-the-art univariate models on long-term forecasting benchmarks, where we show that cross-variate information is less beneficial. To demonstrate the importance of cross-variate information, we evaluate a more challenging real-world application, M5. Finally, empirical results show that TSMixer outperforms state-of-the-art models, such as PatchTST, Fedformer, Autoformer, DeepAR and TFT.

TSMixer architecture

A key difference between linear models and Transformers is how they capture temporal patterns. On one hand, linear models apply fixed and time-step-dependent weights to capture static temporal patterns, and are unable to process cross-variate information. On the other hand, Transformers use attention mechanisms that apply dynamic and data-dependent weights at each time step, capturing dynamic temporal patterns and enabling them to process cross-variate information.

In our analysis, we show that under common assumptions of temporal patterns, linear models have naïve solutions to perfectly recover the time series or place bounds on the error, which means they are great solutions for learning static temporal patterns of univariate time series more effectively. In contrast, it is non-trivial to find similar solutions for attention mechanisms, as the weights applied to each time step are dynamic. Consequently, we develop a new architecture by replacing Transformer attention layers with linear layers. The resulting TSMixer model, which is similar to the computer vision MLP-Mixer method, alternates between applications of the multi-layer perceptron in different directions, which we call time-mixing and feature-mixing, respectively. The TSMixer architecture efficiently captures both temporal patterns and cross-variate information, as shown in the figure below. The residual designs ensure that TSMixer retains the capacity of temporal linear models while still being able to exploit cross-variate information.

Transformer block and TSMixer block architectures. TSMixer replaces the multi-head attention layer with time-mixing, a linear model applied on the time dimension.

Comparison between data-dependent (attention mechanisms) and time-step-dependent (linear models). This is an example of forecasting the next time step by learning the weights of the previous three time steps.
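
To make the block structure concrete, here is a minimal PyTorch sketch of one TSMixer block, an illustration of the description above rather than the official implementation; the hidden dimension, normalization placement, and layer sizes are assumptions.

import torch
import torch.nn as nn

class TSMixerBlock(nn.Module):
    """One mixer block: time-mixing then feature-mixing, each with a residual connection."""
    def __init__(self, time_steps: int, num_features: int, hidden_dim: int = 64):
        super().__init__()
        self.time_norm = nn.LayerNorm(num_features)
        # Time-mixing: a linear layer applied along the time dimension, shared across features
        self.time_mix = nn.Linear(time_steps, time_steps)
        self.feat_norm = nn.LayerNorm(num_features)
        # Feature-mixing: an MLP applied along the feature dimension, shared across time steps
        self.feat_mix = nn.Sequential(
            nn.Linear(num_features, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_features),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, time_steps, num_features)
        y = self.time_norm(x).transpose(1, 2)        # (batch, features, time)
        x = x + self.time_mix(y).transpose(1, 2)     # time-mixing + residual
        x = x + self.feat_mix(self.feat_norm(x))     # feature-mixing + residual
        return x

# Example: a batch of 32 windows, each 96 time steps long with 7 variates
block = TSMixerBlock(time_steps=96, num_features=7)
out = block(torch.randn(32, 96, 7))                  # output shape: (32, 96, 7)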

Evaluation on long-term forecasting benchmarks

We evaluate TSMixer using seven popular long-term forecasting datasets (ETTm1, ETTm2, ETTh1, ETTh2, Electricity, Traffic, and Weather), where recent research has shown that univariate linear models outperform advanced multivariate models with large margins. We compare TSMixer with state-of-the-art multivariate models (TFT, FEDformer, Autoformer, Informer), and univariate models, including linear models and PatchTST. The figure below shows the average improvement of mean squared error (MSE) by TSMixer compared with others. The average is calculated across datasets and multiple forecasting horizons. We demonstrate that TSMixer significantly outperforms other multivariate models and performs on par with state-of-the-art univariate models. These results show that multivariate models are capable of performing as well as univariate models.

The average MSE improvement of TSMixer compared with other baselines. The red bars show multivariate methods and the blue bars show univariate methods. TSMixer achieves significant improvement over other multivariate models and achieves comparable results to univariate models.

Ablation study

We performed an ablation study to compare TSMixer with TMix-Only, a TSMixer variant that consists of time mixing layers only. The results show that TMix-Only performs almost the same as TSMixer, which means the additional feature mixing layers do not improve the performance and confirms that cross-variate information is less beneficial on popular benchmarks. The results validate the superior univariate model performance shown in previous research. However, existing long-term forecasting benchmarks are not well representative of the need for cross-variate information in some real-world applications where time series may be intermittent or sparse, hence temporal patterns may not be sufficient for forecasting. Therefore, it may be inappropriate to evaluate multivariate forecasting models solely on these benchmarks.

Evaluation on M5: Effectiveness of cross-variate information

To further demonstrate the benefit of multivariate models, we evaluate TSMixer on the challenging M5 benchmark, a large-scale retail dataset containing crucial cross-variate interactions. M5 contains the information of 30,490 products collected over 5 years. Each product description includes time series data, like daily sales, sell price, promotional event information, and static (non-time-series) features, such as store location and product category. The goal is to forecast the daily sales of each product for the next 28 days, evaluated using the weighted root mean square scaled error (WRMSSE) from the M5 competition. The complicated nature of retail makes it more challenging to forecast solely using univariate models that focus on temporal patterns, so multivariate models with cross-variate information and even auxiliary features are more essential.

First, we compare TSMixer to other methods, considering only the historical data, such as daily sales and historical sell prices. The results show that multivariate models outperform univariate models significantly, indicating the usefulness of cross-variate information. Among all compared methods, TSMixer effectively leverages the cross-variate information and achieves the best performance.

Additionally, to leverage more information, such as static features (e.g., store location, product category) and future time series (e.g., a promotional event scheduled in coming days) provided in M5, we propose a principled design to extend TSMixer. The extended TSMixer aligns different types of features into the same length, and then applies multiple mixing layers to the concatenated features to make predictions. The extended TSMixer architecture outperforms models popular in industrial applications, including DeepAR and TFT, showcasing its strong potential for real-world impact.

The architecture of the extended TSMixer. In the first stage (align stage), it aligns the different types of features into the same length before concatenating them. In the second stage (mixing stage) it applies multiple mixing layers conditioned with static features.

The WRMSSE on M5. The first three methods (blue) are univariate models. The middle three methods (orange) are multivariate models that consider only historical features. The last three methods (red) are multivariate models that consider historical, future, and static features.

Conclusion

We present TSMixer, an advanced multivariate model that leverages linear model characteristics and performs as well as state-of-the-art univariate models on long-term forecasting benchmarks. TSMixer creates new possibilities for the development of time series forecasting architectures by providing insights into the importance of cross-variate and auxiliary information in real-world scenarios. The empirical results highlight the need to consider more realistic benchmarks for multivariate forecasting models in future research. We hope that this work will inspire further exploration in the field of time series forecasting, and lead to the development of more powerful and effective models that can be applied to real-world applications.

Acknowledgements

This research was conducted by Si-An Chen, Chun-Liang Li, Nate Yoder, Sercan O. Arik, and Tomas Pfister.

Categories
Misc

Supercharge Ransomware Detection with AI-Enhanced Cybersecurity Solutions


Ransomware attacks have become increasingly popular, more sophisticated, and harder to detect. For example, in 2022, a destructive ransomware attack took 233 days to identify and 91 days to contain, for a total lifecycle of 324 days. Going undetected for this amount of time can cause irreversible damage. Faster and smarter detection capabilities are critical to addressing these attacks. 

Behavioral ransomware detection with NVIDIA DPUs and GPUs 

Adversaries and malware are evolving faster than defenders, making it hard for security teams to track changes and maintain signatures for known threats. To address this, a combination of AI and advanced security monitoring is needed. Developers can build solutions for detecting ransomware attacks faster using advanced technologies, including NVIDIA BlueField Data Processing Units (DPUs), the NVIDIA DOCA SDK with DOCA App Shield, and the NVIDIA Morpheus cybersecurity AI framework.

Intrusion detection with BlueField DPU

BlueField DPUs are ideal for enabling best-in-class zero-trust security and for extending that security to include host-based protection. Their built-in isolation creates a separate trust domain from the host system, where intrusion detection system (IDS) security agents are deployed. If a host is compromised, the isolation layer between the security control agents on the DPU and the host prevents the attack from spreading throughout the data center.

DOCA App-Shield is one of the libraries provided with the NVIDIA DOCA software framework. It is a security framework for host monitoring, enabling cybersecurity vendors to create IDS solutions that can quickly identify an attack on any physical server or virtual machine.

DOCA App-Shield runs on the NVIDIA DPU as an out-of-band (OOB) device in a separate domain from the host CPU and OS and is:

  1. Resilient against attacks on a host machine.
  2. Least disruptive to the execution of host applications.

DOCA App Shield exposes an API to users developing security applications. To detect malicious activity, it uses DMA from the DPU Arm processor without involving the host OS or CPU. In contrast, a standard antivirus or endpoint-detection-and-response agent runs on the host and can be seen or compromised by an attacker or malware.

Image of an NVIDIA BlueField-3 DPU.
Figure 1. NVIDIA BlueField-3 DPU 400 Gb/s infrastructure compute platform

Morpheus AI framework for cybersecurity 

Morpheus is part of the NVIDIA AI Enterprise software product family and is designed to build complex ML and AI-based pipelines. It provides significant acceleration of AI pipelines to deal with high data volumes, classify data, and identify anomalies, vulnerabilities, phishing, compromised machines, and many other security issues. 

Morpheus can be deployed on-premises with a GPU-accelerated server, such as the NVIDIA EGX Enterprise Platform, and it is also accessible through cloud deployment.

A workflow showing Morpheus consisting of a GPU-accelerated server with SmartNic/DPU and software stack of RAPIDS, Cyber Logs Accelerator, NVIDIA Triton, and NVIDIA TensorRT for real-time telemetry from BlueField DPUs.
Figure 2. NVIDIA Morpheus with BlueField DPU Telemetry

Addressing ransomware with AI

One of the pretrained AI models in Morpheus is the ransomware detection pipeline that leverages NVIDIA DOCA App-Shield as a data source. This brings a new level of security for detecting ransomware attacks that were previously impossible to detect in real time.

Ransomware detection AI pipeline showing a DPU monitoring virtual machines. The Morpheus AI server receives DOCA AppShield events and alerts high anomaly processes.
Figure 3. Ransomware detection AI pipeline

Inside BlueField DPU

The BlueField DPU offers the new OS-Inspector app, which leverages DOCA App-Shield host monitoring capabilities and enables constant collection of OS attributes from the monitored host or virtual machine. The OS-Inspector app is now available through early access. Contact us for more information.

The collected operating system attributes include processes, threads, libraries, handles, and vads (for a complete API list, see the App-Shield programming guide).

OS-Inspector App then uses DOCA Telemetry Service to stream the attributes to the Morpheus inference server using the Kafka event streaming platform. 

Inside the Morpheus Inference Framework

The Morpheus ransomware detection AI pipeline processes the data using GPU acceleration and feeds the data to the ransomware detection AI model.

This tree-based model detects ransomware attacks based on suspicious attributes in the servers. It uses N-gram features to capture the change in attributes through time and detect any suspicious anomaly. 

When an attack is detected, Morpheus generates an inference event and triggers a real-time alert to the security team for further mitigation steps.

A ransomware detection model detects a ransomware process named sample.exe.
Figure 4. Ransomware detection model

FinSec lab use case 

NVIDIA partner FinSec Innovation Lab, a joint venture between Mastercard and Enel X, demonstrated their solution for combating ransomware attacks at NVIDIA GTC 2023.

FinSec ran a POC, which used BlueField DPUs and the Morpheus cybersecurity AI framework to train a model that detected a ransomware attack in less than 12 seconds. This real-time response enabled them to isolate a virtual machine and save 80% of the data on the infected servers. 

Learn more

A BlueField DPU running DOCA App Shield enables OOB host monitoring. Together with Morpheus, developers can quickly build AI models to protect against cyber attacks better than ever before. The OS-Inspector app is now available through early access. Contact us for more information.

Categories
Misc

GPUs for ETL? Optimizing ETL Architecture for Apache Spark SQL Operations


Extract-transform-load (ETL) operations with GPUs using the NVIDIA RAPIDS Accelerator for Apache Spark running on large-scale data can produce both cost savings and performance gains. We demonstrated this in our previous post, GPUs for ETL? Run Faster, Less Costly Workloads with NVIDIA RAPIDS Accelerator for Apache Spark and Databricks. In this post, we dive deeper to identify precisely which Apache Spark SQL operations are accelerated for a given processing architecture. 

This post is part of a series on GPUs and extract-transform-load (ETL) operations.

Migrating ETL to GPUs

Should all ETL be migrated to GPUs? Or is there an advantage to evaluating which processing architecture is best suited to specific Spark SQL operations?

CPUs are optimized for sequential processing with significantly fewer yet faster individual cores. There are clear computational advantages for memory management, handling I/O operations, running operating systems, and so on.

GPUs are optimized for parallel processing with significantly more yet slower cores. GPUs excel at rendering graphics, training machine learning and deep learning models, performing matrix calculations, and other operations that benefit from parallelization.

Experimental design

We created three large, complex datasets modeled after real client retail sales data using computationally expensive ETL operations (illustrated in the PySpark sketch after this list):

  • Aggregation (SUM + GROUP BY)
  • CROSS JOIN
  • UNION
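
The following is a hedged PySpark sketch of these three operations (not the authors' benchmark code), using a small toy DataFrame; the column names are assumptions for illustration only.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl-ops-demo").getOrCreate()

sales = spark.createDataFrame(
    [(1, "A", 100.0), (1, "B", 250.0), (2, "A", 75.0)],
    ["store_id", "product", "sales_amount"],
)

# Aggregation (SUM + GROUP BY)
agg = sales.groupBy("store_id").agg(F.sum("sales_amount").alias("total_sales"))

# CROSS JOIN (Cartesian product of the table with itself)
crossed = sales.alias("a").crossJoin(sales.alias("b"))

# UNION (schemas must match)
combined = sales.union(sales)

agg.show()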

Each dataset was specifically curated to test the limits and value of specific Spark SQL operations. All three datasets were modeled based on a transactional sales dataset from a global retailer. The row size, column count, and type were selected to balance experimental processing costs while performing tests that would demonstrate and evaluate the benefits of both CPU and GPU architectures under specific operating conditions. See Table 1 for data profiles.

Operation                     Rows          Structured columns   Unstructured columns   Size (MB)
Aggregation (SUM + GROUP BY)  94.4 million  2                    0                      3,200
CROSS JOIN                    63 billion    6                    1                      983
UNION                         447 million   10                   2                      721

Table 1. Summary of experimental datasets

The following computational configurations were evaluated for this experiment: 

  • Worker and driver type
  • Workers [minimum and maximum]
  • RAPIDS or Photon deployment
  • Maximal hourly limits on Databricks units (DBUs)—a proprietary measure of Databricks compute cost  

Worker and driver type   Workers [min/max]   RAPIDS Accelerator / Photon   Max DBUs / hour
Standard_NC4as_T4_v3     1/1                 RAPIDS Accelerator            2
Standard_NC4as_T4_v3     2/8                 RAPIDS Accelerator            9
Standard_NC8as_T4_v3     2/2                 RAPIDS Accelerator            4.5
Standard_NC8as_T4_v3     2/8                 RAPIDS Accelerator            14
Standard_NC16as_T4_v3    2/2                 RAPIDS Accelerator            7.5
Standard_NC16as_T4_v3    2/8                 RAPIDS Accelerator            23
Standard_E16_v3          2/2                 Photon                        24
Standard_E16_v3          2/8                 Photon                        72

Table 2. Experimental computational configurations

Other experimental considerations

In addition to building industry-representative test datasets, other experimental factors are listed below. 

  1. Datasets are run using several different worker and driver configurations on pay-as-you-go instances (as opposed to spot instances), as their inherent availability establishes pricing consistency across experiments.
  2. For GPU testing, we leveraged RAPIDS Accelerator on T4 GPUs, which are optimized for analytics-heavy loads, and carry a substantially lower cost per DBU.
  3. The CPU worker type is an in-memory optimized architecture which uses Intel Xeon Platinum 8370C (Ice Lake) CPUs.
  4. We also leveraged Databricks Photon, a native CPU accelerator solution and accelerated version of their traditional Java runtime, rewritten in C++. 

These parameters were chosen to ensure experimental repeatability and applicability to common use cases. 

Results

To evaluate experimental results in a consistent fashion, we developed a composite metric named adjusted DBUs per minute (ADBUs). ADBUs are based on DBUs and computed as follows:

\text{Adjusted DBUs per minute} = \frac{\text{Runtime (mins)}}{\text{Cluster DBU cost per hour}}

Experimental results demonstrate that there is no Spark SQL task in which one chipset (GPU or CPU) always dominates. As Figure 1 shows, dataset characteristics and the suitability of a cluster configuration have the strongest impact on which framework to choose for a specific task. Although unsurprising, the question remains: which ETL processes should be migrated to GPUs?

UNION operations

Although RAPIDS Accelerator on T4 GPUs generates results with both lower costs and execution times for UNION operations, the difference compared with CPUs is negligible. Moving an existing ETL pipeline from CPUs to GPUs seems unwarranted for this combination of dataset and Spark SQL operation. It is likely (albeit untested by this research) that a larger dataset would generate results that warrant a move to GPUs.

CROSS JOIN operations

For the compute-heavy CROSS JOIN operation, we observed an order of magnitude of both time and cost savings by employing RAPIDS Accelerator (GPUs) over Photon (CPUs). 

One possible explanation for these performance gains is that the CROSS JOIN is a Cartesian product that involves an unstructured data column being multiplied with itself, so the amount of work grows rapidly with the size of the input. GPUs are well suited for this type of large-scale, parallelizable operation.

The main driver of cost differences is that the CPU clusters we experimented with had a much higher DBU rating than the chosen GPU clusters.  

SUM + GROUP BY operations

For aggregation operations (SUM + GROUP BY), we observed mixed results. Photon (CPUs) delivered notably faster compute times, whereas RAPIDS Accelerator (GPUs) provided lower overall costs. Looking at individual experimental runs, we observed that the Photon clusters carry a higher DBU rating, whereas the DBU costs associated with the T4 clusters are significantly lower.

This explains the lower overall cost using RAPIDS Accelerator in this part of the experiment. In summary, if speed is the objective, Photon is the clear winner. More price-conscious users may prefer the longer compute times of RAPIDS Accelerator for notable cost savings.

Bar graphs showing the trade-off between compute time and cost for UNION, CROSS JOIN, and SUM + GROUP operations in Spark SQL for both Photon and RAPIDS Accelerator
Figure 1. Comparison of mean compute time and mean cost

Deciding which architecture to use

The CPU cluster delivered better execution times in the commonly used aggregation (SUM + GROUP BY) experiment. However, this came at the price of higher associated cluster costs. For CROSS JOINs, a less common, compute-heavy, and highly parallelizable operation, GPUs dominated with both higher speed and lower costs. UNIONs showed negligible differences in compute time and cost.

Where GPUs (and by association RAPIDS Accelerator) will excel depends largely on the data structure, the scale of the data, the ETL operation(s) performed, and the user’s technical depth. 

GPUs for ETL

In general, GPUs are well suited to large, complex datasets and Spark SQL operations that are highly parallelizable. The experimental results suggest using GPUs for CROSS JOIN situations, as they are amenable to parallelization, and can also scale easily as data grows in size and complexity. 

It is important to note the scale of data is less important than the complexity of the data and the selected operation, as shown in the SUM + GROUP BY experiment. (This experiment involved more data, but less computational complexity compared to CROSS JOINs.) You can work with NVIDIA free of charge to estimate expected GPU acceleration gains based on analyses of Spark log files. 

CPUs for ETL

Based on the experiments, certain Spark SQL operations such as UNIONs showed a negligible difference in cost and compute time. A shift to GPUs may not be warranted in this case. Moreover, for aggregations (SUM + GROUP BY), a conscious choice of speed over cost can be made based on situational requirements, where CPUs will execute faster, but at a higher cost. 

In cases where in-memory calculations are straightforward, staying with an established CPU ETL architecture may be ideal.

Discussion and future considerations

This experiment explored one-step Spark SQL operations. For example, a singular CROSS JOIN, or a singular UNION, omitting more complex ETL jobs that involve multiple steps. An interesting future experiment might include optimizing ETL processing at a granular level, sending individual SparkSQL operations to CPUs or GPUs in a single job or script, and optimizing for both time and compute cost.

A savvy Spark user might focus on scripting strategies that make the most of the default runtime, rather than moving to a more efficient paradigm. Examples include:

  • Spark SQL join strategies such as broadcast join, shuffle merge, and hash join (see the sketch after this list)
  • High-performing data structures (storing data in parquet files that are highly performant in a cloud architecture as compared to text files, for example)
  • Strategic data caching for reuse 
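
As an illustration of the first strategy, here is a minimal PySpark broadcast join on toy data, a hedged sketch that is not part of the original experiments.

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join-demo").getOrCreate()
sales = spark.createDataFrame([(1, 100.0), (2, 250.0)], ["store_id", "sales_amount"])
stores = spark.createDataFrame([(1, "NYC"), (2, "SF")], ["store_id", "city"])

# broadcast() hints Spark to ship the small table to every executor, avoiding a shuffle
joined = sales.join(broadcast(stores), on="store_id", how="inner")
joined.show()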

The results of our experiment indicate that leveraging GPUs for ETL can supply additional performance sufficient to warrant the effort to implement a GPU architecture.

Although supported, RAPIDS Accelerator for Apache Spark is not available by default on Azure Databricks. This requires the installation of .jar files that may necessitate some debugging. This tech debt was largely paid going forward, as subsequent uses of RAPIDS Accelerator were seamless and straightforward. NVIDIA support was always readily available to help if and when necessary. 

Finally, we opted to keep all created clusters under 100 DBUs per hour to manage experimental costs. We tried only one size of Photon cluster. Experimental results may change by varying the cluster size, number of workers, and other experimental parameters. We feel these results are sufficiently robust and relevant for many typical use cases in organizations running ETL jobs.

Conclusion

NVIDIA T4 GPUs, designed specifically for analytics workloads, accomplish a leap in the price/performance ratio associated with leveraging GPU-based compute. NVIDIA RAPIDS Accelerator for Apache Spark, especially when run on NVIDIA T4 GPUs, has the potential to significantly reduce costs and execution times for certain common ETL SparkSQL operations, particularly those that are highly parallelizable. 

To implement this solution on your own Apache Spark workload with no code changes, visit the NVIDIA/spark-rapids-examples GitHub repo or the Apache Spark tool page for sample code and applications that showcase the performance and benefits of using RAPIDS Accelerator in your data processing or machine learning pipelines.

Categories
Misc

A Powerful Legacy: Researcher’s Mom Fueled Passion for Nuclear Fusion

Before she entered high school, Ge Dong wanted to be a physicist like her mom, a professor at Shanghai Jiao Tong University.

Categories
Misc

‘Arteana’s Art Squad’ Assembles — Indie Showrunner Rafi Nizam Creates High-End Children’s Show on a Budget

Rafi Nizam is an award-winning independent animator, director, character designer and more. He’s developed feature films at Sony Pictures, children’s series and comedies at BBC and global transmedia content at NBCUniversal.

Categories
Misc

Webinar: Build Realistic Robot Simulations with NVIDIA Isaac Sim and MATLAB


On Sept. 12, learn about the connection between MATLAB and NVIDIA Isaac Sim through ROS.