Microsoft Bing Speeds Ad Delivery With NVIDIA Triton

Jiusheng Chen’s team just got accelerated. They’re delivering personalized ads to users of Microsoft Bing with 7x throughput at reduced cost, thanks to NVIDIA Triton Inference Server running on NVIDIA A100 Tensor Core GPUs. It’s an amazing achievement for the principal software engineering manager and his crew. Tuning a Complex System Bing’s ad service uses Read article >


Accelerating the Accelerator: Scientist Speeds CERN’s HPC With GPUs, AI

Editor’s note: This is part of a series profiling researchers advancing science with high performance computing.  Maria Girone is expanding the world’s largest network of scientific computers with accelerated computing and AI. Since 2002, the Ph.D. in particle physics has worked on a grid of systems across 170 sites in more than 40 countries that Read article >


Take AI Learning to the Edge with NVIDIA Jetson

Picture of the Jetson AGX Orin and Jetson Orin Nano developer kits on a black background.The NVIDIA Jetson Orin Nano and Jetson AGX Orin Developer Kits are now available at a discount for qualified students, educators, and researchers.Since its…Picture of the Jetson AGX Orin and Jetson Orin Nano developer kits on a black background.

The NVIDIA Jetson Orin Nano and Jetson AGX Orin Developer Kits are now available at a discount for qualified students, educators, and researchers.Since its initial release almost 10 years ago, the NVIDIA Jetson platform has set the global standard for embedded computing and edge AI. These high-performance, low-power modules and developer kits for deep learning and computer vision give developers a small yet powerful platform to bring their robotics and edge AI vision to life. It’s the perfect tool for learning and teaching AI.

With 80x the performance over the original Jetson Nano, the Jetson Orin Nano Developer Kit enables you to run any kind of modern AI models, including transformer and advanced robotics models. It also has 5.4x the CUDA compute, 6.6x the CPU performance, and 50x the performance per watt compared to the Nano.

The NVIDIA Jetson AGX Orin Developer Kit and all Jetson Orin modules share one SoC architecture. This enables the developer kit to emulate any of the modules and makes it easy for you to start developing your next product. Compact size, lots of connectors, and up to 275 TOPS of AI performance make this developer kit perfect for prototyping advanced AI-powered robots and other edge AI devices.

The following videos show how easy it is to get started with the Jetson Orin Developer Kits.

Video 1. Getting Started with the Jetson Orin Nano Developer Kit
Video 2. Getting Started with the Jetson AGX Orin Developer Kit

Students making the most of Jetson developer kits

When you’ve got a Jetson-powered developer kit, imagine the possibilities for your next edge AI application.

For example, a group of mathematics, computer science, and biology researchers from the University of Marburg in Germany devised a way to quickly identify 80 species of birds and monitor local biodiversity based solely on the sounds those birds make. They used audio recordings captured with portable devices connected to the NVIDIA Jetson Nano Developer Kit.

Two pictures of the Bird@Edge system with colored overlays and component labels.
Figure 1. Bird@Edge system built with the Jetson Nano

Last fall, students from Southern Methodist University in Dallas, built what they called a baby supercomputer using 16 NVIDIA Jetson Nano modules, four power supplies, more than 60 handmade wires, a network switch, and some cooling fans. Compared to a normal-sized supercomputer, which can be huge, this mini supercomputer fits neatly on a desk, enabling students to easily use it and learn hands-on.

Picture of the student sitting next to a baby supercomputer on a desk.
Figure 2. SMU student with a baby supercomputer built using 16 Jetson Nano modules

Special student and educator pricing for Jetson Orin Developer Kits

If you are affiliated with an educational institution, you may be eligible for a discount on Jetson Orin Developer Kits. You must have a valid accredited university or education-related email address. You will then be provided with a one-time use code to order the Jetson Orin Nano Developer Kit or the Jetson AGX Orin Developer Kit. At this time, there is a limit of one of each kit per person.

Qualified students and educators can get the Jetson Orin Nano Developer Kit for $399 (USD) and the Jetson AGX Orin Developer Kit for $1,699 (USD). For more information, see Jetson for Education (login required). Requests for multiple units will be evaluated on a case-by-case basis.

The Ultimate School Supply: NVIDIA Jetson Nano                                                                                                         

Picture looks down on a student desk setup with a Jetson Nano Developer Kit.
Figure 3. A Jetson Nano Developer Kit helps students get started

Don’t need all the power and performance from Jetson Orin? The NVIDIA Jetson Nano Developer Kit is the perfect companion for the upcoming back-to-school season. Powered by NVIDIA GPU-accelerated computing, the Jetson Nano brings your ideas to life, whether you’re a student, educator, or DIY enthusiast. From building intelligent robots to developing groundbreaking AI applications, the possibilities are endless.

The Jetson Nano Developer Kit is available for $149 (USD).

Need help getting started?

Want to learn more about how you can use the Jetson Nano Developer Kit? Consider signing up for a free course from the NVIDIA Deep Learning Institute: Getting Started with AI on Jetson Nano. You may also want to look at the two Jetson AI certification programs available. For more inspiration, see the Jetson Community Projects page. Be sure to share your hard work in the Jetson forums so other developers can learn from your successes.


Harnessing the Power of NVIDIA AI Enterprise on Azure Machine Learning

AI is transforming industries, automating processes, and opening new opportunities for innovation in the rapidly evolving technological landscape. As more…

AI is transforming industries, automating processes, and opening new opportunities for innovation in the rapidly evolving technological landscape. As more businesses recognize the value of incorporating AI into their operations, they face the challenge of implementing these technologies efficiently, effectively, and reliably. 

Enter NVIDIA AI Enterprise, a comprehensive software suite designed to help organizations implement enterprise-ready AI, machine learning (ML), and data analytics at scale with security, reliability, API stability, and enterprise-grade support.

What is NVIDIA AI Enterprise?

Deploying AI solutions can be complex, requiring specialized hardware and software, as well as expert knowledge to develop and maintain these systems. NVIDIA AI Enterprise addresses these challenges by providing a complete ecosystem of tools, libraries, frameworks, and support services tailored for enterprise environments. 

With GPU-accelerated computing capabilities, NVIDIA AI Enterprise enables enterprises to run AI workloads more efficiently, cost effectively, and at scale. NVIDIA AI Enterprise is built on top of the NVIDIA CUDA-X AI software stack, providing high-performance GPU-accelerated computing capabilities. 

The suite includes:

  1. VMI: A preconfigured virtual machine image that includes the necessary drivers and software to support GPU-accelerated AI workloads in the major clouds.
  2. AI frameworks: Software that can run in a VMI (such as PyTorch, TensorFlow, RAPIDS, NVIDIA Triton with TensorRT and ONNX support, and more) that serves as the basis for AI development and deployment.
  3. Pretrained models: Models that can be used as-is, or fine-tuned on enterprise-relevant data.
  4. AI workflows: Prepackaged reference examples that illustrate how AI frameworks and pretrained models can be leveraged to build AI solutions to solve common business problems. These workflows provide guidance around fine-tuning pretrained models and AI model creation to build on NVIDIA frameworks. The pipelines to create applications are highlighted, as well as opinions on how to deploy customized applications and integrate them with various components typically found in enterprise environments, such as software for orchestration and management, storage, security, and networking. Available AI workflows include:
  • Intelligent virtual assistant: Engaging around-the-clock contact center assistance for lower operational costs.
  • Audio transcription: World-class, accurate transcripts based on GPU-optimized models.
  • Digital fingerprinting threat detection: Cybersecurity threat detection and alert prioritization to identify and act faster.
  • Next item prediction: Personalized product recommendations for increased customer engagement and retention.
  • Route optimization: Vehicle and robot routing optimization to reduce travel times and fuel costs.

Supported software with release branches

One of the main benefits of using the software available in NVIDIA AI Enterprise is that it is supported by NVIDIA with security and stability as guiding principles. NVIDIA AI Enterprise includes three release branches to cater to varying requirements across industries and use cases:

  1. Latest Release Branch: Geared towards those needing top-of-the-tree software optimizations, this branch will have a monthly release cadence, ensuring users have access to the latest features and improvements. CVE patches, along with bug fixes, will also be included in roll-forward releases.
  2. Production Release Branch: Designed for environments that prioritize API stability, this branch will receive monthly CVE patches and bug fixes, with two new branches introduced each year, each having a 9-month lifespan. To ensure seamless transitions and support, there will be a 3-month overlap period between two consecutive production branches. Production branches will be available in the second half of 2023.
  3. Long-Term Release Branch: Tailored for highly regulated industries where long-term support is paramount, this branch will receive quarterly CVE patches and bug fixes and offers up to 3 years of support for a particular release. Complementing this long-term stability is a 6-month overlap period to ensure smooth transitions between versions, thus providing the longevity and consistency needed for these highly regulated industries.
Diagram depicting the three release branches of NVIDIA AI Enterprise: Latest Release Branch, Production Release Branch, and Long-Term Release Branch
Figure 1. The three release branches of NVIDIA AI Enterprise serve varying requirements across industries and use cases

How to use NVIDIA AI Enterprise with Microsoft Azure Machine Learning

Microsoft Azure Machine Learning is a platform for AI development in the cloud and on premises, including a service for training, experimentation, deploying, and monitoring models, as well as designing and constructing prompt flows for large language models. An open platform, Azure Machine Learning supports all popular machine learning frameworks and toolkits, including those from NVIDIA AI Enterprise. 

This collaboration optimizes the experience of running NVIDIA AI software by integrating it with the Azure Machine Learning training and inference platform. Users no longer need to spend time setting up training environments, installing packages, writing training code, logging training metrics, and deploying models. With this integration, users will be able to leverage the power of NVIDIA enterprise-ready software, complementing Azure Machine Learning’s high performance and secure infrastructure, to build production-ready AI workflows. 

To get started today, follow these steps:

1. Sign in to Microsoft Azure and launch Azure Machine Learning Studio.

2. View and access all prebuilt NVIDIA AI Enterprise Components, Environments, and Models from the NVIDIA AI Enterprise Preview Registry (Figure 2).

Screenshot depicting the NVIDIA AI Enterprise Preview Registry in Azure Machine Learning, which includes NVIDIA-maintained Components, Environments, Models, and more
 Figure 2. NVIDIA AI Enterprise Preview Registry on Azure Machine Learning

3. Use these assets from within a workspace to create ML pipelines within the designer through simple drag and drop (Figure 3). 

Screenshot of Azure Machine Learning creating pipelines with NVIDIA AI Enterprise components.
Figure 3. Pipelines in Azure Machine Learning using NVIDIA AI Enterprise components

Find NVIDIA AI Enterprise sample assets in the Azure Machine Learning registry. Visit NVIDIA_AI_Enterprise_AzureML on GitHub to find code for the preview assets.

Use case: Body pose estimation 

Using the various elements within the NVIDIA AI Enterprise Preview Registry is easy. This example showcases a computer vision task that uses NVIDIA DeepStream for body pose estimation. NVIDIA TAO Toolkit provides the basis for the body pose model and the ability to refine it with new data.

Figure 4 shows a video analytics pipeline example running the NVIDIA DeepStream sample app for body pose estimation. It runs on a GPU cluster and can be easily adapted to leverage updated models and videos, unlocking the power of the Azure Machine Learning platform.

Screenshot showing how to create a pipeline in Azure Machine Learning for body pose estimation with NVIDIA AI Enterprise components
Figure 4. NVIDIA TAO Toolkit and NVIDIA DeepStream for body pose estimation with Azure Machine Learning

The example includes two URI-based data assets created for storing the inputs for the DeepStream sample app command component. The data assets leverage a pretrained model, which is readily available in the NVIDIA AI Enterprise Registry. They also include additional calibration and label information.

The DeepStream body pose command component is configured to use Microsoft Azure blob storage. This component monitors the input directory for any new video files that require inference. When a video file appears, the component picks it up and performs body pose inference. The outputted video includes bounding boxes and tracking lines and is stored in an output directory.

Additional samples available within the registry include:

  • bodyposenet
  • citysemsegformer
  • dashcamnet
  • emotionnet
  • fpenet
  • gazenet
  • gesturenet
  • lprnet
  • peoplenet
  • peoplenet_transformer
  • peoplesemsegnet
  • reidentificationnet
  • retail_object_detection
  • retail_object_recognition
  • trafficcamnet

Each of these samples can be improved with a TAO Toolkit-based training pipeline, which performs transfer learning. The model output changes to fit a specific use case. You can find TAO Toolkit computer vision sample workflows on NGC.

Get started with NVIDIA AI Enterprise on Azure Machine Learning

NVIDIA AI Enterprise and Azure Machine Learning together create a powerful combination of GPU-accelerated computing and a comprehensive cloud-based machine learning platform, enabling businesses to develop and deploy AI models more efficiently. This synergy enables enterprises to harness the flexibility of cloud resources while leveraging the performance advantages of NVIDIA GPUs and software. 

To get started with NVIDIA AI Enterprise on Azure Machine Learning, sign up for a Tech Preview. This will give you access to all of the prebuilt Components, Environments, and Models from the NVIDIA AI Enterprise Preview Registry on Azure Machine Learning.


GPU Integration Propels Data Center Efficiency and Cost Savings for Taboola

Picture of an aisle in a data center, with servers on either side.When you see a context-relevant advertisement on a web page, it’s most likely content served by a Taboola data pipeline. As the leading content recommendation…Picture of an aisle in a data center, with servers on either side.

When you see a context-relevant advertisement on a web page, it’s most likely content served by a Taboola data pipeline. As the leading content recommendation company in the world, a big challenge for Taboola was the frequent need to scale Apache Spark CPU cluster capacity to address the constantly growing compute and storage requirements.

Data center capacity and hardware costs are always under pressure.

What caused the scaling challenges? Taboola uses a complex data pipeline stretching from user browsers or mobile devices through multiple data centers. Complex deep learning algorithms, databases, infrastructure services (such as Apache Kafka), and thousands of servers are deployed to serve the best-fitting ads to users around the world.

This post describes Taboola’s motivation to move to RAPIDS Accelerator for Apache Spark to optimize processing costs, along with insights on the migration process, challenges, and lessons learned to date.  

Challenges of feeding a compute-hungry pipeline

To plan solutions, you must fully understand the magnitude of the problem. When serving ad content, Taboola builds a unique pageview that identifies each user and their interaction with the system. The pageview, a large, wide data structure, is built within a massive CPU cluster using data collected across worldwide data centers. The structure contains over 1500 distinct columns amounting to >1 TB of hourly data, all processed in our Apache Spark CPU cluster.

Many distinct analyzers and SQL queries process incoming pageviews at a rate of 1 TB of raw data per hour and catchup runs at 2, 6, 12, and 48 hours. New analyzers are constantly being created that increase the load on the Apache Spark cluster. There is a growing need for more compute power.

Our main task was to make the complex compute-hungry pipeline more scalable while being cost-effective. To address this challenge, Taboola embarked on an effort to migrate thousands of CPU cores to GPUs, to help us cope with the increasing data load to be processed.

One tool we considered to accelerate Taboola’s Apache Spark environments was the RAPIDS Accelerator on GPUs. As such, it was a natural decision to attempt greater scalability on GPUs as opposed to CPUs. 

Key considerations

First, we defined what we needed for testing to successfully migrate thousands of CPU cores to leverage GPU acceleration:

  • Real-life dataset
  • Hardware specifications
  • Minimum X factor
  • Complex query testing 

Real-life dataset

We used real production data from Cyber Monday to test and benchmark a large dataset. The data is 1.5 TB of ZSTD-compressed Parquet files per hour. It has over 1500 columns of all native types including arrays, structures, and nested structures with arrays.

Hardware specifications

For hardware, we started with the following resources:

  • A 72 CPU core Intel server with three A30 GPUs
  • A 900-GB local SSD drive, for Apache Spark to store its intermediate files
  • 380 GB RAM
  • A 10-Gb/s NIC card

Minimum X factor

The primary question when migrating a project from the CPU to the GPU is usually, “What’s the X factor?” For a real-world cluster with multiple GPUs, the question is, “How many GPUs do I need for sustaining a load manageable by X CPU cores?” The answer is your X factor.

We set a minimum bar of an X factor of 3 for the GPU solution to be considered successful in terms of cost. This factor helps guarantee our migration efforts will pay off.

Complex query testing

We picked 15 queries from production across multiple R&D departments to resemble as many of the hundreds of queries as are in production.

The queries are mostly complex including many SQL operations:

  • Aggregations
  • Sorts
  • Lateral view explode
  • Distribute by
  • Window functions
  • UDFS

Figure 1 shows an example query.

Image shows multiple tables related by primary keys and 1:1 or 1:many relationships.
Figure 1. Live sample of a large complex query in production requiring more compute power

Table 1 shows you Taboola’s factors.

Analyzer name Avg CPU production time Avg GPU time GPU factor
AdvertiserDimensionByRequest 586.41 31.91 18.38
ExperiementAnalysisPage 3021.6 102.92 29.36
ExperiementAnalysisPlacement 680.84 47.12 14.45
ExperimentAnalysisRequestBase 6605.44 362.68 18.21
ExperimentAnalysisSession 207.87 23.01 9.03
MediaDatatrendsBase 222.94 9.8 22.75
PerformanceMeasurements 397.17 86.22 4.61
PublisherPerformance 965.63 108.95 8.86
RBoxABTest 63.04 2.4 23.88
RevenueByHostHourly 487.44 95.03 5.13
SlaUnitAvailableFillRate 1199.93 152.38 7.87
SupplyDataTrends 529.92 45.28 11.7
Table 1. Production speedups

CPU-to-GPU migration goals

We began with a single server (as described earlier) capable of scaling to a multi-GPU and multi-server cluster. The cluster would be managed by Kubernetes, as opposed to the current Mesos cluster. Mesos was going to become obsolete and NVIDIA environments support Kubernetes.

Any changes in the software and hardware environment would have to be oblivious to Taboola’s code. Queries run on GPUs should execute with the exact results as CPUs. The team was aware of this challenge, as production stability was a critical goal.

Lastly, we wanted the GPU to outperform the CPUs by a minimum factor of 3. We benchmarked several GPUs including NVIDIA P100, NVIDIA V100, NVIDIA A100, and NVIDIA A30. We learned that the A30 GPU gave us the best price-performance fit.

First experiment with RAPIDS Accelerator

We ran SQL queries using RAPIDS Accelerator with somewhat disappointing results. Some of the less complex queries, mostly with lateral view explode, gave a 3x to 5x factor over the CPU. Some showed much lower factors, while other queries crashed.

After asking questions in the RAPIDS GitHub repo, we tried relevant Apache Spark and RAPIDS Accelerator parameters and began seeing better results:

  • sql.files.maxPartitionBytes: The CPU uses a default value of 128 MB. This is too low for the GPU. We are using 1–2 GB.
  • sql.shuffle.partitions: We found the 200 default to be sufficient in most cases.
  • rapids.sql.concurrentGpuTasks: Determines the number of tasks that can be run concurrently on the GPU. A minimum of at least two tasks appears to be best.

Tuning these parameters helped the queries run more smoothly and perform better in some cases. The NVIDIA Accelerated Spark Analysis tool can automatically generate tuning recommendations.

Challenge #1 – Parquet parsing overhead

We experienced bottlenecks on some of the less performant SQL queries, with most of them wasting time parsing Parquet footer data on the CPU. The Parquet data has more than 1500 columns. It was apparent that regular Java code for parsing the footer was not adequate for such a large footer.

Figure 2 shows a small section of NVIDIA profiler output for a 9-second Apache Spark task where the GPU was mostly idle (only working for 330 ms). 

Screenshot of Nsys trace for Parquet parsing.
Figure 2. Inefficient Parquet parsing trace

Figure 3 shows a flame graph of a query suffering from this behavior. The purple bars indicate time spent inside the org.apache.parquet.hadoop.ParquetFileReader class. Almost 50% of the query time was spent on parsing Parquet’s footer while the GPU was idle.

Screenshot of flame graph for Parquet parsing
Figure 3. Trace showing inefficient Parquet parsing in purple bars

As such, we set off to test an alternative solution. When parsing the footer, Parquet code would iterate over footer metadata serially for each row group. We adjusted Parquet parameters to decrease the number of row groups in each file. This gave us about 10–15% improvement. We found this improvement to be insufficient.

Each time metadata was read, all 1500 columns of metadata were read and parsed serially, even after only asking for 50-100 columns per query. We wanted to index footer metadata such that instead of reading the entire 1500 data columns serially, we’d just access it directly. We managed to make this happen by changing the Parquet-mr public code in C++ and Java. Although we got nice performance results, this was too cumbersome and complex.

Fortunately, the NVIDIA RAPIDS team had a much better idea. The solution was optimal Parquet file parsing. They replaced the Java code with Arrow’s C++ implementation. We now have the rapids.sql.format.parquet.reader.footer.type set to NATIVE by default for our GPU implementation. The bottleneck was solved and there were no longer queries with the GPU idle because of footer parsing overheads on the CPU.

Challenge #2 – Network bottleneck

A weak network card caused the next bottleneck. While the 10-Gb/s Ethernet card sustained the CPU load, it failed to do so for the GPU load. 

The solution was an optimal network card for GPU loads. Replacing the 10-Gb/s Ethernet card with a 25-Gb/s card removed this bottleneck.

Challenge #3 – Disk I/O bottleneck

Even after removing the two bottlenecks, queries still ran slow. On the Apache Spark user interface, we saw a clear indication as to what was happening.

Metric Min 25th percentile Median 75th percentile Max
Duration 0.4 s 0.6 s 0.8 s 1 s 1.2 min
GC Time 0.0 ms 0.0 ms 0.0 ms 90.0 ms 0.5 s
Shuffle Read Size/Records 21.4 MB/1000 22.3 MB/1000 22.5 MB/1000 22.7 MB/1000 27.3 MB/1000
Shuffle Write Size/Records 17.5 MB/1000 17.9 MB/1000 18 MB/1000 18.1 MB/1000 18.1 MB/1000
Scheduler Delay 3.0 ms 5.0 ms 5.0 ms 7.0 ms 3 s
Peak Execution Memory 64 MB 64 MB 64 MB 64 MB 64 MB
Shuffle Write Time 9.0 ms 13.0 ms 18.0 ms 21.0 ms 59 s
Table 2. Statistics of disk I/O bottlenecks

In Table 2, in the Max column, the task’s duration is 1.2 minutes, while Shuffle Write Time took 58 seconds. A lot of time was wasted doing shuffle work while the GPU was idle.

Figures 4 and 5 show the appropriate event timeline graph. The orange part represents shuffle times, read or write. The green parts are compute time. We were wasting a lot of time reading or writing shuffle files.

The event timeline segment graph shows intense write shuffle times compared to compute times.
Figure 4. Event timeline graph of Disk I/O bottlenecks as segments
The event timeline bar chart shows intense write shuffle times at about 70-90% of compute times.
Figure 5. Event timeline bar chart of disk I/O bottlenecks

Our shuffle files can get up to 500 GB and greater in some queries. We obviously can’t keep this huge amount of data in the server’s RAM, so shuffle files are stored in the local SSD drive.

After a quick investigation with our teams, we figured out that the SSD drive was configured to use RAID-1. Each temporary shuffle file was saved twice to the disk. This wasted a lot of time and switching to RAID-0 somewhat improved the situation.

The GPU put more pressure on the SSD drive than the CPU such that we had to replace the SSD with an NVMe drive. The solution was to switch to a 6-TB NVMe drive. We removed one of the GPUs to create a slot for the NVMe drive. Afterward, we had no shuffle read or write performance issues. We also learned that one NVMe drive could sustain the workload of two A30 GPUs.

Moving to Kubernetes

Because our Mesos cluster would become obsolete, the move to Kubernetes from a standalone POC machine was imperative. This involved a lot of configuration work, and other minor work. Still, it was straightforward to implement.

The basic idea is that the Apache Spark driver would sit on a non-GPU machine and each K8s Pod would be associated with a single GPU. For more information, see Getting Started with RAPIDS and Kubernetes.

The following code example shows some of the major relevant Kubernetes configurations.

    include: spark_k8s_extra_files

            spark.kubernetes.container.image.pullPolicy: Always
            spark.kubernetes.authenticate.serviceAccountName: spark
            spark.kubernetes.executor.deleteOnTermination: true
            spark.deploy.mode: client
            spark.executorEnv.preLoadMemoryLibraryName: "/usr/"
            spark.executorEnv.xmxPercentage: 80
            spark.kubernetes.memoryOverheadFactor: 0.1
            spark.mesos.fetcherCache.enable: false

            spark.plugins: "com.nvidia.spark.SQLPlugin"
            spark.kubernetes.executor.podTemplateFile: /conf/k8GPUPodTemplateProduction.yml
            spark.executor.resource.gpu.vendor: ""
            spark.executor.resource.gpu.discoveryScript: /conf/
            spark.executor.resource.gpu.amount: 1

            # GPU task configuration
            spark.rapids.sql.variableFloatAgg.enabled: "true"
            spark.rapids.sql.castFloatToDecimal.enabled: "true"
            spark.rapids.sql.rowBasedUDF.enabled: "true"
            spark.rapids.sql.format.parquet.reader.footer.type: "NATIVE"
            spark.rapids.sql.explain: "all"   # For debug.

            # Most common
            spark.rapids.sql.concurrentGpuTasks: 4
            spark.sql.files.maxPartitionBytes: "2048m"
            spark.rapids.sql.batchSizeBytes: "1g"
            spark.sql.shuffle.partitions: 200

How many GPUs to accelerate?

If the goal is to use GPUs to accelerate workloads more cost-effectively than CPUs, we first had to understand how fast GPUs were. We set up a test system with two A30 GPUs and streamed production data to it in parallel with our large CPU core production environment.

Figure 6 shows two of our heaviest queries running on the production CPU cluster and also on the server with two A30 GPUs.

Screenshot compares two heavy queries running on the A30 GPU and the CPU cluster.
Figure 6. Heavy queries and resulting hourly processing time for CPU and GPU

The yellow and green lines denote the hourly total time across all task numbers of the two queries running on the CPU cluster. The blue and orange lines denote the same queries running on the GPU server. GPU factors are 20x and higher.

Figure 7 shows the two queries running on the GPU shown in Figure 6. Observe that they behave similarly to the CPU in that peaks and valleys roughly align at different times of the day.

Screenshot zooms in to the two heavy queries running on the A30 GPU.
Figure 7. Hourly GPU total time across all tasks

Figure 8 is interesting in that it shows the factor of all the queries that we migrated to the GPU and their counterparts on the CPU. The GPU run is missing the biggest query, which we are still working on migrating to the GPU. It will likely add 200 hours per day to the GPU total time.

Screenshot shows the daily aggregation of multiple queries on the A30 GPU and the CPU cluster.
Figure 8. Daily aggregation comparison of CPU vs GPU times

Lessons learned

Looking ahead to our next migration steps, Taboola would like to move queries from other R&D departments to the GPU, resulting in a greater number of GPUs in production. This means that QA must monitor the system more closely while in production.

Getting acquainted with RAPIDS Accelerator for Apache Spark was an amazing “joy ride” effort. From handling Parquet files to pushing GPUs to their limit, we became better equipped at handling a large data pipeline along with managing data center capacity and hardware costs. Identifying and coping with hardware limitations with this method proved to be rewarding.

Here are Taboola’s top takeaways for those considering CPU-to-GPU migration:

  • Tuning parameters in complex environments that have multiple variables is never straightforward. Automate this task to whatever degree possible. It’s probably a good idea to use the NVIDIA Accelerated Spark Analysis tool to help address this challenge, as it can easily suggest optimized parameters.
  • Look beyond CPUs and GPUs for solutions to bottlenecks. No amount of GPU horsepower can resolve issues that are fundamentally network, disk, bandwidth, or configuration and parsing related.
  • Multiple GPUs do not hinder performance, but GPUs are so powerful that you may have good performance with fewer GPUs than originally thought. It’s best to test for this to lower costs.
    • We achieved our 20x factor through RAPIDS Accelerator and NVIDIA GPUs. The biggest lesson learned was that we needed to better understand what was happening in our existing environment before benefiting from GPU acceleration. One A30 GPU sustained the same production load for some workloads as the ~200-CPU core test cluster.

For more information about achieving performance multiples over Apache Spark CPU-based environments, see GPU-Accelerated Apache Spark.


Our huge effort was more successful with the support, assistance, and patience of two great groups of people. At Taboola: Andrey Gourine, Gilad Zamoscinski, Igor Berman, Keren Corsia, Lior Chaga, and Michael Taranov. On the NVIDIA RAPIDS team: Alessandro Bellina, Hao Zhu, Karthikeyan Rajendran, Robert Evans, and Sameer Raheja.


Neuralangelo by NVIDIA Research Reconstructs 3D Scenes from 2D Videos

A new model generates 3D reconstructions using neural networks, turns 2D video clips into detailed 3D structures — generating lifelike virtual replicas of…

A new model generates 3D reconstructions using neural networks, turns 2D video clips into detailed 3D structures — generating lifelike virtual replicas of buildings, sculptures and other real-world objects.


Webinar: Accelerate AI Model Inference at Scale for Financial Services

Illustration of a character sitting at a computer with a warning popup.Learn how AI is transforming financial services across use cases such as fraud detection, risk prediction models, contact centers, and more. Illustration of a character sitting at a computer with a warning popup.

Learn how AI is transforming financial services across use cases such as fraud detection, risk prediction models, contact centers, and more. 


A New Age: ‘Age of Empires’ Series Joins GeForce NOW, Part of 20 Games Coming in June

The season of hot sun and longer days is here, so stay inside this summer with 20 games joining GeForce NOW in June. Or stream across devices by the pool, from grandma’s house or in the car — whichever way, GeForce NOW has you covered. Titles from the Age of Empires series are the next Read article >


Digital Renaissance: NVIDIA Neuralangelo Research Reconstructs 3D Scenes

Neuralangelo, a new AI model by NVIDIA Research for 3D reconstruction using neural networks, turns 2D video clips into detailed 3D structures — generating lifelike virtual replicas of buildings, sculptures and other real-world objects. Like Michelangelo sculpting stunning, life-like visions from blocks of marble, Neuralangelo generates 3D structures with intricate details and textures. Creative professionals Read article >


Protecting Sensitive Data and AI Models with Confidential Computing

Image of an AI model, data, and an application locked, with an NVIDIA H100 GPU underneath.Rapid digital transformation has led to an explosion of sensitive data being generated across the enterprise. That data has to be stored and processed in data…Image of an AI model, data, and an application locked, with an NVIDIA H100 GPU underneath.

Rapid digital transformation has led to an explosion of sensitive data being generated across the enterprise. That data has to be stored and processed in data centers on-premises, in the cloud, or at the edge. Examples of activities that generate sensitive and personally identifiable information (PII) include credit card transactions, medical imaging or other diagnostic tests, insurance claims, and loan applications.

This wealth of data presents an opportunity for enterprises to extract actionable insights, unlock new revenue streams, and improve the customer experience. Harnessing the power of AI enables a competitive edge in today’s data-driven business landscape.

However, the complex and evolving nature of global data protection and privacy laws can pose significant barriers to organizations seeking to derive value from AI:

  • General Data Protection Regulation (GDPR) in Europe
  • Health Insurance Portability and Accountability Act (HIPAA) and Gramm-Leach-Bliley Act (GLBA) in the United States

Customers in healthcare, financial services, and the public sector must adhere to a multitude of regulatory frameworks and also risk incurring severe financial losses associated with data breaches.

Three figures show the costs of data breaches: the global average total cost of a data breach is $4.35M; the average cost of a breach in healthcare is $10.10M; and the average cost of a breach in finance is $5.97M.
Figure 1. Cost of data breaches(Source: IBM Cost of Data Breaches report)

In addition to data, the AI models themselves are valuable intellectual property (IP). They are the result of significant resources invested by the model owners in building, training, and optimizing. When not adequately protected in use, AI models face the risk of exposing sensitive customer data, being manipulated, or being reverse-engineered. This can lead to incorrect results, loss of intellectual property, erosion of customer trust, and potential legal repercussions.

Data and AI IP are typically safeguarded through encryption and secure protocols when at rest (storage) or in transit over a network (transmission). But during use, such as when they are processed and executed, they become vulnerable to potential breaches due to unauthorized access or runtime attacks.

Diagram shows the three states in which data exists: Data at rest or in storage is secure; data in transit or moving across a network is secure; data in use or while being processed is not secure.
Figure 2. Security gap in the end-to-end data lifecycle without confidential computing

Preventing unauthorized access and data breaches

Confidential computing addresses this gap of protecting data and applications in use by performing computations within a secure and isolated environment within a computer’s processor, also known as a trusted execution environment (TEE).

The TEE acts like a locked box that safeguards the data and code within the processor from unauthorized access or tampering and proves that no one can view or manipulate it. This provides an added layer of security for organizations that must process sensitive data or IP.

Figure shows data in use or being processed as secure with confidential computing.
Figure 3. Confidential computing addresses the security gap

The TEE blocks access to the data and code, from the hypervisor, host OS, infrastructure owners such as cloud providers, or anyone with physical access to the servers. Confidential computing reduces the surface area of attacks from internal and external threats. It secures data and IP at the lowest layer of the computing stack and provides the technical assurance that the hardware and the firmware used for computing are trustworthy.

Today’s CPU-only confidential computing solutions are not sufficient for AI workloads that demand acceleration for faster time-to-solution, better user experience, and real-time response times. Extending the TEE of CPUs to NVIDIA GPUs can significantly enhance the performance of confidential computing for AI, enabling faster and more efficient processing of sensitive data while maintaining strong security measures.

Secure AI on NVIDIA H100

Confidential computing is a built-in hardware-based security feature introduced in the NVIDIA H100 Tensor Core GPU that enables customers in regulated industries like healthcare, finance, and the public sector to protect the confidentiality and integrity of sensitive data and AI models in use.

Figure shows AI model, data, and application running on NVIDIA H100 GPU being secure with confidential computing.
Figure 4. Confidential computing on NVIDIA H100 GPUs

With security from the lowest level of the computing stack down to the GPU architecture itself, you can build and deploy AI applications using NVIDIA H100 GPUs on-premises, in the cloud, or at the edge. No unauthorized entities can view or modify the data and AI application during execution. This protects both sensitive customer data and AI intellectual property.

Diagram compares legacy VMs without confidential computing (full access, unencrypted transfers) and fully isolated VMs with confidential computing turned on (no read/write access, encrypted transfers).
Figure 5. Full isolation of virtual machines with confidential computing

For a quick overview of the confidential computing feature in NVIDIA H100, see the following video.

Video 1. What is NVIDIA Confidential Computing?

Unlocking secure AI: Use cases empowered by confidential computing on NVIDIA H100

Accelerated confidential computing with NVIDIA H100 GPUs offers enterprises a solution that is performant, versatile, scalable, and secure for AI workloads. It unlocks new possibilities to innovate with AI while maintaining security, privacy, and regulatory compliance.

  • Confidential AI training
  • Confidential AI inference
  • AI IP protection for ISVs and enterprises
  • Confidential federated learning

Confidential AI training

In industries like healthcare, financial services, and the public sector, the data used for AI model training is sensitive and regulated. This includes PII, personal health information (PHI), and confidential proprietary data, all of which must be protected from unauthorized internal or external access during the training process.

With confidential computing on NVIDIA H100 GPUs, you get the computational power required to accelerate the time to train and the technical assurance that the confidentiality and integrity of your data and AI models are protected.

Figure shows that confidential computing during training provides proof of regulatory compliance, protection from adversarial attacks or tampering, and unauthorized access or modifications from the platform provider.
Figure 6. Protection for training data and AI models being trained

For AI training workloads done on-premises within your data center, confidential computing can protect the training data and AI models from viewing or modification by malicious insiders or any inter-organizational unauthorized personnel. When you are training AI models in a hosted or shared infrastructure like the public cloud, access to the data and AI models is blocked from the host OS and hypervisor. This includes server administrators who typically have access to the physical servers managed by the platform provider.

Confidential AI inference

When trained, AI models are integrated within enterprise or end-user applications and deployed on production IT systems—on-premises, in the cloud, or at the edge—to infer things about new user data.

End-user inputs provided to the deployed AI model can often be private or confidential information, which must be protected for privacy or regulatory compliance reasons and to prevent any data leaks or breaches.

The AI models themselves are valuable IP developed by the owner of the AI-enabled products or services. They are at risk of being viewed, modified, or stolen during inference computations, resulting in incorrect results and loss of business value.

Figure shows AI inference workflow and how confidential computing protects the deployed AI model, application and user data provided as input
Figure 7. Protection for the deployed AI model, AI-enabled apps, and user input data

Deploying AI-enabled applications on NVIDIA H100 GPUs with confidential computing provides the technical assurance that both the customer input data and AI models are protected from being viewed or modified during inference. This provides an added layer of trust for end users to adopt and use the AI-enabled service and also assures enterprises that their valuable AI models are protected during use.

AI IP protection for ISVs and enterprises

Independent software vendors (ISVs) invest heavily in developing proprietary AI models for a variety of application-specific or industry-specific use cases. Examples include fraud detection and risk management in financial services or disease diagnosis and personalized treatment planning in healthcare.

ISVs must protect their IP from tampering or stealing when it is deployed in customer data centers on-premises, in remote locations at the edge, or within a customer’s public cloud tenancy. In addition, customers need the assurance that the data they provide as input to the ISV application cannot be viewed or tampered with during use.

Figure shows AI ISVs protecting their AI IP during deployment across multiple deployment options like on-premises, cloud, edge, and colocation.
Figure 8. Secure deployment of AI IP

Confidential computing on NVIDIA H100 GPUs enables ISVs to scale customer deployments from cloud to edge while protecting their valuable IP from unauthorized access or modifications, even from someone with physical access to the deployment infrastructure. ISVs can also provide customers with the technical assurance that the application can’t view or modify their data, increasing trust and reducing the risk for customers using the third-party ISV application.

Confidential federated learning

Building and improving AI models for use cases like fraud detection, medical imaging, and drug development requires diverse, carefully labeled datasets for training. This demands collaboration between multiple data owners without compromising the confidentiality and integrity of the individual data sources.

Confidential computing on NVIDIA H100 GPUs unlocks secure multi-party computing use cases like confidential federated learning. Federated learning enables multiple organizations to work together to train or evaluate AI models without having to share each group’s proprietary datasets.

Figure shows federated learning use case in healthcare being protected by confidential computing. Confidential computing protects the local AI model training at each participating hospital and protects the model aggregation on the central server.
Figure 9. An added layer of protection for multi-party AI workloads like federated learning

Confidential federated learning with NVIDIA H100 provides an added layer of security that ensures that both data and the local AI models are protected from unauthorized access at each participating site.

When deployed at the federated servers, it also protects the global AI model during aggregation and provides an additional layer of technical assurance that the aggregated model is protected from unauthorized access or modification.

This helps drive the advancement of medical research, expedite drug development, mitigate insurance fraud, and trace money laundering globally while maintaining security, privacy, and regulatory compliance for all parties involved.

NVIDIA platforms for accelerated confidential computing on-premises

Getting started with confidential computing on NVIDIA H100 GPUs requires a CPU that supports a virtual machine (VM)–based TEE technology, such as AMD SEV-SNP and Intel TDX. Extending the VM-based TEE from the supported CPU to the H100 GPU enables all the VM memory to be encrypted and the application running does not require any code changes.

Top OEM partners are now shipping accelerated platforms for confidential computing, powered by NVIDIA H100 Tensor Core GPUs. These confidential computing–compatible systems combine the NVIDIA H100 PCIe Tensor Core GPUs with AMD Milan or AMD Genoa CPUs that support the AMD SEV-SNP technology.

The following partners are delivering the first wave of NVIDIA platforms for enterprises to secure their data, AI models, and applications in use in data centers on-premises:

  • ASRock Rack
  • ASUS
  • Cisco
  • Dell Technologies
  • Hewlett Packard Enterprise
  • Lenovo
  • Supermicro
  • Tyan

NVIDIA H100 GPU comes with the VBIOS (firmware) that supports all confidential computing features in the first production release. The NVIDIA confidential computing software stack to be released this summer will support single GPU first and then multi-GPU and Multi-Instance GPU in subsequent releases.

For more information about the NVIDIA confidential computing software stack and availability, see the Unlock the Potential of AI with Confidential Computing on NVIDIA GPUs session (Track 2) at the Confidential Computing Summit 2023 on June 29.

For more information about how confidential computing on NVIDIA H100 GPUs works, see the following videos: