Saving Apache Spark Big Data Processing Costs on Google Cloud Dataproc

According to IDC’s Global DataSphere forecast, the volume of data generated each year is growing exponentially, and the world is projected to generate 221 ZB of data by 2026. This data holds valuable insights, but as the volume of data grows, so does the cost of processing it. As a data scientist or engineer, you’ve certainly felt the pain of slow-running data processing jobs.

Apache Spark addressed this data processing problem at the scale of thousands of terabytes in the 2010s. However, in the 2020s, the amount of data that requires processing has outgrown the compute capacity of CPU-based infrastructure.

For organizations processing hundreds of thousands of terabytes, CPU-based infrastructure is a limiting factor, and expanding it adds massive cost. Compute limitations constrict their ability to derive more insights from their data, prepare data for training AI/ML pipelines, and experiment with new model types.

The old rule still holds: 80% of the time is spent on data preparation rather than model development, and this is hindering the growth of data science.

To address these challenges, recent Apache Spark 3.x releases deliver new optimization features, such as resource-aware scheduling and columnar data processing. With the RAPIDS Accelerator for Apache Spark, jobs can automatically be scheduled on NVIDIA GPUs for faster data processing. This solution requires zero code changes.

We are excited to announce a new integration with Google Cloud’s Dataproc. With GPU acceleration, data processing jobs run up to 5x faster and cost up to 80% less than on equivalent CPU-based infrastructure.

Dataproc provides a fully managed Apache Spark service in the cloud. With the ability to create any Spark cluster within 90 seconds on average, enterprise-level security, and tight integration with other Google Cloud services, Dataproc provides a strong platform to deploy Apache Spark applications.

This post provides instructions on how to get started with using GPU acceleration on Spark workloads on Dataproc. We discuss the different challenges of CPU-to-GPU migration and explain how to speed up data processing pipelines. We highlight new RAPIDS Accelerator user tools for Dataproc that help set you up for success, such as providing insights into which jobs will perform best on GPU.

Speeding up data processing jobs

By combining the RAPIDS cuDF library with the scale-out capabilities of Apache Spark, data practitioners can process data quickly and cost-efficiently with GPUs. The RAPIDS Accelerator for Apache Spark is a plugin that enables you to speed up Apache Spark 3 jobs by leveraging GPUs. 

Requiring no API changes from you, the RAPIDS Accelerator for Apache Spark plugin automatically replaces supported SQL operations with GPU-accelerated versions whenever possible, while falling back to Spark on CPU for everything else. Because no code or major infrastructure changes are required, you can iteratively design your workloads to be optimized for both performance and budget.
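To make this concrete, here is a minimal sketch of what enabling the plugin looks like from PySpark. The app name, bucket paths, and resource amounts are illustrative assumptions, and the RAPIDS Accelerator jar is assumed to already be available on the cluster (for example, installed through the Dataproc setup referenced in the documentation later in this post):

```python
from pyspark.sql import SparkSession

# Minimal sketch: the RAPIDS Accelerator is enabled entirely through Spark configs.
# Resource amounts below are placeholders, not tuned recommendations.
spark = (
    SparkSession.builder
    .appName("etl-on-gpu")
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")    # RAPIDS Accelerator plugin entry point
    .config("spark.rapids.sql.enabled", "true")               # allow supported SQL ops to run on GPU
    .config("spark.executor.resource.gpu.amount", "1")        # one GPU per executor (assumed cluster shape)
    .config("spark.task.resource.gpu.amount", "0.125")        # let 8 concurrent tasks share each GPU
    .getOrCreate()
)

# The application code itself does not change; the plugin decides what runs on GPU.
df = spark.read.parquet("gs://your-bucket/input")             # placeholder path
df.groupBy("customer_id").count().write.parquet("gs://your-bucket/output")
```

If you want to see which operations in a plan stayed on the CPU, setting spark.rapids.sql.explain to NOT_ON_GPU makes the plugin log the reason for each fallback.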

Finally, the new RAPIDS Accelerator user tools for Dataproc provide a set of features that support data scientists in migrating Spark jobs to GPU, including gap analysis and workload recommendations. With these, data practitioners can better determine which Spark jobs will see the best speedups when migrated to GPU.

Pain points in CPU-to-GPU migration 

Despite the supported migration to GPU provided by the RAPIDS Accelerator for Apache Spark, Spark users are often reluctant to make the jump because of assumed pain points in the process. In this post, we dive deeper into those common concerns, and show how the features of the new RAPIDS Accelerator user tools for Dataproc mitigate those issues. 

Challenge #1: The cost associated with moving Spark jobs to GPU is not easy to predict

Many data practitioners assume that running an application on GPU will be more expensive, despite taking less time. In practice, this is rarely the case.

Resolution: The RAPIDS Accelerator for Apache Spark workload qualification tool analyzes Spark event logs generated from CPU-based Spark applications. It provides you with upfront cost estimates to help quantify the expected acceleration and cost savings of migrating a Spark application or query to GPU.
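Because the analysis is driven entirely by Spark event logs, the only prerequisite on the CPU side is that your applications write them out. A minimal sketch of turning that on, with a placeholder GCS path for the log directory:

```python
from pyspark.sql import SparkSession

# Sketch: make sure CPU Spark applications write event logs that the
# qualification tool can analyze later. The GCS path is a placeholder.
spark = (
    SparkSession.builder
    .appName("cpu-baseline-job")
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "gs://your-bucket/spark-event-logs")
    .getOrCreate()
)
```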

Challenge #2: It’s unclear which Spark jobs are suitable for GPU migration

Not every application is suitable for GPU acceleration. You don’t want to allocate GPU resources to a Spark job that would not benefit from the resulting acceleration.

Resolution: The workload qualification tool also lets you determine in advance which of your applications or jobs are recommended for running on GPU with the RAPIDS Accelerator for Apache Spark.
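As a rough intuition for what makes a job a good candidate (an illustrative sketch, not the qualification tool’s actual heuristics): columnar, shuffle-heavy work such as joins, aggregations, and sorts tends to accelerate well, while stages dominated by row-at-a-time Python UDFs typically fall back to the CPU.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("gpu-suitability-sketch").getOrCreate()
orders = spark.read.parquet("gs://your-bucket/orders")        # placeholder dataset
customers = spark.read.parquet("gs://your-bucket/customers")  # placeholder dataset

# Join plus aggregation: columnar, shuffle-heavy work that usually maps well to the GPU.
revenue = (
    orders.join(customers, "customer_id")
          .groupBy("region")
          .agg(F.sum("amount").alias("total_revenue"))
)

# A row-at-a-time Python UDF: generally not accelerated, so this stage falls back to the CPU.
mask_name = F.udf(lambda s: s[:3] + "***" if s else None, StringType())
masked = customers.withColumn("masked_name", mask_name(F.col("name")))
```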

Challenge #3: There’s no predetermined way to compute GPU resource requirements

Selecting and sizing hardware for any workload, whether CPU or GPU, can be challenging. Incorrectly setting resources and configurations can impact cost and performance.

Resolution: The RAPIDS Accelerator for Apache Spark bootstrap tool applies optimal configuration settings for the size and shape of your GPU cluster.
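To give a feel for what the bootstrap tool configures, the sketch below shows the kind of cluster-level Spark properties it targets, sized here for a hypothetical worker with a single T4 GPU. The specific values are illustrative assumptions, not output from the tool:

```python
from pyspark.sql import SparkSession

# Illustrative cluster-level defaults for a hypothetical one-GPU worker; in practice
# the bootstrap tool derives these from your actual Dataproc cluster shape.
gpu_cluster_conf = {
    "spark.executor.cores": "16",                  # executor sized to the vCPUs available per GPU
    "spark.executor.memory": "32g",                # leave headroom for off-heap and pinned memory
    "spark.executor.resource.gpu.amount": "1",     # one GPU per executor
    "spark.task.resource.gpu.amount": "0.0625",    # 16 concurrent tasks share each GPU
    "spark.rapids.sql.concurrentGpuTasks": "2",    # limit tasks competing for GPU memory at once
    "spark.sql.files.maxPartitionBytes": "512m",   # larger input partitions keep the GPU busy
}

builder = SparkSession.builder.appName("gpu-etl")
for key, value in gpu_cluster_conf.items():
    builder = builder.config(key, value)
spark = builder.getOrCreate()
```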

Challenge #4: Too many parameters for tuning and configuration are available

When running jobs on the GPU, you want each job to make full use of the hardware. That means choosing the right tuning and configuration parameters from a large number of available options, which is difficult to do by hand.

Resolution: With the new profiling tool, Spark logs from CPU job runs are used to compute the recommended per-app Spark RAPIDS config settings for running a GPU application. 
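These recommendations are per-application, so they are layered on top of the cluster-level defaults when a specific job runs. A small sketch of applying them at runtime; the values are placeholders rather than actual profiling-tool output:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tuned-gpu-job").getOrCreate()

# Per-application overrides on top of the cluster defaults; placeholder values.
spark.conf.set("spark.rapids.sql.concurrentGpuTasks", "3")
spark.conf.set("spark.sql.shuffle.partitions", "200")
```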

Challenge #5: There’s a high cost associated with changing compute infrastructure

It takes time, money, and labor to switch workloads over. This becomes a barrier to trying new technology, even if it addresses key pain points, because of the perceived risk of making an investment in business-critical applications.

Resolution: On the cloud, the switch is simplified. Data scientists can test it out without making code changes, and the cost is minimal. When using Dataproc, you rent your infrastructure on an hourly basis so there isn’t a need to pay upfront for hardware to be shipped to your data center. You can also track your cost and performance differences. 

If you choose to switch back to the CPU, you can revert with next to no effort. 

Migrating workloads with the RAPIDS Accelerator for Apache Spark in Google Cloud Dataproc

Now that we’ve discussed how the RAPIDS Accelerator speeds up your Spark jobs while reducing costs, here’s how to use it in practice.

Qualification

Qualification helps data scientists estimate the cost savings and acceleration potential of the RAPIDS Accelerator for Apache Spark for their existing workloads. Qualification requires an active CPU Spark cluster that is selected for GPU migration. The qualification output shows a list of apps recommended for the RAPIDS Accelerator for Apache Spark, with estimated savings and speedup.

Bootstrap

Bootstrap provisions and updates the GPU Dataproc cluster with optimized RAPIDS Accelerator for Apache Spark configs based on the cluster shape. This ensures that Spark jobs executed on the GPU Dataproc cluster can use all of the resources and complete without errors.

Tuning

Bootstrap ensures that the job runs successfully; tuning then optimizes the RAPIDS Accelerator for Apache Spark configs based on the initial (bootstrap) job run, using Spark event logs. The output shows the recommended per-app RAPIDS Accelerator for Apache Spark config settings.

With these new features, you can address CPU-to-GPU migration concerns and speed up your Spark code implementation without the challenges of increased cost or complex processes. 

Results

Figure 1 shows the speedup and cost comparison for a Spark NDS benchmark* run on Dataproc and NVIDIA GPUs. On this benchmark, we saw a near 5x speedup and 78% reduction in cost compared to running on CPU only.

An NDS Power run on a CPU-only four-node cluster takes 184 minutes, compared to 34 minutes on the same four-node cluster with 8x NVIDIA T4 GPUs. The corresponding Google Cloud Dataproc cost is $22.51 for the CPU-only run and $5.65 with NVIDIA T4 GPUs.
Figure 1. NDS Power run time and associated Google Cloud Dataproc cost, CPU-only compared with GPU-accelerated.

* Benchmark and infrastructure details: CPU-only four-node cluster: 4x n1-standard-32 (32 vCPU, 120 GB RAM). GPU four-node cluster: 4x n1-standard-32 (32 vCPU, 120 GB RAM) with 8x NVIDIA T4 GPUs. NDS stands for NVIDIA Decision Support benchmark, which is derived from the TPC-DS benchmark and is used for internal testing. Results from NDS are not comparable to TPC-DS.

Next steps

With the RAPIDS Accelerator for Apache Spark, you can leverage GPU compute power for your Spark 3 workloads. With clear insights into which jobs are most suitable for acceleration, optimized GPU configurations, and no API changes required, you can run your critical Spark workloads faster. This helps you process more data in the same amount of time while saving on compute costs!

With Dataproc, you can do all of this in a fully supported environment, connected to the rest of the Google Cloud Ecosystem. 

Quickly get started with migrating Spark CPU workloads to GPU by following the GCP Dataproc documentation on GitHub. You can also download the latest version of the RAPIDS Accelerator.
