Fast, Terabyte-Scale Recommender Training Made Easy with NVIDIA Merlin Distributed-Embeddings

Embeddings play a key role in deep learning recommender models. They are used to map encoded categorical inputs in data to numerical values that can be processed by the math layers or multilayer perceptrons (MLPs).
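As a minimal illustration of that mapping (a generic TensorFlow sketch, not code from the library discussed below), an embedding layer turns integer category IDs into dense vectors:

import tensorflow as tf

# One categorical feature with 1,000 possible IDs, embedded into 16 dimensions
embedding = tf.keras.layers.Embedding(input_dim=1000, output_dim=16)
ids = tf.constant([[3, 42, 7]])   # a batch of encoded categorical inputs
vectors = embedding(ids)          # shape (1, 3, 16), ready to feed the MLP layers
print(vectors.shape)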

Embeddings often constitute most of the parameters in deep learning recommender models and can be quite large, even reaching into the terabyte scale. It can be difficult to fit them in a single GPU’s memory during training.

As such, modern recommenders may require a combination of model parallel and data parallel distributed training approaches to achieve reasonable training times, and the best utilization of available GPU compute.

NVIDIA Merlin Distributed-Embeddings, a library for training large embedding-based models (for example, recommenders) in TensorFlow 2, enables you to accomplish this easily with just a few lines of code.

Background

With data-parallel distributed training on GPUs, the whole model is replicated on each GPU worker. A batch of data is split among the multiple GPUs during training, and each device independently operates on its own shard of data.

This allows scaling computations to higher quantities of data with larger batches. The gradients calculated during backpropagation are accumulated across all devices using a reduction operation (for example, horovod.tensorflow.allreduce) for a synchronous parameter update.

With model parallel distributed training, the model parameters are split between various workers. This is a more suitable way to distribute large embedding tables. Training requires using an all-to-all communication primitive (for example, horovod.tensorflow.alltoall) such that workers can get access to the parameters that are not in their partition.
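As a rough sketch of these two primitives (illustrative Horovod-only code, assuming hvd.init() has been called and the script is launched with horovodrun; this is not the Distributed-Embeddings implementation):

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Data parallel: every worker holds the same parameters; local gradients
# are averaged across workers with an allreduce.
local_grad = tf.random.normal([4])
synced_grad = hvd.allreduce(local_grad)

# Model parallel: each worker owns a shard of the embedding tables, so lookup
# requests and results are exchanged with an all-to-all collective. Here each
# worker sends one row to every other worker.
outgoing = tf.random.normal([hvd.size(), 8])
incoming = hvd.alltoall(outgoing)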

In a related previous post, Training a Recommender System on DGX A100 with 100B+ Parameters in TensorFlow 2, Tomasz discussed how distributing embeddings for a 113 billion-parameter DLRM model across multiple NVIDIA GPUs helped achieve a 672x speedup over a CPU-only solution. This significant improvement can potentially bring down training times from days to minutes! This is accomplished by distributing embedding tables through model parallelism and performing the much-smaller math-intensive MLP layer computation through data parallelism.

Compared to storing the embeddings in CPU memory, this hybrid approach enables you to use the high memory bandwidth of GPU memory for the memory-bound embedding lookups. It also accelerates the MLP layers by spreading their computation across several GPU devices. For reference, an NVIDIA A100-80GB GPU has 80 GB of HBM2 memory with a bandwidth of over 2 TB/s.

Diagram shows how the embedding tables can be split across several GPU workers, and the dense part of the network is replicated in a data-parallel fashion. An all-to-all communication primitive is used so that the GPU workers can get access to embeddings not part of the shard they store in their GPU memory.
Figure 1.  General “hybrid-parallel” approach for training large recommender systems

The embedding tables can be split “table-wise” (for example, embedding tables 0 and N), “column-wise” (for example, embedding table 2), or “row-wise”. The MLP layers are replicated across all GPUs. Numerical features can be fed directly into the MLP layers and are not represented in the figure.

However, implementing such a complex hybrid parallel training approach is not trivial and requires a domain expert to engineer several hundred lines of low-level code to develop and optimize training.

To make it more widely accessible, the NVIDIA Merlin Distributed-Embeddings library provides an easy-to-use wrapper to democratize model parallelism in TensorFlow 2 with only three lines of Python code. It provides a scalable model parallel wrapper that automatically distributes embedding tables to multiple GPUs, in addition to efficient embedding operations that cover and extend TensorFlow's embedding functionalities. Here's how it enables hybrid parallelism.

Distributed model parallel

NVIDIA Merlin Distributed-Embeddings provides the distributed_embeddings.dist_model_parallel module. It helps distribute embeddings across several GPU workers without any complex code to handle cross-worker communication with primitives like all2all. The following code example shows the usage of this API:

import tensorflow as tf
import dist_model_parallel as dmp

class MyEmbeddingModel(tf.keras.Model):
  def __init__(self, table_sizes):
    ...
    self.embedding_layers = [tf.keras.layers.Embedding(input_dim, output_dim) for input_dim, output_dim in table_sizes]
    # 1. Add this line to wrap list of embedding layers used in the model
    self.embedding_layers = dmp.DistributedEmbedding(self.embedding_layers)
  def call(self, inputs):
    # embedding_outputs = [e(i) for e, i in zip(self.embedding_layers, inputs)]
    embedding_outputs = self.embedding_layers(inputs)
    ...

To run the dense layers in a data-parallel fashion with Horovod, replace Horovod's DistributedGradientTape and broadcast_variables methods with their equivalents in Distributed-Embeddings. The following example is taken directly from the Horovod documentation and modified accordingly.

@tf.function
def training_step(inputs, labels, first_batch):
  with tf.GradientTape() as tape:
    probs = model(inputs)
    loss_value = loss(labels, probs)

  # 2. Change Horovod Gradient Tape to dmp tape
  # tape = hvd.DistributedGradientTape(tape)
  tape = dmp.DistributedGradientTape(tape)
  grads = tape.gradient(loss_value, model.trainable_variables)
  opt.apply_gradients(zip(grads, model.trainable_variables))

  if first_batch:
    # 3. Change Horovod broadcast_variables to dmp's
    # hvd.broadcast_variables(model.variables, root_rank=0)
    dmp.broadcast_variables(model.variables, root_rank=0)
  return loss_value

With these minor changes, you are all set with a hybrid-parallel training step! 
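As a usage sketch, the resulting script is launched the same way as any other Horovod program; assuming the training code above is saved as train.py (a placeholder filename), an 8-GPU run looks like this:

# train.py is a placeholder name for the script containing the code above
$ horovodrun -np 8 python train.py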

We also provide complete examples for training a DLRM model with Criteo 1TB click-logs data, as well as synthetic data that scales the model size up to 22.8 TiB.

Performance

To demonstrate the benefits of using NVIDIA Merlin Distributed-Embeddings, we show benchmarks on a DLRM model trained on the Criteo 1TB dataset, as well as various synthetic models with up to ~3 TiB embedding table sizes.

DLRM benchmark on Criteo dataset

Benchmarks indicate that we retain performance similar to expert-engineered code with a much simpler API. The NVIDIA DeepLearningExamples DLRM code that uses TensorFlow 2 has also been updated to leverage hybrid-parallel training with NVIDIA Merlin Distributed-Embeddings. For more information, see our previous post, Training a Recommender System on DGX A100 with 100B+ Parameters in TensorFlow 2.

The benchmarks section in the README provides more insight into performance numbers.

A DLRM model with 113 billion parameters (421 GiB model size) was trained on the Criteo Terabyte Click Logs dataset, over three different hardware setups:

  • A CPU-only solution.
  • A single-GPU solution, where CPU memory is used to store the largest embedding tables.
  • A hybrid-parallel solution using an NVIDIA DGX A100-80GB with 8 GPUs. This leverages the model parallel wrapper and the Embedding API provided by NVIDIA Merlin Distributed-Embeddings.
| Hardware | Description | Training throughput (samples/second) | Speedup over CPU |
| --- | --- | --- | --- |
| 2 x AMD EPYC 7742 | Both MLP layers and embeddings on CPU | 17.7k | 1x |
| 1 x A100-80GB; 2 x AMD EPYC 7742 | Large embeddings on CPU, everything else on GPU | 768k | 43x |
| DGX A100 (8 x A100-80GB) | Hybrid parallel with NVIDIA Merlin Distributed-Embeddings, whole model on GPU | 12.1M | 683x |

Table 1. Training throughput and speedup for various setups

We observe that a Distributed-Embeddings solution on a DGX-A100 provides a whopping 683x speedup over a CPU-only solution! We also notice a significant improvement in performance over a single-GPU solution. This is because retaining all the embeddings in GPU memory eliminates the overhead of embedding lookups over the CPU-GPU interface.

Synthetic models benchmark

To demonstrate the scalability of the solution further, we created synthetic DLRM models of varying sizes (Table 2). For more information about model generation methodology and the training script, see the NVIDIA-Merlin/distributed-embeddings GitHub repo.

| Model | Total number of embedding tables | Total embedding size (GiB) |
| --- | --- | --- |
| Tiny | 55 | 4.2 |
| Small | 107 | 26.3 |
| Medium | 311 | 206.2 |
| Large | 612 | 773.8 |
| Jumbo | 1,022 | 3,109.5 |

Table 2. Synthetic model sizes

Each synthetic model was trained using one or more DGX-A100-80GB nodes, with a global batch size of 65,536, and the Adagrad optimizer. You can see from Table 3 that NVIDIA Merlin Distributed-Embeddings can easily train terabyte-scale models on hundreds of GPUs.

| Model | 1 GPU | 8 GPU | 16 GPU | 32 GPU | 128 GPU |
| --- | --- | --- | --- | --- | --- |
| Tiny | 17.6 | 3.6 | 3.2 | | |
| Small | 57.8 | 14.0 | 11.6 | 7.4 | |
| Medium | | 64.4 | 44.9 | 31.1 | 17.2 |
| Large | | | | 65.0 | 33.4 |
| Jumbo | | | | | 102.3 |

Table 3. Training step time (ms) for synthetic models on various hardware configurations

On the other hand, even for models that fit into a single GPU, Distributed-Embeddings model parallelism still provides a substantial speedup on multiple GPUs compared to conventional data parallelism. This is shown in Table 4, where the Tiny model runs on a DGX A100-80GB.

| Solution | 1 GPU | 2 GPU | 4 GPU | 8 GPU |
| --- | --- | --- | --- | --- |
| NVIDIA Merlin Distributed-Embeddings model parallel | 17.7 | 11.6 | 6.4 | 4.2 |
| Native TensorFlow data parallel | 19.9 | 20.2 | 21.2 | 22.3 |

Table 4. Comparing training step time (ms) for the "Tiny" model (4.2 GiB) between the NVIDIA Merlin Distributed-Embeddings model-parallel solution and a native TensorFlow data-parallel solution for embeddings

A global batch size of 65,536 and the Adagrad optimizer were used for this experiment.

Conclusion

In this post, we introduced the NVIDIA Merlin Distributed-Embeddings library, which enables scalable and efficient model-parallel training of embedding-based, deep learning models on NVIDIA GPUs with just a few lines of code. To get started, try the examples for scalable training with synthetic data and training a DLRM model on Criteo data.

Now Available: NVIDIA DRIVE AGX Orin Developer Kit with DRIVE OS 6

Autonomous vehicle developers now have access to flexible, scalable, and high-performance hardware and software to build the next generation of safer, more efficient transportation.

NVIDIA DRIVE AGX Orin Developer Kit is now available for general access. Powered by a single Orin system-on-a-chip (SoC), the AI compute platform includes the hardware, software, and sample applications needed to develop production-level autonomous vehicles. It’s also modular, sharing the same design as NVIDIA Jetson, NVIDIA Isaac, and NVIDIA Clara AGX platforms.

With rich automotive I/O, you have the flexibility to expand and iterate upon your autonomous driving solutions. DRIVE AGX Orin includes a base kit for bench development and an add-on vehicle kit for vehicle installation. The platform also has a smaller footprint than previous generations, with system and accessories included in a single box.

Additionally, NVIDIA DRIVE OS 6 is now available on the NVIDIA DRIVE Developer Download page, providing the latest operating system purpose-built for autonomous vehicles. 

NVIDIA DRIVE OS includes NvMedia for sensor input processing, NVIDIA CUDA libraries for efficient parallel computing implementations, and NVIDIA TensorRT for real-time AI inference.

The current version available is DRIVE OS 6.0.4, which supports sensors included in the NVIDIA DRIVE Hyperion 8.1 platform architecture.

You can experience faster downloads and streamlined development environment setup by installing software with NGC DRIVE OS Docker containers or NVIDIA SDK Manager.

For more information about the benefits that DRIVE OS 6.0 provides, see the NVIDIA DRIVE OS product page.

Before downloading, register for an NVIDIA Developer account and membership in the NVIDIA DRIVE AGX SDK Developer Program. Please submit questions or feedback to the DRIVE AGX Orin general forum.

Solving Automatic Speech Recognition Deployment Challenges

Successfully deploying an automatic speech recognition (ASR) application can be a frustrating experience. For example, it is difficult for an ASR system to correctly identify words while maintaining low latency, considering the many different dialects and pronunciations that exist.

Whether you are using a commercial or open-source solution, there are many challenges to consider when building an ASR application.

In this post, I highlight major pain points that developers face when adding ASR capabilities to applications. I share how to approach and overcome these challenges, using NVIDIA Riva speech AI SDK as an example.

Challenges of building ASR applications

Here are a few challenges present in the creation of any ASR system:

  • High accuracy
  • Low latency
  • Compute resource allocation
  • Flexible deployment and scalability
  • Customization
  • Monitoring and tracking

High accuracy

One key metric for measuring speech recognition accuracy is the word error rate (WER). WER is defined as the number of word errors in the transcription (substitutions, deletions, and insertions) divided by the total number of words in the labeled reference transcript.
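As a minimal, self-contained sketch of this metric (purely illustrative; not part of any SDK discussed here), WER can be computed as the word-level edit distance between reference and hypothesis, divided by the reference length:

# Purely illustrative WER computation
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Edit-distance table over words: substitutions, deletions, and insertions
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                         # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                         # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# 4 word errors against a 2-word reference gives a WER of 2.0
print(word_error_rate("recognize speech", "wreck a nice beach"))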

Several factors cause transcription errors in ASR models, leading to misinterpretation of information:

  • Quality of the training dataset
  • Different dialects and pronunciations
  • Accents and variations in speech
  • Custom or domain-specific words and acronyms
  • Contextual relationship of words
  • Differentiating phonetically similar sentences

Due to these factors, it is difficult to build a robust ASR model with a low WER score.

Low latency

A conversational AI application is an end-to-end pipeline composed of speech AI and natural language processing (NLP) components.

For any conversational AI application, response time is a critical factor in making conversations feel natural. It would not be practical to converse with a bot if the customer receives a response only after a minute of waiting.

It has been observed that a conversational AI application should deliver a latency of less than 300 ms. It therefore becomes critical to keep the speech AI model's latency far below that 300 ms budget so it can be integrated into an end-to-end pipeline for real-time conversational AI applications.

Many factors affect the overall latency of an ASR model:

  • Model size: Large, complex models have better accuracy but require a lot of computational power and add latency compared to smaller models; that is, the inference cost is high.
  • Hardware: Edge deployment of such complex models further adds to the difficulty of meeting latency requirements.
  • Network bandwidth: Sufficient bandwidth is needed for streaming audio content and transcripts, especially in the case of cloud-based deployment.

Compute resource allocation

Optimizing a model and its resource utilization applies to all AI models and is not specific to ASR. However, it is a critical factor that impacts both overall latency and the compute cost of running any AI application.

The whole point of optimizing a model is to reduce the inference cost, at both the compute level and the latency level. However, the models available online for a particular architecture are not all created equal: they differ in code quality and can show dramatic differences in performance.

They also do not all respond the same way to knowledge distillation, pruning, quantization, and other optimization techniques intended to improve inference performance without impacting accuracy.

Flexible deployment and scalability

Creating an accurate and efficient model is only a small fraction of any real-time AI application. The required surrounding infrastructure is vast and complex. For example, deployment infrastructure should include:

  • Streaming support
  • Resource management service
  • Servicing infrastructure
  • Analytics tool support
  • Monitoring service

Creating a custom, end-to-end optimized deployment pipeline that meets the latency requirements of an ASR application is challenging because it requires optimization and acceleration at each pipeline stage.

Based on the number of audio streams that must be supported at a given instance, your speech recognition application should be able to auto-scale the application deployment to provide acceptable performance.

Customization

Getting a model to work out of the box is always the goal. However, the performance of currently available models depends on the dataset used during the training phase. Models typically work great for the use cases they've already been exposed to, but the same model's performance may degrade as soon as it is deployed in a different domain.

Specifically, in the case of ASR, the model’s performance depends on accent or language and variations in speech. You should be able to customize models based on the application use case.

For instance, speech recognition models deployed in healthcare- or finance-related applications require support for domain-specific vocabulary. This vocabulary differs from what is normally used during ASR model training.

To support regional languages for ASR, you need a complete set of training pipelines to easily customize the model and efficiently handle different dialects.

Monitoring and tracking

Real-time monitoring and tracking help in getting instant insights, alerts, and notifications so that you can take timely corrective actions. This helps in tracking the resource consumption as per the incoming traffic so that the corresponding application can be auto-scaled. Quota limits can also be set to minimize the infrastructure cost without impacting the overall throughput.

Capturing all these stats requires integrating multiple libraries to capture performance at various stages of the ASR pipeline.

Examples of how the Riva SDK addresses ASR challenges

Advanced SDKs can be used to conveniently add a voice interface to your applications. In this post, I demonstrate how a GPU-accelerated SDK like Riva can be applied to solve these challenges when building speech recognition applications.

High accuracy and compute optimization

You can use pretrained Riva speech models in NGC that can be fine-tuned with TAO Toolkit on a custom data set, further accelerating domain-specific model development by 10x.

All NGC models are optimized and sped up for GPU deployment to achieve better recognition accuracy. These models are also fully supported by NVIDIA TensorRT optimization. Riva's high-performance inference is powered by TensorRT optimizations and served using NVIDIA Triton Inference Server to optimize the overall compute requirement and, in turn, improve server throughput.

For example, several ASR models on NGC are further optimized as part of the Riva pipeline for better performance.

Continued optimization across the entire Riva stack, from models and software to hardware, has delivered a 12x gain compared to the previous generation.

Graph showing Riva ASR performance model acceleration compared to the previous generation.
Figure 1. ASR performance acceleration using NVIDIA Riva

Low latency

Latencies and throughput measurements for streaming and offline configurations are reported under the ASR performance section of Riva documentation.

In the "streaming low latency" Riva ASR deployment mode, the average latency is far less than 50 ms in most cases. With such ASR models, it becomes much easier to build a real-time conversational AI pipeline while still staying within the roughly 300 ms end-to-end latency budget discussed earlier.

Flexible deployment and scaling

Deploying your speech recognition application with ease on any platform requires end-to-end support. The Riva SDK provides flexibility at each step, from fine-tuning models on domain-specific datasets to customizing pipelines. It can also be deployed in the cloud, on-premises, at the edge, and on embedded devices.

To support scaling, Riva is fully containerized and can scale to hundreds and thousands of parallel streams. Riva is also available in the NGC Helm repository as a chart designed for push-button deployment to a Kubernetes cluster.

For more information about deployment best practices, see the Autoscaling Riva Deployment with Kubernetes for Conversational AI in Production on-demand GTC session.

Customization

Workflow diagram illustrates a speech input being processed in a Riva ASR pipeline and a resulting text output.
Figure 2. Customization techniques range from word boosting to fine-tuning punctuation and capitalization models

Customization techniques are helpful when out-of-the-box Riva models fall short dealing with challenging scenarios not seen in the training data. This might include recognizing narrow domain terminologies, new accents, or noisy environments.

SDKs like Riva support customization ranging from word boosting all the way to letting end users custom-train their own acoustic models.

Riva speech skills also provide high-quality, pretrained models across a variety of languages. For more information about all the models for supported languages, see the Language Support section.

Monitoring and tracking

In Riva, the underlying Triton Inference Server metrics are exposed to end users for customization and dashboard creation. These metrics are available by accessing the metrics endpoint.

NVIDIA Triton provides Prometheus metrics indicating GPU and request statistics. This helps in monitoring and tracking the production deployment setup.
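As a quick illustration (assuming a local deployment and Triton's default Prometheus metrics port of 8002; the port mapping in your Riva deployment may differ), you can inspect the raw metrics endpoint directly:

# 8002 is Triton's default metrics port; your Riva deployment may map it differently
$ curl localhost:8002/metrics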

Key takeaways

This post provides you with a high-level overview of common pain points that arise when developing AI applications with ASR capabilities. Being aware of the factors impacting your ASR application’s overall performance helps you streamline and improve the end-to-end development process.

I also shared a few Riva capabilities that help mitigate these problems and offered customization tips for ASR applications to support your domain-specific use case.

For more information, see the following resources:

Top AI Video Analytics Sessions at GTC 2022

Register now to experience the most advanced developer tools and learn how experts across industries are using vision AI to increase operational efficiency.

UN Economic Commission for Africa Engages NVIDIA to Boost Data Science in 10 Nations

NVIDIA is collaborating with the United Nations Economic Commission for Africa (UNECA) to equip governments and developer communities in 10 nations with data science training and technology to support more informed policymaking and accelerate how resources are allocated. The initiative will empower the countries’ national statistical offices — agencies that handle population censuses data, economic…

Fraunhofer Research Leads Way Into Future of Robotics

Joseph Fraunhofer was a 19th-century pioneer in optics who brought together scientific research with industrial applications. Fast forward to today and Germany’s Fraunhofer Society — Europe’s largest R&D organization — is setting its sights on the applied research of key technologies, from AI to cybersecurity to medicine. Its Fraunhofer IML unit is aiming to push…

Rendered.ai Founder and CEO Nathan Kundtz on Using AI to Build Better AI

Data is the fuel that makes artificial intelligence run. Training machine learning and AI systems requires data. And the quality of datasets has a big impact on the systems’ results. But compiling quality real-world data for AI and ML can be difficult and expensive. That’s where synthetic data comes in. The guest for this week’s…

OBS Studio to Release Software Update 28.0 With NVIDIA Broadcast Features ‘In the NVIDIA Studio’

In the NVIDIA Studio celebrates the Open Broadcaster Software (OBS) Studio’s 10th anniversary and its 28.0 software release. Plus, popular streamer WATCHHOLLIE shares how she uses OBS and a GeForce RTX 3080 GPU in a single-PC setup to elevate her livestreams.

Hands-on Access to VMware’s vSphere on NVIDIA BlueField DPUs with NVIDIA LaunchPad

The increasing adoption of cloud-native workloads is causing a significant shift in infrastructure architecture to support next-generation applications such as AI and big data. Infrastructure must evolve to provide composability and flexibility using virtualization, containers, or bare metal servers.

Traditional software-defined infrastructure provides flexibility but suffers from performance and scalability limitations, as up to 30% of server CPU cores may be consumed by workloads such as networking, storage, and security.

VMware saw this as an opportunity to enhance vSphere infrastructure and reengineered it to take a hardware-based, disaggregated approach. The infrastructure software stack is now tightly coupled to the hardware stack. The result is VMware vSphere Distributed Services Engine (VMware Project Monterey), which integrates closely with the NVIDIA BlueField DPU to provide an evolutionary architectural approach for data center, cloud, and edge. It addresses the changing requirements of next-gen and cloud-native workloads. 

To help organizations test the benefits of vSphere on BlueField DPUs, NVIDIA LaunchPad has curated a lab with exclusive access to live demonstrations and self-paced learning. The lab is designed to provide your IT team with deep, practical experience in a hosted environment before starting an on-premises deployment. 

vSphere Distributed Services Engine

vSphere Distributed Services Engine is a software-defined and hardware-accelerated architecture for VMware’s Cloud Foundation. It provides a breakthrough solution that supports virtualization, containers, and scalable management.

VMware’s vSphere Distributed Services Engine architectural changes, re-architecting VMware’s ESXi for Multi-Cloud Environments.
Figure 1. vSphere Distributed Services Engine architecture diagram

NVIDIA BlueField

The BlueField DPU offloads, accelerates, and isolates the infrastructure workloads, allowing the server CPUs to focus on core applications and revenue-generating tasks. The integration of VMware vSphere and NVIDIA BlueField simplifies building a cloud-ready infrastructure, while providing a consistent service quality in multi-cloud environments.

VMware enables workload composability and portability that supports multi-cloud. At the same time, BlueField handles critical networking, storage, security, and management, including telemetry tasks, freeing up the CPUs to run business and tenant applications. The BlueField DPU also provides security isolation by running firewall, key management, and IDS/IPS on its Arm cores, in a separate domain from the applications.

NVIDIA LaunchPad

LaunchPad gives you access to dedicated hardware and software through a remote lab, so that IT teams can walk through the entire process of deploying and managing vSphere on BlueField DPUs. The lab is designed to provide your IT team with deep, practical experience in a hosted environment before starting your on-premises deployment.

Background

To understand why this is important, start by looking at how modern applications have changed. Modern applications are driving many underlying hardware infrastructure requirements, including the following:

  • Increased networking traffic that creates performance and scale challenges
  • Hardware acceleration requirements that drive significant operational complexities
  • A lack of a clear definition of the traditional data center perimeter, which intensifies the need for new security models. 

Modern applications also require more server CPU cycles. As application compute requirements continue to grow, the infrastructure's own demand for CPU cycles increasingly competes with the applications themselves.

Specialized data processing units (DPUs) have been developed to offload, accelerate and isolate CPU networking, storage, security, and management tasks. With DPUs, organizations free up server-CPU cycles for core application processing, which accelerates job completion time over a robust zero-trust data center infrastructure. 

Selected functions that used to run on the core CPU are offloaded, accelerated, and isolated on BlueField to support new possibilities, including the following:

  • Improved performance for application and infrastructure services
  • Enhanced visibility, application security, and observability
  • Isolated security capabilities
  • Enhanced data center efficiency and reduced enterprise, edge, and cloud costs

Next steps

VMware vSphere is leading the shift to advanced hybrid-cloud data center architectures, which benefit from the hypervisor and accelerated software-defined networking, security, and storage. With access to vSphere on BlueField DPU preconfigured clusters, you can explore the evolution of VMware Cloud Foundation and take advantage of the disruptive hardware capabilities of servers equipped with BlueField DPUs.

Interested organizations can register for the NVIDIA and VMware vSphere on BlueField DPU early access program. If you are already a member, you can get immediate access.

For more information about the NVIDIA and VMware collaboration, see NVIDIA & VMware: A New Partnership, A New Data Center.

Accelerating ETL on KubeFlow with RAPIDS

In the machine learning and MLOps world, GPUs are widely used to speed up model training and inference, but what about the other stages of the workflow like ETL pipelines or hyperparameter optimization?

Within the RAPIDS data science framework, ETL tools are designed to have a familiar look and feel to data scientists working in Python. Do you currently use Pandas, NumPy, Scikit-learn, or other parts of the PyData stack within your KubeFlow workflows? If so, you can use RAPIDS to accelerate those parts of your workflow by leveraging the GPUs likely already available in your cluster.
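As a minimal sketch of what that looks like in practice (the file name and column names here are hypothetical), much pandas-style code needs little more than an import swap to run on the GPU with cuDF:

import cudf  # instead of: import pandas as pd

df = cudf.read_csv("transactions.csv")             # hypothetical input file; GPU-backed DataFrame
summary = df.groupby("category")["amount"].mean()  # same API shape as pandas
print(summary)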

In this post, I demonstrate how to drop RAPIDS into a KubeFlow environment. You start with using RAPIDS in the interactive notebook environment and then scale beyond your single container to use multiple GPUs across multiple nodes with Dask.

Optional: Installing KubeFlow with GPUs

This post assumes you are already somewhat familiar with Kubernetes and KubeFlow. To explore how you can use GPUs with RAPIDS on KubeFlow, you need a KubeFlow cluster with GPU nodes. If you already have a cluster or are not interested in KubeFlow installation instructions, feel free to skip ahead.

KubeFlow is a popular machine learning and MLOps platform built on Kubernetes for designing and running machine learning pipelines, training models, and providing inference services.

KubeFlow also provides a notebook service that you can use to launch an interactive Jupyter server in your Kubernetes cluster and a pipeline service with a DSL library, written in Python, to create repeatable workflows. Tools for adjusting hyperparameters and running a model inference server are also accessible. This is essentially all the tooling that you need for building a robust machine learning service.

For this post, you use Google Kubernetes Engine (GKE) to launch a Kubernetes cluster with GPU nodes and install KubeFlow onto it, but any KubeFlow cluster with GPUs will do.

Creating a Kubernetes cluster with GPUs

First, use the gcloud CLI to create a Kubernetes cluster.

$ gcloud container clusters create rapids-gpu-kubeflow \
  --accelerator type=nvidia-tesla-a100,count=2 --machine-type a2-highgpu-2g \
  --zone us-central1-c --release-channel stable
Note: Machines with GPUs have certain limitations which may affect your workflow. Learn more at https://cloud.google.com/kubernetes-engine/docs/how-to/gpus
Creating cluster rapids-gpu-kubeflow in us-central1-c... 
Cluster is being health-checked (master is healthy)...
Created 
kubeconfig entry generated for rapids-gpu-kubeflow.
NAME             	LOCATION   	MASTER_VERSION	MASTER_IP   	MACHINE_TYPE   NODE_VERSION  	NUM_NODES  STATUS
rapids-gpu-kubeflow  us-central1-c  1.21.12-gke.1500  34.132.107.217  a2-highgpu-2g  1.21.12-gke.1500  3      	RUNNING

With this command, you’ve launched a GKE cluster called rapids-gpu-kubeflow. You’ve specified that it should use nodes of type a2-highgpu-2g, each with two A100 GPUs.

KubeFlow also requires a stable version of Kubernetes, so you specified that along with the zone in which to launch the cluster.

Next, install the NVIDIA drivers onto each node.

$ kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml
daemonset.apps/nvidia-driver-installer created

Verify that the NVIDIA drivers are successfully installed.

$ kubectl get po -A --watch | grep nvidia
kube-system   nvidia-driver-installer-6zwcn                               	1/1 	Running   0      	8m47s
kube-system   nvidia-driver-installer-8zmmn                               	1/1 	Running   0      	8m47s
kube-system   nvidia-driver-installer-mjkb8                               	1/1 	Running   0      	8m47s
kube-system   nvidia-gpu-device-plugin-5ffkm                              	1/1 	Running   0      	13m
kube-system   nvidia-gpu-device-plugin-d599s                              	1/1 	Running   0      	13m
kube-system   nvidia-gpu-device-plugin-jrgjh                              	1/1 	Running   0      	13m

After your drivers are installed, create a quick sample pod that uses some GPU compute to make sure that everything is working as expected.

$ cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    # Any CUDA vectorAdd sample image with a CUDA version your driver supports works here
    image: "nvidia/samples:vectoradd-cuda11.2.1"
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

$ kubectl logs pod/cuda-vectoradd

If you see Test PASSED in the output, you can be confident that your Kubernetes cluster has GPU compute set up correctly. Next, clean up that pod.

$ kubectl delete po cuda-vectoradd
pod "cuda-vectoradd" deleted

Installing KubeFlow

Now that you have Kubernetes, install KubeFlow. KubeFlow uses kustomize, so be sure to have that installed.

$ curl -s "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh" | bash

Then, install KubeFlow by cloning the KubeFlow manifests repo, checking out the latest release, and applying them.

$ git clone https://github.com/kubeflow/manifests
$ cd manifests
$ git checkout v1.5.1  # Or whatever the latest release is
$ while ! kustomize build example | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done

After all the resources have been created, KubeFlow still has to bootstrap itself on your cluster. Even after this command finishes, things may not be ready yet. This can take upwards of 15 minutes.

Eventually, you should see a full list of KubeFlow services in the kubeflow namespace.

$ kubectl get po -n kubeflow
NAME                                                     	READY   STATUS	RESTARTS   AGE
admission-webhook-deployment-667bd68d94-4n62z            	1/1 	Running   0      	10m
cache-deployer-deployment-79fdf9c5c9-7cpn7               	1/1 	Running   2      	10m
cache-server-6566dc7dbf-7ndm5                            	1/1 	Running   0      	10m
centraldashboard-8fc7d8cc-q62cd                          	1/1 	Running   0      	10m
jupyter-web-app-deployment-84c459d4cd-krxq4              	1/1 	Running   0      	10m
katib-controller-68c47fbf8b-bjvst                        	1/1 	Running   0      	10m
katib-db-manager-6c948b6b76-xtrwz                        	1/1 	Running   2      	10m
katib-mysql-7894994f88-6ndtp                             	1/1 	Running   0      	10m
katib-ui-64bb96d5bf-v598l                                	1/1 	Running   0      	10m
kfserving-controller-manager-0                           	2/2 	Running   0      	9m54s
kfserving-models-web-app-5d6cd6b5dd-hp2ch                	1/1 	Running   0      	10m
kubeflow-pipelines-profile-controller-69596b78cc-zrvhc   	1/1 	Running   0      	10m
metacontroller-0                                         	1/1 	Running   0      	9m53s
metadata-envoy-deployment-5b4856dd5-r7xnn                	1/1 	Running   0      	10m
metadata-grpc-deployment-6b5685488-9rd9q                 	1/1 	Running   6      	10m
metadata-writer-548bd879bb-7fr7x                         	1/1 	Running   1      	10m
minio-5b65df66c9-dq2rr                                   	1/1 	Running   0      	10m
ml-pipeline-847f9d7f78-pl7z5                             	1/1 	Running   0      	10m
ml-pipeline-persistenceagent-d6bdc77bd-wd4p8             	1/1 	Running   2      	10m
ml-pipeline-scheduledworkflow-5db54d75c5-6c5vv           	1/1 	Running   0      	10m
ml-pipeline-ui-5bd8d6dc84-sg9t8                          	1/1 	Running   0      	9m59s
ml-pipeline-viewer-crd-68fb5f4d58-wjhvv                  	1/1 	Running   0      	9m59s
ml-pipeline-visualizationserver-8476b5c645-96ptw         	1/1 	Running   0      	9m59s
mpi-operator-5c55d6cb8f-vwr8p                            	1/1 	Running   0      	9m58s
mysql-f7b9b7dd4-pv767                                    	1/1 	Running   0      	9m58s
notebook-controller-deployment-6b75d45f48-rpl5b          	1/1 	Running   0      	9m57s
profiles-deployment-58d7c94845-gbm8m                     	2/2 	Running   0      	9m57s
tensorboard-controller-controller-manager-775777c4c5-b6c2k   2/2 	Running   2      	9m56s
tensorboards-web-app-deployment-6ff79b7f44-g5cr8         	1/1 	Running   0      	9m56s
training-operator-7d98f9dd88-hq6v4                       	1/1 	Running   0      	9m55s
volumes-web-app-deployment-8589d664cc-krfxs              	1/1 	Running   0      	9m55s
workflow-controller-5cbbb49bd8-b7qmd                     	1/1 	Running   1      	9m55s

After all your pods are in a Running state, port forward the KubeFlow web user interface, and access it in your browser.
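
A minimal sketch of that port-forward step (the service name below is the Istio ingress gateway installed by the KubeFlow manifests; adjust it if your installation differs):

# Default ingress for the KubeFlow manifests; adjust if your installation differs
$ kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80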

Navigate to 127.0.0.1:8080 and log in with the default credentials user@example.com and 12341234. Then, you should see the KubeFlow dashboard (Figure 1).

Screenshot of the KubeFlow dashboard
Figure 1. KubeFlow dashboard

Using RAPIDS in KubeFlow notebooks

To get started with RAPIDS on your KubeFlow cluster, start a notebook session using the official RAPIDS container images.

Before launching your notebook session, you must create a configuration profile that you will need later when you start using Dask. To do this, apply the following manifest:

# configure-dask-dashboard.yaml
apiVersion: "kubeflow.org/v1alpha1"
kind: PodDefault
metadata:
  name: configure-dask-dashboard
spec:
  selector:
    matchLabels:
      configure-dask-dashboard: "true"
  desc: "configure dask dashboard"
  env:
    - name: DASK_DISTRIBUTED__DASHBOARD__LINK
      value: "{NB_PREFIX}/proxy/{host}:{port}/status"
  volumeMounts:
    - name: jupyter-server-proxy-config
      mountPath: /root/.jupyter/jupyter_server_config.py
      subPath: jupyter_server_config.py
  volumes:
    - name: jupyter-server-proxy-config
      configMap:
        name: jupyter-server-proxy-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: jupyter-server-proxy-config
data:
  jupyter_server_config.py: |
    c.ServerProxy.host_allowlist = lambda app, host: True

Create a file with the contents of this code example, and then apply it into the user@example.com user namespace with kubectl.

$ kubectl apply -n kubeflow-user-example-com -f configure-dask-dashboard.yaml

Now, choose a RAPIDS version to use. Typically, you want the container image for the latest release. The default CUDA version installed on GKE Stable is 11.4, so choose an 11.4 image. (From CUDA 11.5 onward, minor versions are backward compatible, so the exact choice matters less.) Copy the container image name from the installation command:

rapidsai/rapidsai-core:22.06-cuda11.4-runtime-ubuntu20.04-py3.9

Back in KubeFlow, choose the Notebooks tab on the left and choose New Notebook.

On this page, you must set a few configuration options:

  • Name: rapids
  • Namespace: kubeflow-user-example-com
  • Custom Image: Select this checkbox.
  • Custom Image: rapidsai/rapidsai-core:22.06-cuda11.4-runtime-ubuntu20.04-py3.9
  • Requested CPUs: 2
  • Requested memory in Gi: 8
  • Number of GPUs: 1
  • GPU Vendor: NVIDIA

Scroll down to Configurations, check the configure dask dashboard option, scroll to the bottom of the page, and then choose Launch. You should see it starting up in your list of notebooks. The RAPIDS container images are packed full of amazing tools, so this step can take a little while.

When the notebook is ready, to launch Jupyter, choose Connect. Verify that everything works okay by opening a terminal window and running nvidia-smi (Figure 2).

Screenshot of a terminal window open in Jupyter Lab with the output of the nvidia-smi command listing one A100 GPU.
Figure 2. The nvidia-smi command is a great way to check that your GPU is set up

Success! Your A100 GPU is being passed through into your notebook container.

The RAPIDS container that you chose also comes with some example notebooks, which you can find in /rapids/notebooks. Make a quick symbolic link to these from your home directory so that you can navigate to them using the file explorer on the left:

ln -s /rapids/notebooks /home/jovyan/notebooks

Navigate to those example notebooks and explore all the libraries that RAPIDS offers. For example, ETL developers that use pandas should check out the cuDF notebooks for examples of accelerated DataFrames.

Scaling your RAPIDS workflows

Many RAPIDS libraries also support scaling out your computations onto multiple GPUs spread over many nodes for added acceleration. To do this, use Dask, an open-source Python library for distributed computing.

To use Dask, create a scheduler and some workers to perform your calculations. These workers also need GPUs and the same Python environment as the notebook session. Dask has an operator for Kubernetes that you can use to manage Dask clusters on your KubeFlow cluster, so install that now.

Installing the Dask Kubernetes operator

To install the operator, you create the operator itself and its associated custom resources. For more information, see Installing in the Dask documentation.

In the terminal window that you used to create your KubeFlow cluster, run the following commands:

$ kubectl apply -f https://raw.githubusercontent.com/dask/dask-kubernetes/main/dask_kubernetes/operator/deployment/manifests/daskcluster.yaml

$ kubectl apply -f https://raw.githubusercontent.com/dask/dask-kubernetes/main/dask_kubernetes/operator/deployment/manifests/daskworkergroup.yaml

$ kubectl apply -f https://raw.githubusercontent.com/dask/dask-kubernetes/main/dask_kubernetes/operator/deployment/manifests/daskjob.yaml

$ kubectl apply -f https://raw.githubusercontent.com/dask/dask-kubernetes/main/dask_kubernetes/operator/deployment/manifests/operator.yaml

Verify that your resources were applied successfully by listing your Dask clusters. You shouldn't see any yet, but the command should succeed.

$ kubectl get daskclusters
No resources found in default namespace.

You can also check that the operator pod is running and ready to launch new Dask clusters.

$ kubectl get pods -A -l application=dask-kubernetes-operator
NAMESPACE       NAME                                        READY   STATUS    RESTARTS   AGE
dask-operator   dask-kubernetes-operator-775b8bbbd5-zdrf7   1/1     Running   0          74s

Lastly, make sure that your notebook session can create and manage the Dask custom resources. To do this, edit the kubeflow-kubernetes-edit cluster role that gets applied to your notebook pods. Add a new rule to the rules section for this role to allow everything in the kubernetes.dask.org API group.

$ kubectl edit clusterrole kubeflow-kubernetes-edit
…
rules:
…
- apiGroups:
  - "kubernetes.dask.org"
  verbs:
  - "*"
  resources:
  - "*"
…

Creating the Dask cluster

Now, create DaskCluster resources in Kubernetes to launch all the necessary pods and services for your cluster to work. You can do this in YAML through the Kubernetes API if you like, but for this post, use the Python API from the notebook session.

Back in the Jupyter session, create a new notebook and install the dask-kubernetes package that you need for launching your clusters.

!pip install dask-kubernetes

Next, create a Dask cluster using the KubeCluster class. Be sure to set the container image to match the one you chose for your notebook environment and set the number of GPUs to 1. You also tell the RAPIDS container not to start Jupyter by default but to run the Dask command instead.

This can take a similar amount of time to starting up the notebook container, as it also has to pull the RAPIDS Docker image.

from dask_kubernetes.experimental import KubeCluster

cluster = KubeCluster(
    name="rapids-dask",
    image="rapidsai/rapidsai-core:22.06-cuda11.4-runtime-ubuntu20.04-py3.9",
    worker_command="dask-cuda-worker",
    n_workers=2,
    resources={"limits": {"nvidia.com/gpu": "1"}},
    env={"DISABLE_JUPYTER": "true"},
)

Figure 3 shows that you have a Dask cluster with two workers, and that each worker has an A100 GPU, the same as your Jupyter session.

Screenshot of the Dask cluster widget in JupyterLab showing two workers with A100 GPUs.
Figure 3. Dask has many useful widgets that you can view in your notebook to show the status of your cluster

You can scale this cluster up and down either with the scaling tab of the widget in Jupyter or by calling cluster.scale(n) to set the number of workers, and therefore the number of GPUs.
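For example, assuming the cluster object created above, resizing to four workers (and therefore four GPUs) is a single call:

cluster.scale(4)  # request four Dask workers, one GPU each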

Now, connect a Dask client to your cluster. From that point on, any RAPIDS libraries that support Dask, such as dask_cudf, use your cluster to distribute your computation over all your GPUs. Figure 4 shows a short example of creating a Series object and distributing it with Dask.

Screenshot of cuDF code in JupyterLab that uses Dask.
Figure 4. Create a cuDF DataFrame, distribute it with Dask, then perform a computation and get the results
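The following is a minimal sketch of that kind of computation (assuming the cluster object from the previous cell; cupy and dask_cudf ship in the RAPIDS container image):

from dask.distributed import Client
import cupy as cp
import cudf
import dask_cudf

client = Client(cluster)                              # connect this notebook to the Dask cluster

df = cudf.DataFrame({"x": cp.random.random(1_000_000)})  # a cuDF DataFrame on the local GPU
ddf = dask_cudf.from_cudf(df, npartitions=2)             # partition it across the two workers

print(ddf.x.mean().compute())                            # the computation runs on the cluster's GPUs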

Accessing the Dask dashboard

At the beginning of this section, you added an extra config file with some options for the Dask dashboard. These options are necessary to enable you to access the dashboard running in the scheduler pod on your Kubernetes cluster from your Jupyter environment.

You may have noticed that the cluster and client widgets both had links to the dashboard. Select these links to open the dashboard in a new tab (Figure 5).

Screenshot of the Dask dashboard showing the from_cudf call has run on two GPUs.
Figure 5. Dask dashboard with the from_cudf call

You can also use the Dask JupyterLab extension to view various plots and stats about your Dask cluster right in JupyterLab.

On the Dask tab on the left, choose the search icon. This connects JupyterLab to the dashboard through the client in your notebook. Select the various plots and arrange them in JupyterLab by dragging the tabs around.

Screenshot of JupyterLab with the Dask Lab extension open on the left and various Dask plots arranged on the screen
Figure 6. The Dask dashboard has many useful plots, including some dedicated GPU metrics like memory use and utilization

If you followed along with this post, clean up all the created resources by deleting the GKE cluster created at the start.

$ gcloud container clusters delete rapids-gpu-kubeflow --zone us-central1-c

Closing thoughts

RAPIDS integrates seamlessly with KubeFlow, enabling you to use your GPU resources in the ETL stages of your workflows, as well as during training and inference.

You can either drop the RAPIDS environment straight into the KubeFlow notebooks service for single-node work or use the Dask Operator for Kubernetes from KubeFlow Pipelines to scale that workload onto many nodes and GPUs.

For more information about using RAPIDS, see the following resources: