Categories
Misc

Solving Automatic Speech Recognition Deployment Challenges

Successfully deploying an automatic speech recognition (ASR) application can be a frustrating experience. For example, it is difficult for an ASR system to correctly identify words while maintaining low latency, considering the many different dialects and pronunciations that exist.

Whether you are using a commercial or open-source solution, there are many challenges to consider when building an ASR application.

In this post, I highlight major pain points that developers face when adding ASR capabilities to applications. I share how to approach and overcome these challenges, using NVIDIA Riva speech AI SDK as an example.

Challenges of building ASR applications

Here are a few challenges present in the creation of any ASR system:

  • High accuracy
  • Low latency
  • Compute resource allocation
  • Flexible deployment and scalability
  • Customization
  • Monitoring and tracking

High accuracy

One key metric for measuring speech recognition accuracy is the word error rate (WER). WER is defined as the number of word errors made during transcription (substitutions, insertions, and deletions) divided by the total number of words in the labeled reference transcript.
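
As a rough illustration, here is a minimal WER calculation based on word-level edit distance (the example strings are illustrative; production evaluations typically rely on an established library such as jiwer):

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn the first i reference words into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("please transcribe this call", "please transcribe this cold"))  # 0.25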

Several factors cause transcription errors in ASR models, leading to misinterpretation of information:

  • Quality of the training dataset
  • Different dialects and pronunciations
  • Accents and variations in speech
  • Custom or domain-specific words and acronyms
  • Contextual relationship of words
  • Differentiating phonetically similar sentences

Due to these factors, it is difficult to build a robust ASR model with a low WER score.

Low latency

A conversational AI application is an end-to-end pipeline composed of speech AI and natural language processing (NLP).

For any conversational AI application, response time is critical to natural conversation. It would not be practical to converse with a bot if a customer only receives a response after a minute of waiting.

A conversational AI application should generally deliver a latency of less than 300 ms. So, the speech AI model latency must be far below that 300-ms limit for the model to be integrated into an end-to-end pipeline for real-time conversational AI applications.

Many factors affect the overall latency of the ASR model:

  • Model size: Large, complex models achieve better accuracy but require more computation and add latency compared to smaller models; that is, the inference cost is higher.
  • Hardware: Edge deployment of such complex models adds further complexity to meeting latency requirements.
  • Network bandwidth: Sufficient bandwidth is needed for streaming audio content and transcripts, especially for cloud-based deployment.

Compute resource allocation

Optimizing a model and its resource utilization applies to all AI models and is not specific to ASR. However, it is a critical factor that impacts overall latency as well as the compute cost of running any AI application.

The whole point of optimizing a model is to reduce inference cost, both in compute and in latency. But not all models available online for a particular architecture are created equal: they differ in code quality and can differ dramatically in performance.

They also do not all respond in the same way to knowledge distillation, pruning, quantization, and other optimization techniques intended to improve inference performance without hurting accuracy.
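
As one example of such a technique, the following sketch applies post-training dynamic quantization to a PyTorch model; the model here is a placeholder, and whether quantization helps at all depends on the specific architecture:

import torch
import torch.nn as nn

# Placeholder model standing in for one component of an ASR pipeline
model = nn.Sequential(nn.Linear(256, 512), nn.ReLU(), nn.Linear(512, 128))

# Post-training dynamic quantization: Linear weights are stored in int8,
# which typically shrinks the model and can reduce CPU inference latency
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

print(quantized_model)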

Flexible deployment and scalability

Creating an accurate and efficient model is only a small fraction of any real-time AI application. The required surrounding infrastructure is vast and complex. For example, deployment infrastructure should include:

  • Streaming support
  • Resource management service
  • Servicing infrastructure
  • Analytics tool support
  • Monitoring service

Creating a custom, end-to-end optimized deployment pipeline that meets the latency requirements of an ASR application is challenging because it requires optimization and acceleration at each pipeline stage.

Based on the number of audio streams that must be supported at a given moment, your speech recognition application should be able to auto-scale its deployment to maintain acceptable performance.

Customization

Getting a model to work out-of-the-box is always the goal. However, the performance of currently available models depends on the dataset used during training. Models typically work great for the use cases they have already been exposed to, but the same model’s performance may degrade as soon as it is deployed in a different domain.

Specifically, in the case of ASR, model performance depends on accent, language, and variations in speech. You should be able to customize models based on the application use case.

For instance, speech recognition models deployed in healthcare or financial applications require support for domain-specific vocabulary. This vocabulary differs from what is normally used during ASR model training.

To support regional languages for ASR, you need a complete set of training pipelines to easily customize the model and efficiently handle different dialects.

Monitoring and tracking

Real-time monitoring and tracking provide instant insights, alerts, and notifications so that you can take timely corrective action. They also help track resource consumption against incoming traffic so that the application can be auto-scaled, and quota limits can be set to minimize infrastructure cost without impacting overall throughput.

Capturing all these statistics requires integrating multiple libraries to measure performance at various stages of the ASR pipeline.

Examples of how the Riva SDK addresses ASR challenges

Advanced SDKs can be used to conveniently add a voice interface to your applications. In this post, I demonstrate how a GPU-accelerated SDK like Riva can be applied to solve these challenges when building speech recognition applications.

High accuracy and compute optimization

You can use pretrained Riva speech models from NGC that can be fine-tuned with the TAO Toolkit on a custom dataset, accelerating domain-specific model development by 10x.

All NGC models are optimized for GPU deployment and are fully supported by NVIDIA TensorRT. Riva’s high-performance inference is powered by TensorRT optimizations and served using NVIDIA Triton Inference Server, reducing the overall compute requirement and, in turn, improving server throughput.

Several ASR models on NGC are further optimized as part of the Riva pipeline for better performance.

Continued optimizations across the entire Riva stack—from models and software to hardware—have delivered a 12x gain compared to the previous generation.

Figure 1. ASR performance acceleration using NVIDIA Riva compared to the previous generation

Low latency

Latencies and throughput measurements for streaming and offline configurations are reported under the ASR performance section of Riva documentation.

In the streaming low-latency Riva ASR deployment mode, the average latency is well under 50 ms in most cases. Using such ASR models, it becomes easier to create a real-time conversational AI pipeline and still stay within the 300-ms response budget.

Flexible deployment and scaling

Deploying a speech recognition application with ease requires support for every target platform. The Riva SDK provides flexibility at each step, from fine-tuning models on domain-specific datasets to customizing pipelines. It can be deployed in the cloud, on-premises, and on edge and embedded devices.

To support scaling, Riva is fully containerized and can scale to hundreds and thousands of parallel streams. Riva is also available in the NGC Helm repository as a chart designed for push-button deployment to a Kubernetes cluster.

For more information about deployment best practices, see the Autoscaling Riva Deployment with Kubernetes for Conversational AI in Production on-demand GTC session.

Customization

Figure 2. Customization techniques for the Riva ASR pipeline range from word boosting to fine-tuning punctuation and capitalization models

Customization techniques are helpful when out-of-the-box Riva models fall short in challenging scenarios not seen in the training data, such as narrow domain terminology, new accents, or noisy environments.

SDKs like Riva support customization ranging from word boosting to enabling end users to custom-train their own acoustic models.

Riva speech skills also provide high-quality, pretrained models across a variety of languages. For more information about all the models for supported languages, see the Language Support section.

Monitoring and tracking

In Riva, the underlying Triton Inference Server metrics are available to end users for customization and dashboard creation. These metrics are exposed through the Triton metrics endpoint.

NVIDIA Triton provides Prometheus metrics indicating GPU and request statistics, which helps in monitoring and tracking the production deployment.
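
As a minimal sketch, assuming the default Triton metrics port (8002) is reachable from the machine running the query, the metrics can be pulled with a plain HTTP request:

$ curl http://localhost:8002/metrics

The response is in Prometheus text format, so it can be scraped directly by a Prometheus server and visualized in a dashboard tool such as Grafana.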

Key takeaways

This post provides you with a high-level overview of common pain points that arise when developing AI applications with ASR capabilities. Being aware of the factors impacting your ASR application’s overall performance helps you streamline and improve the end-to-end development process.

I also shared a few Riva capabilities that can help mitigate these problems, as well as customization tips for ASR applications to support your domain-specific use case.

For more information, see the following resources:

Categories
Misc

Top AI Video Analytics Sessions at GTC 2022

Register now to experience the most advanced developer tools and learn how experts across industries are using vision AI to increase operational efficiency.

Categories
Misc

UN Economic Commission for Africa Engages NVIDIA to Boost Data Science in 10 Nations

NVIDIA is collaborating with the United Nations Economic Commission for Africa (UNECA) to equip governments and developer communities in 10 nations with data science training and technology to support more informed policymaking and accelerate how resources are allocated. The initiative will empower the countries’ national statistical offices — agencies that handle population census data, economic …

The post UN Economic Commission for Africa Engages NVIDIA to Boost Data Science in 10 Nations appeared first on NVIDIA Blog.

Categories
Misc

Fraunhofer Research Leads Way Into Future of Robotics

Joseph Fraunhofer was a 19th-century pioneer in optics who brought together scientific research with industrial applications. Fast forward to today and Germany’s Fraunhofer Society — Europe’s largest R&D organization — is setting its sights on the applied research of key technologies, from AI to cybersecurity to medicine. Its Fraunhofer IML unit is aiming to push …

The post Fraunhofer Research Leads Way Into Future of Robotics appeared first on NVIDIA Blog.

Categories
Misc

Rendered.ai Founder and CEO Nathan Kundtz on Using AI to Build Better AI

Data is the fuel that makes artificial intelligence run. Training machine learning and AI systems requires data. And the quality of datasets has a big impact on the systems’ results. But compiling quality real-world data for AI and ML can be difficult and expensive. That’s where synthetic data comes in. The guest for this week’s …

The post Rendered.ai Founder and CEO Nathan Kundtz on Using AI to Build Better AI appeared first on NVIDIA Blog.

Categories
Misc

OBS Studio to Release Software Update 28.0 With NVIDIA Broadcast Features ‘In the NVIDIA Studio’

In the NVIDIA Studio celebrates the Open Broadcaster Software (OBS) Studio’s 10th anniversary and its 28.0 software release. Plus, popular streamer WATCHHOLLIE shares how she uses OBS and a GeForce RTX 3080 GPU in a single-PC setup to elevate her livestreams.

The post OBS Studio to Release Software Update 28.0 With NVIDIA Broadcast Features ‘In the NVIDIA Studio’ appeared first on NVIDIA Blog.

Categories
Misc

Hands-on Access to VMware’s vSphere on NVIDIA BlueField DPUs with NVIDIA LaunchPad

The increasing adoption of cloud-native workloads is causing a significant shift in infrastructure architecture to support next-generation applications such as AI and big data. Infrastructure must evolve to provide composability and flexibility using virtualization, containers, or bare metal servers.

Traditional software-defined infrastructure provides flexibility but suffers from performance and scalability limitations, as up to 30% of server CPU cores may be consumed by workloads such as networking, storage, and security.

VMware saw this as an opportunity to enhance vSphere infrastructure and reengineered it to take a hardware-based, disaggregated approach. The infrastructure software stack is now tightly coupled to the hardware stack. The result is VMware vSphere Distributed Services Engine (VMware Project Monterey), which integrates closely with the NVIDIA BlueField DPU to provide an evolutionary architectural approach for data center, cloud, and edge. It addresses the changing requirements of next-gen and cloud-native workloads. 

To help organizations test the benefits of vSphere on BlueField DPUs, NVIDIA LaunchPad has curated a lab with exclusive access to live demonstrations and self-paced learning. The lab is designed to provide your IT team with deep, practical experience in a hosted environment before starting an on-premises deployment. 

vSphere Distributed Services Engine

vSphere Distributed Services Engine is a software-defined and hardware-accelerated architecture for VMware’s Cloud Foundation. It provides a breakthrough solution that supports virtualization, containers, and scalable management.

Figure 1. vSphere Distributed Services Engine architecture diagram, re-architecting VMware ESXi for multi-cloud environments

NVIDIA BlueField

The BlueField DPU offloads, accelerates, and isolates the infrastructure workloads, allowing the server CPUs to focus on core applications and revenue-generating tasks. The integration of VMware vSphere and NVIDIA BlueField simplifies building a cloud-ready infrastructure, while providing a consistent service quality in multi-cloud environments.

VMware enables workload composability and portability that supports multi-cloud. At the same time, BlueField handles critical networking, storage, security, and management, including telemetry tasks, freeing up the CPUs to run business and tenant applications. The BlueField DPU also provides security isolation by running firewall, key management, and IDS/IPS on its Arm cores, in a separate domain from the applications.

NVIDIA LaunchPad

LaunchPad gives you access to dedicated hardware and software through a remote lab, so that IT teams can walk through the entire process of deploying and managing vSphere on BlueField DPUs.

Background

To understand why this is important, start by looking at how modern applications have changed. Modern applications are driving many underlying hardware infrastructure requirements, including:

  • Increased networking traffic that creates performance and scale challenges
  • Hardware acceleration requirements that drive significant operational complexities
  • A lack of a clear definition of the traditional data center perimeter, which intensifies the need for new security models. 

Modern applications require more server CPU cycles. As application compute requirements continue to grow, infrastructure demands for CPU cycles increasingly compete with them.

Specialized data processing units (DPUs) have been developed to offload, accelerate, and isolate networking, storage, security, and management tasks from the CPU. With DPUs, organizations free up server CPU cycles for core application processing, which accelerates job completion time over a robust, zero-trust data center infrastructure.

Selected functions that used to run on the core CPU are offloaded, accelerated, and isolated on BlueField to support new possibilities, including the following:

  • Improved performance for application and infrastructure services
  • Enhanced visibility, application security, and observability
  • Isolated security capabilities
  • Enhanced data center efficiency and reduced enterprise, edge, and cloud costs

Next steps

VMware vSphere is leading the shift to advanced hybrid-cloud data center architectures, which benefit from the hypervisor and accelerated software-defined networking, security, and storage. With access to vSphere on BlueField DPU preconfigured clusters, you can explore the evolution of VMware Cloud Foundation and take advantage of the disruptive hardware capabilities of servers equipped with BlueField DPUs.

Interested organizations can register for the NVIDIA and VMware vSphere on BlueField DPU early access program. If you are already a member, you can get immediate access.

For more information about the NVIDIA and VMware collaboration, see NVIDIA & VMware: A New Partnership, A New Data Center.

Categories
Misc

Accelerating ETL on KubeFlow with RAPIDS

In the machine learning and MLOps world, GPUs are widely used to speed up model training and inference, but what about the other stages of the workflow like ETL pipelines or hyperparameter optimization?

Within the RAPIDS data science framework, ETL tools are designed to have a familiar look and feel to data scientists working in Python. Do you currently use Pandas, NumPy, Scikit-learn, or other parts of the PyData stack within your KubeFlow workflows? If so, you can use RAPIDS to accelerate those parts of your workflow by leveraging the GPUs likely already available in your cluster.

In this post, I demonstrate how to drop RAPIDS into a KubeFlow environment. You start with using RAPIDS in the interactive notebook environment and then scale beyond your single container to use multiple GPUs across multiple nodes with Dask.

Optional: Installing KubeFlow with GPUs

This post assumes you are already somewhat familiar with Kubernetes and KubeFlow. To explore how you can use GPUs with RAPIDS on KubeFlow, you need a KubeFlow cluster with GPU nodes. If you already have a cluster or are not interested in KubeFlow installation instructions, feel free to skip ahead.

KubeFlow is a popular machine learning and MLOps platform built on Kubernetes for designing and running machine learning pipelines, training models, and providing inference services.

KubeFlow also provides a notebook service that you can use to launch an interactive Jupyter server in your Kubernetes cluster and a pipeline service with a DSL library, written in Python, to create repeatable workflows. Tools for adjusting hyperparameters and running a model inference server are also accessible. This is essentially all the tooling that you need for building a robust machine learning service.

For this post, you use Google Kubernetes Engine (GKE) to launch a Kubernetes cluster with GPU nodes and install KubeFlow onto it, but any KubeFlow cluster with GPUs will do.

Creating a Kubernetes cluster with GPUs

First, use the gcloud CLI to create a Kubernetes cluster.

$ gcloud container clusters create rapids-gpu-kubeflow \
  --accelerator type=nvidia-tesla-a100,count=2 --machine-type a2-highgpu-2g \
  --zone us-central1-c --release-channel stable
Note: Machines with GPUs have certain limitations which may affect your workflow. Learn more at https://cloud.google.com/kubernetes-engine/docs/how-to/gpus
Creating cluster rapids-gpu-kubeflow in us-central1-c... 
Cluster is being health-checked (master is healthy)...
Created 
kubeconfig entry generated for rapids-gpu-kubeflow.
NAME             	LOCATION   	MASTER_VERSION	MASTER_IP   	MACHINE_TYPE   NODE_VERSION  	NUM_NODES  STATUS
rapids-gpu-kubeflow  us-central1-c  1.21.12-gke.1500  34.132.107.217  a2-highgpu-2g  1.21.12-gke.1500  3      	RUNNING

With this command, you’ve launched a GKE cluster called rapids-gpu-kubeflow. You’ve specified that it should use nodes of type a2-highgpu-2g, each with two A100 GPUs.

KubeFlow also requires a stable version of Kubernetes, so you specified that along with the zone in which to launch the cluster.

Next, install the NVIDIA drivers onto each node.

$ kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml
daemonset.apps/nvidia-driver-installer created

Verify that the NVIDIA drivers are successfully installed.

$ kubectl get po -A --watch | grep nvidia
kube-system   nvidia-driver-installer-6zwcn                               	1/1 	Running   0      	8m47s
kube-system   nvidia-driver-installer-8zmmn                               	1/1 	Running   0      	8m47s
kube-system   nvidia-driver-installer-mjkb8                               	1/1 	Running   0      	8m47s
kube-system   nvidia-gpu-device-plugin-5ffkm                              	1/1 	Running   0      	13m
kube-system   nvidia-gpu-device-plugin-d599s                              	1/1 	Running   0      	13m
kube-system   nvidia-gpu-device-plugin-jrgjh                              	1/1 	Running   0      	13m

After your drivers are installed, create a quick sample pod that uses some GPU compute to make sure that everything is working as expected.

For example, create a pod that requests a GPU and runs a CUDA vectorAdd sample, then check its logs for the result (the exact sample image may differ in your environment):

$ cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvidia/samples:vectoradd-cuda11.2.1"
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
pod/cuda-vectoradd created

$ kubectl logs pod/cuda-vectoradd
If you see Test PASSED in the output, you can be confident that your Kubernetes cluster has GPU compute set up correctly. Next, clean up that pod.

$ kubectl delete po cuda-vectoradd
pod "cuda-vectoradd" deleted

Installing KubeFlow

Now that you have Kubernetes, install KubeFlow. KubeFlow uses kustomize, so be sure to have that installed.

$ curl -s "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh" | bash

Then, install KubeFlow by cloning the KubeFlow manifests repo, checking out the latest release, and applying them.

$ git clone https://github.com/kubeflow/manifests
$ cd manifests
$ git checkout v1.5.1  # Or whatever the latest release is
$ while ! kustomize build example | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done

After all the resources have been created, KubeFlow still has to bootstrap itself on your cluster. Even after this command finishes, things may not be ready yet. This can take upwards of 15 minutes.

Eventually, you should see a full list of KubeFlow services in the kubeflow namespace.

$ kubectl get po -n kubeflow
NAME                                                     	READY   STATUS	RESTARTS   AGE
admission-webhook-deployment-667bd68d94-4n62z            	1/1 	Running   0      	10m
cache-deployer-deployment-79fdf9c5c9-7cpn7               	1/1 	Running   2      	10m
cache-server-6566dc7dbf-7ndm5                            	1/1 	Running   0      	10m
centraldashboard-8fc7d8cc-q62cd                          	1/1 	Running   0      	10m
jupyter-web-app-deployment-84c459d4cd-krxq4              	1/1 	Running   0      	10m
katib-controller-68c47fbf8b-bjvst                        	1/1 	Running   0      	10m
katib-db-manager-6c948b6b76-xtrwz                        	1/1 	Running   2      	10m
katib-mysql-7894994f88-6ndtp                             	1/1 	Running   0      	10m
katib-ui-64bb96d5bf-v598l                                	1/1 	Running   0      	10m
kfserving-controller-manager-0                           	2/2 	Running   0      	9m54s
kfserving-models-web-app-5d6cd6b5dd-hp2ch                	1/1 	Running   0      	10m
kubeflow-pipelines-profile-controller-69596b78cc-zrvhc   	1/1 	Running   0      	10m
metacontroller-0                                         	1/1 	Running   0      	9m53s
metadata-envoy-deployment-5b4856dd5-r7xnn                	1/1 	Running   0      	10m
metadata-grpc-deployment-6b5685488-9rd9q                 	1/1 	Running   6      	10m
metadata-writer-548bd879bb-7fr7x                         	1/1 	Running   1      	10m
minio-5b65df66c9-dq2rr                                   	1/1 	Running   0      	10m
ml-pipeline-847f9d7f78-pl7z5                             	1/1 	Running   0      	10m
ml-pipeline-persistenceagent-d6bdc77bd-wd4p8             	1/1 	Running   2      	10m
ml-pipeline-scheduledworkflow-5db54d75c5-6c5vv           	1/1 	Running   0      	10m
ml-pipeline-ui-5bd8d6dc84-sg9t8                          	1/1 	Running   0      	9m59s
ml-pipeline-viewer-crd-68fb5f4d58-wjhvv                  	1/1 	Running   0      	9m59s
ml-pipeline-visualizationserver-8476b5c645-96ptw         	1/1 	Running   0      	9m59s
mpi-operator-5c55d6cb8f-vwr8p                            	1/1 	Running   0      	9m58s
mysql-f7b9b7dd4-pv767                                    	1/1 	Running   0      	9m58s
notebook-controller-deployment-6b75d45f48-rpl5b          	1/1 	Running   0      	9m57s
profiles-deployment-58d7c94845-gbm8m                     	2/2 	Running   0      	9m57s
tensorboard-controller-controller-manager-775777c4c5-b6c2k   2/2 	Running   2      	9m56s
tensorboards-web-app-deployment-6ff79b7f44-g5cr8         	1/1 	Running   0      	9m56s
training-operator-7d98f9dd88-hq6v4                       	1/1 	Running   0      	9m55s
volumes-web-app-deployment-8589d664cc-krfxs              	1/1 	Running   0      	9m55s
workflow-controller-5cbbb49bd8-b7qmd                     	1/1 	Running   1      	9m55s

After all your pods are in a Running state, port forward the KubeFlow web user interface, and access it in your browser.
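
For example, assuming the default Istio ingress gateway service installed by the KubeFlow manifests:

$ kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80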

Navigate to 127.0.0.1:8080 and log in with the default credentials user@example.com and 12341234. Then, you should see the KubeFlow dashboard (Figure 1).

Figure 1. KubeFlow dashboard

Using RAPIDS in KubeFlow notebooks

To get started with RAPIDS on your KubeFlow cluster, start a notebook session using the official RAPIDS container images.

Before launching your notebook, you must create a configuration profile that will be needed when you start using Dask later. To do this, apply the following manifest:

# configure-dask-dashboard.yaml
apiVersion: "kubeflow.org/v1alpha1"
kind: PodDefault
metadata:
  name: configure-dask-dashboard
spec:
  selector:
    matchLabels:
      configure-dask-dashboard: "true"
  desc: "configure dask dashboard"
  env:
    - name: DASK_DISTRIBUTED__DASHBOARD__LINK
      value: "{NB_PREFIX}/proxy/{host}:{port}/status"
  volumeMounts:
    - name: jupyter-server-proxy-config
      mountPath: /root/.jupyter/jupyter_server_config.py
      subPath: jupyter_server_config.py
  volumes:
    - name: jupyter-server-proxy-config
      configMap:
        name: jupyter-server-proxy-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: jupyter-server-proxy-config
data:
  jupyter_server_config.py: |
    c.ServerProxy.host_allowlist = lambda app, host: True

Create a file with the contents of this code example, and then apply it into the user@example.com user namespace with kubectl.

$ kubectl apply -n kubeflow-user-example-com -f configure-dask-dashboard.yaml

Now, choose a RAPIDS version to use. Typically, you want the container image for the latest release. The default CUDA version installed on the GKE Stable channel is 11.4, so choose the matching image. (From CUDA 11.5 onward, this matters less, as minor versions are backward compatible.) Copy the container image name from the installation command:

rapidsai/rapidsai-core:22.06-cuda11.4-runtime-ubuntu20.04-py3.9

Back in KubeFlow, choose the Notebooks tab on the left and choose New Notebook.

On this page, you must set a few configuration options:

  • Name: rapids
  • Namespace: kubeflow-user-example-com
  • Custom Image: Select this checkbox.
  • Custom Image: rapidsai/rapidsai-core:22.06-cuda11.4-runtime-ubuntu20.04-py3.9
  • Requested CPUs: 2
  • Requested memory in Gi: 8
  • Number of GPUs: 1
  • GPU Vendor: NVIDIA

Scroll down to Configurations, check the configure dask dashboard option, scroll to the bottom of the page, and then choose Launch. You should see it starting up in your list of notebooks. The RAPIDS container images are packed full of amazing tools, so this step can take a little while.

When the notebook is ready, to launch Jupyter, choose Connect. Verify that everything works okay by opening a terminal window and running nvidia-smi (Figure 2).

Figure 2. Running nvidia-smi in a Jupyter terminal is a great way to check that your GPU is set up

Success! Your A100 GPU is being passed through into your notebook container.

The RAPIDS container that you chose also comes with some example notebooks, which you can find in /rapids/notebooks. Make a quick symbolic link to these from your home directory so that you can navigate using the file explorer on the left:

ln -s /rapids/notebooks /home/jovyan/notebooks

Navigate to those example notebooks and explore all the libraries that RAPIDS offers. For example, ETL developers that use pandas should check out the cuDF notebooks for examples of accelerated DataFrames.
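
As a quick illustration of that pandas-like API (this snippet is not from the example notebooks), a groupby aggregation on the GPU looks almost identical to pandas:

import cudf

# Create a GPU DataFrame, just like a pandas DataFrame
df = cudf.DataFrame({
    "key": ["a", "b", "a", "b", "c"],
    "value": [1, 2, 3, 4, 5],
})

# The aggregation runs on the GPU
print(df.groupby("key").sum())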

Scaling your RAPIDS workflows

Many RAPIDS libraries also support scaling out your computations onto multiple GPUs spread over many nodes for added acceleration. To do this, use Dask, an open-source Python library for distributed computing.

To use Dask, create a scheduler and some workers to perform your calculations. These workers also need GPUs and the same Python environment as the notebook session. Dask has an operator for Kubernetes that you can use to manage Dask clusters on your KubeFlow cluster, so install that now.

Installing the Dask Kubernetes operator

To install the operator, you create the operator itself and its associated custom resources. For more information, see Installing in the Dask documentation.

In the terminal window that you used to create your KubeFlow cluster, run the following commands:

$ kubectl apply -f https://raw.githubusercontent.com/dask/dask-kubernetes/main/dask_kubernetes/operator/deployment/manifests/daskcluster.yaml

$ kubectl apply -f https://raw.githubusercontent.com/dask/dask-kubernetes/main/dask_kubernetes/operator/deployment/manifests/daskworkergroup.yaml

$ kubectl apply -f https://raw.githubusercontent.com/dask/dask-kubernetes/main/dask_kubernetes/operator/deployment/manifests/daskjob.yaml

$ kubectl apply -f https://raw.githubusercontent.com/dask/dask-kubernetes/main/dask_kubernetes/operator/deployment/manifests/operator.yaml

Verify that your resources were applied successfully by listing your Dask clusters. You shouldn’t expect to see any yet, but the command should succeed.

$ kubectl get daskclusters
No resources found in default namespace.

You can also check that the operator pod is running and ready to launch new Dask clusters.

$ kubectl get pods -A -l application=dask-kubernetes-operator
NAMESPACE       NAME                                        READY   STATUS    RESTARTS   AGE
dask-operator   dask-kubernetes-operator-775b8bbbd5-zdrf7   1/1     Running   0          74s

Lastly, make sure that your notebook session can create and manage the Dask custom resources. To do this, edit the kubeflow-kubernetes-edit cluster role that gets applied to your notebook pods. Add a new rule to the rules section for this role to allow everything in the kubernetes.dask.org API group.

$ kubectl edit clusterrole kubeflow-kubernetes-edit
…
rules:
…
- apiGroups:
  - "kubernetes.dask.org"
  verbs:
  - "*"
  resources:
  - "*"
…

Creating the Dask cluster

Now, create DaskCluster resources in Kubernetes to launch all the necessary pods and services for your cluster to work. You can do this in YAML through the Kubernetes API if you like, but for this post, use the Python API from the notebook session.

Back in the Jupyter session, create a new notebook and install the dask-kubernetes package that you need for launching your clusters.

!pip install dask-kubernetes

Next, create a Dask cluster using the KubeCluster class. Make sure you set the container image to match the one you chose for your notebook environment and set the number of GPUs to 1 per worker. You also tell the RAPIDS container not to start Jupyter by default and to run the Dask command instead.

This can take a similar amount of time to starting up the notebook container, as it also has to pull the RAPIDS Docker image.

from dask_kubernetes.experimental import KubeCluster

cluster = KubeCluster(
    name="rapids-dask",
    image="rapidsai/rapidsai-core:22.06-cuda11.4-runtime-ubuntu20.04-py3.9",
    worker_command="dask-cuda-worker",
    n_workers=2,
    resources={"limits": {"nvidia.com/gpu": "1"}},
    env={"DISABLE_JUPYTER": "true"},
)

Figure 3 shows that you have a Dask cluster with two workers, and that each worker has an A100 GPU, the same as your Jupyter session.

Figure 3. The Dask cluster widget in JupyterLab is one of many useful widgets that show the status of your cluster

You can scale this cluster up and down either with the scaling tab of the widget in Jupyter or by calling cluster.scale(n) to set the number of workers, and therefore the number of GPUs.

Now, connect a Dask client to your cluster. From that point on, any RAPIDS libraries that support Dask, such as dask_cudf, use your cluster to distribute your computation over all your GPUs. Figure 4 shows a short example of creating a Series object and distributing it with Dask.
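
A minimal sketch along the lines of Figure 4 (the column values and partition count are illustrative):

import cudf
import dask_cudf
from dask.distributed import Client

# Connect a Dask client to the KubeCluster created earlier
client = Client(cluster)

# Create a cuDF Series and partition it across the Dask workers
s = cudf.Series(list(range(10_000)))
ds = dask_cudf.from_cudf(s, npartitions=2)

# The computation runs on the GPUs attached to the workers
print(ds.mean().compute())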

Figure 4. Create a cuDF DataFrame, distribute it with Dask, then perform a computation and get the results

Accessing the Dask dashboard

At the beginning of this section, you added an extra config file with some options for the Dask dashboard. These options are necessary to enable you to access the dashboard running in the scheduler pod on your Kubernetes cluster from your Jupyter environment.

You may have noticed that the cluster and client widgets both had links to the dashboard. Select these links to open the dashboard in a new tab (Figure 5).

Figure 5. Dask dashboard showing the from_cudf call running on two GPUs

You can also use the Dask JupyterLab extension to view various plots and stats about your Dask cluster right in JupyterLab.

On the Dask tab on the left, choose the search icon. This connects JupyterLab to the dashboard through the client in your notebook. Select the various plots and arrange them in JupyterLab by dragging the tabs around.

Figure 6. The Dask dashboard has many useful plots, including dedicated GPU metrics like memory use and utilization

If you followed along with this post, clean up all the created resources by deleting the GKE cluster created at the start.

$ gcloud container clusters delete rapids-gpu-kubeflow --zone us-central1-c

Closing thoughts

RAPIDS integrates seamlessly with KubeFlow, enabling you to use your GPU resources in the ETL stages of your workflows, as well as during training and inference.

You can either drop the RAPIDS environment straight into the KubeFlow notebooks service for single-node work or use the Dask Operator for Kubernetes from KubeFlow Pipelines to scale that workload onto many nodes and GPUs.

For more information about using RAPIDS, see the following resources:

Categories
Misc

Beating SOTA Inference Performance on NVIDIA GPUs with GPUNet

Crafted by AI for AI, GPUNet is a class of convolutional neural networks designed to maximize the performance of NVIDIA GPUs using NVIDIA TensorRT.

Built using novel neural architecture search (NAS) methods, GPUNet demonstrates state-of-the-art inference performance up to 2x faster than EfficientNet-X and FBNet-V3.

The NAS methodology helps build GPUNet for a wide range of applications such that deep learning engineers can directly deploy these neural networks depending on the relative accuracy and latency targets.                                                                                                           

GPUNet NAS design methodology

Efficient architecture search and deployment-ready models are the key goals of the NAS design methodology. This means little to no interaction with the domain experts and efficient use of cluster nodes for training potential architecture candidates. Most important is that the generated models are deployment-ready.

Crafted by AI

Finding the best-performing architecture for a target device can be time-consuming. NVIDIA built and deployed a novel NAS AI agent that efficiently makes the tough design choices required to build GPUNets that beat the current SOTA models by a factor of 2x.

This NAS AI agent automatically orchestrates hundreds of GPUs in the Selene supercomputer without any intervention from the domain experts.

Optimized for NVIDIA GPU using TensorRT

GPUNet picks up the most relevant operations required to meet the target model accuracy with related TensorRT inference latency cost, promoting GPU-friendly operators (for example, larger filters) over memory-bound operators (for example, fancy activations). It delivers the SOTA GPU latency and the accuracy on ImageNet.

Deployment-ready

The GPUNet reported latencies include all the performance optimization available in the shipping version of TensorRT, including fused kernels, quantization, and other optimized paths. Built GPUNets are ready for deployment.

Building a GPUNet: An end-to-end NAS workflow

At a high level, the neural architecture search (NAS) AI agent is split into two stages:

  • Categorizing all possible network architectures by the inference latency.
  • Using a subset of these networks that fit within the latency budget and optimizing them for accuracy.

In the first stage, as the search space is high-dimensional, the agent uses Sobol sampling to distribute the candidates more evenly. Using the latency look-up table, these candidates are then categorized into a subsearch space, for example, a subset of networks with total latency under 0.5 msecs on NVIDIA V100 GPUs.

The inference latency used in this stage is an approximate cost, calculated by summing up the latency of each layer from the latency lookup table. The latency table uses input data shape and layer configurations as keys to look up the related latency on the queried layer.
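
A simplified sketch of this approximation is shown below; the table layout, keys, and latency values are illustrative, not the actual NAS implementation:

# Hypothetical lookup table: (layer type, input shape, layer config) -> latency in ms
latency_table = {
    ("conv", (224, 224, 3), "k3_s2_f32"): 0.021,
    ("fused_irb", (112, 112, 32), "k3_s2_e4_f48"): 0.034,
    ("irb", (56, 56, 48), "k5_s2_e6_f96"): 0.048,
}

def approximate_latency(layers):
    """Sum per-layer latencies looked up by layer type, input shape, and configuration."""
    return sum(latency_table[(ltype, shape, config)] for ltype, shape, config in layers)

candidate = [
    ("conv", (224, 224, 3), "k3_s2_f32"),
    ("fused_irb", (112, 112, 32), "k3_s2_e4_f48"),
    ("irb", (56, 56, 48), "k5_s2_e6_f96"),
]
print(approximate_latency(candidate))  # approximate cost used to bucket this candidate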

In the second stage, the agent sets up a Bayesian optimization loss function to find the best-performing, higher-accuracy network within the latency range of the subspace:

loss = CrossEntropy(model weights) + alpha * latency(architecture candidate)^{beta}

Figure 2. NVIDIA NAS AI agent end-to-end workflow, from a baseline model to a ranked list of the best-performing neural architectures

The AI agent uses a client-server distributed training controller to perform NAS simultaneously across multiple network architectures. The AI agent runs on one server node, proposing and training network candidates that run on several client nodes on the cluster.

Based on the results, only the promising network architecture candidates that meet both the accuracy and the latency targets of the target hardware get ranked, resulting in a handful of best-performing GPUNets that are ready to be deployed on NVIDIA GPUs using TensorRT.

GPUNet model architecture

GPUNet is an eight-stage model architecture that uses EfficientNet-V2 as its baseline.

The search space definition includes searching on the following variables:

  • Type of operations
  • Stride
  • Kernel size
  • Number of layers
  • Activation function
  • IRB expansion ratio
  • Output channel filters
  • Squeeze excitation (SE)

Table 1 shows the range of values for each variable in the search space. 

Table 1. Value ranges for search space variables
Stage | Type                    | Stride | Kernel | Layers | Activation | ER     | Filters        | SE
0     | Conv                    | 2      | [3,5]  | 1      | [R,S]      | -      | [24, 32, 8]    | -
1     | Conv                    | 1      | [3,5]  | [1,4]  | [R,S]      | -      | [24, 32, 8]    | -
2     | F-IRB                   | 2      | [3,5]  | [1,8]  | [R,S]      | [2, 6] | [32, 80, 16]   | [0, 1]
3     | F-IRB                   | 2      | [3,5]  | [1,8]  | [R,S]      | [2, 6] | [48, 112, 16]  | [0, 1]
4     | IRB                     | 2      | [3,5]  | [1,10] | [R,S]      | [2, 6] | [96, 192, 16]  | [0, 1]
5     | IRB                     | 1      | [3,5]  | [0,15] | [R,S]      | [2, 6] | [112, 224, 16] | [0, 1]
6     | IRB                     | 2      | [3,5]  | [1,15] | [R,S]      | [2, 6] | [128, 416, 32] | [0, 1]
7     | IRB                     | 1      | [3,5]  | [0,15] | [R,S]      | [2, 6] | [256, 832, 64] | [0, 1]
8     | Conv1x1 & Pooling & FC  | -      | -      | -      | -          | -      | -              | -

The first two stages search for the head configurations using convolutions. Inspired by EfficientNet-V2, the second and third stages use Fused-IRB. Fused-IRBs result in higher latency though, so in stages 4 to 7 these are replaced by IRBs.

The column Layers shows the range of layers in the stage. For example, [1, 10] in stage 4 means that the stage can have 1 to 10 IRBs. The column Filters shows the range of output channel filters for the layers in the stage. This search space also tunes the expansion ratio (ER), activation types, kernel sizes, and the Squeeze Excitation (SE) layer inside the IRB/Fused-IRB.

Finally, the dimensions of the input image are searched from 224 to 512, at the step of 32.

Each GPUNet candidate built from the search space is encoded into a 41-wide integer vector (Table 2).

Table 2. The encoding scheme of networks in the search space
Stage | Type       | Hyperparameters                         | Length
-     | Resolution | [Resolution]                            | 1
0     | Conv       | [#Filters]                              | 1
1     | Conv       | [Kernel, Activation, #Layers]           | 3
2     | Fused-IRB  | [#Filters, Kernel, E, SE, Act, #Layers] | 6
3     | Fused-IRB  | [#Filters, Kernel, E, SE, Act, #Layers] | 6
4     | IRB        | [#Filters, Kernel, E, SE, Act, #Layers] | 6
5     | IRB        | [#Filters, Kernel, E, SE, Act, #Layers] | 6
6     | IRB        | [#Filters, Kernel, E, SE, Act, #Layers] | 6
7     | IRB        | [#Filters, Kernel, E, SE, Act, #Layers] | 6

At the end of the NAS search, the ranked candidates returned are a list of these best-performing encodings, which in turn are the best-performing GPUNets.

Summary

All ML practitioners are encouraged to read the CVPR 2022 GPUNet paper, explore the related GPUNet training code on the NVIDIA/DeepLearningExamples GitHub repo, and run inference on the Colab instance on available cloud GPUs. GPUNet inference is also available on the PyTorch Hub. The Colab instance uses the GPUNet checkpoints hosted on the NGC hub. These checkpoints have varying accuracy and latency tradeoffs, which can be chosen based on the requirements of the target application.

Categories
Misc

AI in Endoscopy: Improving Detection Rates and Visibility with Real-Time Sensing

Clinical applications for AI are improving digital surgery, helping to reduce errors, provide consistency, and enable surgeon augmentations that were previously unimaginable.

In endoscopy, a minimally invasive procedure used to examine the interior of an organ or cavity of a body, AI and accelerated computing are enabling better detection rates and visibility.

Endoscopists can investigate symptoms, make a diagnosis, and treat patients by cauterizing a bleeding blood vessel, for example. There are numerous forms of endoscopy, many of them focused on gastroenterological diseases that affect the digestive tract.

Colonoscopy, one of the most common forms of gastrointestinal endoscopy, is essential for catching colorectal cancer, a disease that the American Cancer Society predicts will affect over 150,000 people in 2022.

With the assistance of AI, surgeries like endoscopy are becoming safer and more consistent while reducing surgeon workload. The tasks being augmented with machine learning algorithms include labeling, clearing surgical smoke, classifying airway diseases, identifying airway sizes, identifying lesions and diseased tissue, and auto-calculating the best physical routes for instruments.

To enable these clinical applications, technical algorithms are being developed for specific tasks:

  • Organ segmentation for detection and automatic measurements
  • Tool tracking
  • Tissue type identification
  • Optical flow
  • Lesion classification

Enhancing and processing video streams

In endoscopy, the task of enhancing and processing video streams is key to augmenting surgeon technical skills. This includes tasks of endoscopic image denoising, anomaly object detection, and anomaly measurements, as well as the streaming tasks of ingesting high-resolution and high-bandwidth data.

To implement AI and accomplish these tasks, developers must address numerous challenges in the medical device development process such as:

  • Ingesting high resolution, high-bandwidth data streams
  • Running AI inference with a low-latency budget
  • Finding flexible sensor and data I/O options
  • Building a distributed compute platform from edge to data center to cloud
  • Adopting new deep learning algorithms

Today, most developers have to build individual solutions every time they need to solve a problem or workflow bottleneck. NVIDIA Clara Holoscan is a development platform that provides accelerated compute for AI workloads such as enhanced visualization and automatic anomaly detection, so you can easily customize solutions for various challenges.

Having an accelerated platform to iterate and integrate AI into endoscopy workflows gives you a low-risk, low-cost approach for adding augmentations to your existing endoscopy systems.

Whether deploying latency-sensitive, real-time tasks on the edge or analytic and summarization tasks to the cloud, NVIDIA Clara Holoscan offloads complexity, allowing you to quickly build custom AI solutions to improve endoscopy.

Endoscopy AI sample application on NVIDIA Clara Holoscan

Reference applications can provide an easy starting point if you’re looking to build custom applications for medical devices. The NVIDIA Clara Holoscan SDK includes a sample AI-enabled endoscopy application as a template for reusing components and app graphs in existing applications to build custom AI pipelines.

The endoscopy AI sample application is built end to end on GXF, a modular and extensible framework for building high-performance applications. GXF supports AJA capture devices with HDMI input. The application’s deep learning model performs object detection and tool tracking in real time on an endoscopy video stream.

Several features are used to minimize the overall latency:

  • GPUDirect RDMA video data transfer to eliminate the overhead of copying to or from system memory.
  • NVIDIA Performance Primitives library for CUDA-accelerated 2D image transformations before AI inference.
  • TensorRT runtime for optimized AI Inference and speed-up.
  • CUDA and OpenGL interoperability, which provides efficient resource sharing on the GPU for visualization.

For more information about the endoscopy AI sample application, its hardware and software reference architecture on NVIDIA Clara Holoscan, as well as the path to production, see the Clara Holoscan Endoscopy whitepaper.

Featured image: An endoscopy image from a gallbladder surgery showing AI-powered frame-by-frame tool identification and tracking. Image courtesy of Research Group Camma, IHU Strasbourg, and University of Strasbourg.