Is there such a thing as a default null class, where a bunch of random real-world images are used to represent "no choice"? If we had two label classes, class-1 and class-2, our model will always choose one or the other, but we need a third class to say it is neither class-1 nor class-2. So should there be a default class that cross-references and removes any data in this default class that correlates with either of our classes?
If we transform the image data: we input the raw image, then transform it by splitting the channels into R, G, and B and input the separate color channels, then transform it again to greyscale and input that, so we do multiple rounds of transforms for each single input data point. Then, to get a classification/prediction, we collect the predictions from each transform and compare the results of each? Wouldn't this allow us to achieve more accurate results?
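A minimal sketch of the idea being described, assuming one hypothetical classifier per representation (the models and the choice of averaging are illustrative, not something prescribed by the question):

import numpy as np

def ensemble_predict(models, image):
    """Combine predictions from several transformed views of one image.

    `models` is a dict of hypothetical classifiers: one fed the raw RGB image,
    one fed the stacked single color channels, one fed the greyscale image.
    Each classifier is assumed to return class probabilities (softmax output).
    """
    views = {
        "raw": image,                                             # H x W x 3
        "channels": np.stack([image[..., c] for c in range(3)]),  # 3 x H x W
        "grey": image.mean(axis=-1, keepdims=True),               # H x W x 1
    }
    probs = [models[name](view) for name, view in views.items()]
    return np.mean(probs, axis=0)  # averaged probabilities; argmax gives the class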
Clara Holoscan SDK 0.2 offers real-time AI inference capabilities and fast I/O for high-performance streaming applications in medical devices.
Advances in edge computing, video cameras, real-time processing, and AI have helped transform medical devices over the years. NVIDIA developed the NVIDIA Clara Holoscan platform to support the development of software-defined AI medical devices. The platform consists of NVIDIA Clara Developer Kits, the NVIDIA Clara Holoscan SDK, and NVIDIA Clara Holoscan MGX for production-ready deployment.
The latest release of the NVIDIA Clara Holoscan SDK 0.2 offers real-time AI inference capabilities and fast I/O for high-performance streaming applications in medical devices. This includes endoscopy, ultrasound, surgical robots, microscopy, and genomics sequencing instruments.
The release also consists of:
A core backend built on the NVIDIA Graph eXecution Framework (GXF), replacing GStreamer.
A sample endoscopy AI application.
A customizable AI pipeline to add your own model.
Support for both the Clara AGX Developer Kit with the Jetson AGX Xavier and NVIDIA RTX 6000 and the Clara Holoscan Development Kit with the Jetson AGX Orin and NVIDIA RTX A6000.
Support on the NVIDIA JetPack 5.0 SDK, which includes Ubuntu 20.04.
Graph eXecution Framework processes streaming data
The most significant change in the Clara Holoscan SDK 0.2 is the shift of the core backend from GStreamer to the NVIDIA GXF. GXF is a framework supporting component-based programming for streaming data processing pipelines. It is built for very efficient data ingestion, data transfer, and AI/ML workloads.
With GXF, developers can create reusable components and combine them in graphs to build applications for different products quickly. GXF supports the processing of video and audio streams as well as user-defined streaming data types used in medical devices such as raw ultrasound, radiology imaging scanners, and microscopes.
A recent test using the NVIDIA Latency Display Analysis Tool on a 1080p video stream showed that GXF offers a significant speedup compared to previous solutions. In the test, GXF reduced the overhead in an AI Inferencing application by nearly 3x compared to a similar GStreamer-based pipeline in the Clara Holoscan SDK 0.1.
Figure 1. GXF in Clara Holoscan SDK 0.2 compared to GStreamer in the previous SDK 0.1
Additionally, GXF supports user-customizable components for generic data processing pipelines. GXF handles the critical parts of building a high-performance application through two important components.
First is a scheduler that determines when components execute. The scheduler supports single or multithreaded execution, with conditional execution, asynchronous scheduling, and other custom tools.
Second, GXF has a memory allocator that provides a system with an upfront allocation of a large contiguous memory pool and reuses regions as needed. To ensure zero-copy data exchange between components, memory can be pinned to the device.
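The upfront-pool idea can be illustrated with a generic sketch; this is not the GXF allocator API, just a conceptual bump allocator written in Python:

import numpy as np

class FramePool:
    """Illustrative allocator: one large contiguous buffer, allocated once and reused."""

    def __init__(self, pool_bytes):
        self.pool = np.empty(pool_bytes, dtype=np.uint8)  # single upfront allocation
        self.offset = 0

    def allocate(self, nbytes):
        if self.offset + nbytes > self.pool.size:
            raise MemoryError("pool exhausted")
        view = self.pool[self.offset:self.offset + nbytes]  # a view into the pool, no copy
        self.offset += nbytes
        return view

    def reset(self):
        # Called once per processed frame so the same regions are reused next iteration.
        self.offset = 0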
Figure 2. An example of a generic pipeline where a developer can customize the workflow including data processing, model inference, storage, and visualization
Endoscopy AI sample application on Clara Holoscan
Digital endoscopy has evolved as a key technology for medical screenings and minimally invasive surgeries. The use of real-time AI platforms to process and analyze the video signal produced by the endoscopic camera has been growing.
This technology is helping with anomaly detection and measurements, image enhancements, alerts, and analytics. The Clara Holoscan SDK 0.2 includes a sample AI-enabled endoscopy application showcasing the end-to-end functionality of GXF and support for AJA capture devices with HDMI input.
The endoscopy AI sample application has a deep learning model to perform object detection and tool tracking in real time on an endoscopy video stream.
The application uses several NVIDIA features to minimize the overall latency, including:
GPUDirect RDMA video data transfer to eliminate the overhead of copying to or from system memory.
NVIDIA Performance Primitives (NPP) library for CUDA-accelerated 2D image transformations before AI inference.
TensorRT runtime for optimized AI inference and speedup (see the sketch after this list).
CUDA and OpenGL interoperability, which provides efficient resource sharing on the GPU for visualization.
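As a rough illustration of the TensorRT step, here is a minimal sketch of building an FP16 engine from an ONNX file with the TensorRT Python API; the file names are placeholders, and the sample application's actual build flow may differ:

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:      # placeholder model path
    if not parser.parse(f.read()):
        raise RuntimeError("failed to parse ONNX model")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)    # FP16 for lower inference latency
engine_bytes = builder.build_serialized_network(network, config)

with open("model.plan", "wb") as f:      # serialized engine for the runtime to load
    f.write(engine_bytes)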
To learn more about the endoscopy AI sample application, its hardware and software reference architecture on Clara Holoscan, as well as the path to production, download the Clara Holoscan Endoscopy Whitepaper.
Figure 3: An endoscopy image from a gallbladder surgery showing AI-powered frame-by-frame tool identification and tracking. Image courtesy of Research Group Camma, IHU Strasbourg and the University of Strasbourg
Bring your own model AI application
Developers can bring their own AI model into the Clara Holoscan reference pipeline to create their own streaming workflow quickly. Swapping out one model for another is accomplished by updating one configuration file and exporting data to the GXF native data format. Models saved in portable ONNX, as well as the NVIDIA performance-optimized TRT format, can be run on GXF’s built-in inference engines.
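For example, a PyTorch model can be exported to portable ONNX before being dropped into the pipeline; a minimal sketch, where the ResNet-18 model, input shape, and file name are placeholders rather than part of the SDK:

import torch
import torchvision

# Placeholder model and input; substitute your own trained network.
model = torchvision.models.resnet18(pretrained=True).eval()
dummy_input = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model, dummy_input, "my_model.onnx",
    input_names=["input"], output_names=["output"],
    opset_version=13,
)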
Support for the Clara Developer Kit
The Clara Holoscan SDK 0.2 is supported on the Clara AGX Developer Kit and the new Clara Holoscan Developer Kit. The next-generation Clara Holoscan Developer Kit is built with a high-performance NVIDIA Orin module, a powerful RTX A6000 GPU, and the connectivity performance of the ConnectX SmartNIC.
This kit is the ideal solution for developing the next generation of software-defined medical devices. Orin is geared for autonomous machines with high-speed interface support for multiple sensors and 8X the performance of the last generation for multiple concurrent AI inference pipelines.
Updated JetPack 5.0HP1 with Ubuntu 20.04
The NVIDIA JetPack SDK contains the base OS for the Clara Holoscan SDK. For version 0.2, the JetPack SDK is being upgraded from version 4.5 to version 5.0HP1. This upgrades the OS to L4T rel-34, which is based on Ubuntu 20.04 with the LTS 5.10 kernel.
Get started with the Clara Holoscan SDK
The Clara Holoscan SDK 0.2 and source code are now accessible on GitHub with an Apache 2.0 license.
This release of Isaac Sim adds more tools for AI-based robotics including Isaac Gym support for RL, Isaac Cortex for cobot programming, and much more.
Today, NVIDIA announced the availability of the 2022.1 release of NVIDIA Isaac Sim. As a robotics simulation and synthetic data generation (SDG) tool, this NVIDIA Omniverse application accelerates the development, testing, and training of AI in robotics.
With Isaac Sim, developers can generate production quality datasets to train AI-perception models. Developers will also be able to simulate robotic navigation and manipulation, as well as build a test environment to validate robotics applications continually.
The latest version advances the age of AI robots with new tools like NVIDIA Isaac Cortex, a decision framework for training collaborative robots (cobots), and Isaac Gym, a GPU-accelerated reinforcement learning (RL) framework. NVIDIA Isaac Replicator, a set of synthetic data generation tools, APIs, and workflows, has also been updated with new capabilities to procedurally generate industrial environments for SDG.
Figure 1. Stacking blocks example from Isaac Cortex
NVIDIA Isaac Sim 2022.1 release highlights
Isaac Cortex: Program cobot tasks as easily as programming game AI. Leverage this decision framework for cobots to develop task-aware and adaptive skills. Its belief representation of the world, analogous to the robot’s brain, can take real or simulated data as inputs and generate the resulting actuations.
Isaac Gym: Train robots in minutes instead of weeks. Train complex robotic skills using RL. Isaac Gym is a GPU-accelerated tool that keeps the entire RL training workflow on the GPU, which is critical to reducing training time.
OmniGraph: Simplify application development and debugging with visual programming. Build robotic applications by visually connecting compute nodes together in this Omniverse visual programming and scripting environment. Robotic applications tend to be very modular and lend themselves well to visual programming.
Isaac Sim/Gazebo Connector: Move between both simulators depending on the task. ROS developers using Gazebo can import simulation assets into Isaac Sim for tasks like generating synthetic datasets or high-fidelity rendering. Additionally, multiple Gazebo simulations can stay live-synced by connecting to Omniverse’s Nucleus server.
Additional Features:
Windows Support (limited)
New Robots
Quadrupeds: A1, GO1, ANYmal
AMR: Obelix
New modular warehouse and conveyor assets
New ROS pipelines implemented in OmniGraph
Video 1. ShadowHands demo training in Isaac Gym
Training AI with synthetic data
Isaac Replicator is the synthetic data generation tool in Isaac Sim. Synthetic data is very useful in robotics to bootstrap training, address long-tail dataset challenges, and provide data that is unavailable in the real world, like speed and direction labels from synthetic video. Autonomous machines require synthetic data in training to ensure model robustness.
In the latest release, a new SDG feature called SceneBlox was added to generate scenes procedurally. SceneBlox can be used to create industrial environments like warehouses automatically. New examples were also added that demonstrate how to generate synthetic data and train a pose estimation model using Replicator.
Figure 2. Example of a warehouse procedurally generated using SceneBlox
#include "tensorflow/lite/interpreter.h" // getting a not found error
How can I resolve this error? My assumption is that I’d need to add tflite to my bash configuration to make it available for all of my projects. How can I add tflite to the bash file?
For Seq2Seq deep learning architectures, viz., LSTM/GRU, and multivariate, multistep time-series forecasting, it’s important to reshape the data to three dimensions: (batch_size, look_back, number_features). Here, _look_back_ decides the number of past data points/samples to consider, using _number_features_ from your training dataset. Similarly, _look_ahead_ needs to be defined: it is the number of steps into the future you want your model to forecast.
I have written a function to help achieve this:
import numpy as np

def split_series_multivariate(data, n_past, n_future):
    '''
    Create training and testing splits required by Seq2Seq architecture(s) for
    multivariate, multistep and multivariate output time-series modeling.
    '''
    X, y = list(), list()
    for window_start in range(len(data)):
        past_end = window_start + n_past
        future_end = past_end + n_future
        if future_end > len(data):
            break
        # slice past and future parts of window
        past, future = data[window_start:past_end, :], data[past_end:future_end, :]
        # past, future = data[window_start:past_end, :], data[past_end:future_end, 4]
        X.append(past)
        y.append(future)
    return np.array(X), np.array(y)
But, _look_back_ and _look_ahead_ are hyper-parameters which need to be tuned for a given dataset.
# Define hyper-parameters for Seq2Seq modeling:

# look-back window size
n_past = 30

# number of future steps to predict
n_future = 10

# number of features used
n_features = 8
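For reference, the split would then be applied like this (assuming `data` is a NumPy array of scaled training data with shape (num_samples, n_features)):

# Assumed: `data` has shape (num_samples, 8) after scaling.
X, y = split_series_multivariate(data, n_past, n_future)
# X.shape -> (num_windows, 30, 8), y.shape -> (num_windows, 10, 8)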
What is the _best practice_ for choosing/finding _look_back_ and _look_ahead_ hyper-parameters?
Some NVIDIA Jetson modules have limited storage space, which imposes a challenge in packing applications and libraries. Here are ways to cut down on disk usage.
NVIDIA Jetson provides flexible storage options/configurations for development, but some of the Jetson modules are equipped with a limited eMMC flash memory storage size for more cost-conscious, large-scale product deployment.
It may initially seem impossible to fit your applications and necessary libraries in the limited storage space, especially with the full set of NVIDIA JetPack, BSP, and all the development software that NVIDIA has prepackaged for Jetson.
Table 1. Disk usage in the original configurations
However, you can cut down on disk usage by removing unnecessary packages, libraries, and other assets. Table 2 shows how you can reclaim more than 8 GB of storage space on some of the latest NVIDIA JetPack versions.
Table 2. Disk usage in an optimized deployment configuration
In this post, I present simplified steps to minimize the disk usage on your Jetson device, while sharing tips on ways to analyze disk usage, actual commands, and the example outputs on different versions of JetPack. I also show how to check if an AI application is still working functionally under the slimmed-down configuration.
Identifying what takes up space
The jetson-min-disk documentation shows how to analyze the current disk usage, identify what files and directories take up the space, and clarify the package dependencies. It also shows example command outputs on NVIDIA JetPack 4.6.x and NVIDIA JetPack 5.0.x, so that you can assess how much you may be able to cut down for your application.
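If you want a quick scan of your own before diving into those docs, here is a small generic Python helper (not part of the jetson-min-disk tooling) that ranks the largest directories under a given path:

import os

def directory_sizes(root):
    """Return (path, total_bytes) for each directory directly under root, largest first."""
    sizes = []
    for entry in os.scandir(root):
        if not entry.is_dir(follow_symlinks=False):
            continue
        total = 0
        for dirpath, _, filenames in os.walk(entry.path):
            for name in filenames:
                try:
                    total += os.path.getsize(os.path.join(dirpath, name))
                except OSError:
                    pass  # skip files that disappear or are unreadable
        sizes.append((entry.path, total))
    return sorted(sizes, key=lambda item: item[1], reverse=True)

for path, nbytes in directory_sizes("/usr")[:10]:
    print(f"{nbytes / 1e9:6.2f} GB  {path}")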
Minimized configurations
Figure 1 shows an overview of the minimal configurations. The jetson-min-disk documentation introduces multiple configurations ([A] to [D]) for different development and productization needs.
Figure 1. Different minimal configurations
You can take the following actions to regain disk space.
Remove the desktop user interface.
Remove the documentation and samples package.
Remove dev packages.
Remove the desktop graphical user interface
You can remove ubuntu-desktop if you know your system does not require a graphical user interface on the NVIDIA Jetson native display output through HDMI, DP/eDP, or LVDS. For more information, see Removing GUI.
If you have installed the full set of JetPack components (libraries and SDKs) either with the sudo apt install nvidia-jetpack command or by using SDK Manager to install all, you may have packages that you do not need for your application.
Documentation and samples packages are some of the safest to remove, so you can start by uninstalling them. For more information, see Removing docs/sample.
When you are done with building your applications, you do not need dev packages that provide header files and static libraries. You can remove them after checking how much disk space each package takes up. For more information, see Removing dev packages.
If you are using an x86-64 Linux host machine to flash your Jetson, you can create a minimal-configuration RootFS and flash that image onto Jetson.
For more information about building the minimal L4T RootFS image, see Option: Minimal L4T.
Verification
The guide introduces ways to use the NVIDIA DeepStream reference app as a typical AI application to verify the minimally configured Jetson environment. For more information, see Verification.
git clone https://github.com/NVIDIA-AI-IOT/jetson-min-disk/
cd jetson-min-disk
cd test-docker
./docker-run-deepstream-app-overlay.sh
Figure 2. DeepStream reference app with overlay output config in action
Conclusion
In this post, I demonstrated ways to work within NVIDIA Jetson storage space limitations while identifying and keeping the essential runtime libraries for AI applications.
The documentation cross-referenced in this post provides commands and tips for different NVIDIA JetPack versions. They can be great tools if you are interested in optimizing the storage usage, especially on NVIDIA Jetson production modules.
To improve NVIDIA GPU utilization in K8s clusters, we offer new GPU time-slicing APIs, enabling multiple GPU-accelerated workloads to time-slice and run on a single NVIDIA GPU.
For scalable data center performance, NVIDIA GPUs have become a must-have.
NVIDIA GPU parallel processing capabilities, supported by thousands of computing cores, are essential to accelerating a wide variety of applications across different industries. The most compute-intensive applications across diverse industries use GPUs today:
High-performance computing, such as aerospace, bioscience research, or weather forecasting
Consumer applications that use AI to improve search, recommendations, language translation, or transportation, such as autonomous driving
Healthcare, such as enhanced medical imaging
Financial, such as fraud detection
Entertainment, such as visual effects
Different applications across this spectrum can have different computational requirements. Training giant AI models, where the GPUs batch-process hundreds of data samples in parallel, keeps the GPUs fully utilized during the training process. However, many other application types may only require a fraction of the GPU compute, resulting in underutilization of the massive computational power.
In such cases, provisioning the right-sized GPU acceleration for each workload is key to improving utilization and reducing the operational costs of deployment, whether on-premises or in the cloud.
To address the challenge of GPU utilization in Kubernetes (K8s) clusters, NVIDIA offers multiple GPU concurrency and sharing mechanisms to suit a broad range of use cases. The latest addition is the new GPU time-slicing APIs, now broadly available in Kubernetes with NVIDIA K8s Device Plugin 0.12.0 and the NVIDIA GPU Operator 1.11. Together, they enable multiple GPU-accelerated workloads to time-slice and run on a single NVIDIA GPU.
Before diving into this new feature, here’s some background on use cases where you should consider sharing GPUs and an overview of all the technologies available to do that.
When to share NVIDIA GPUs
Here are some example workloads that can benefit from sharing GPU resources for better utilization:
Low-batch inference serving, which may only process one input sample on the GPU
High-performance computing (HPC) applications, such as simulating photon propagation, that balance computation between the CPU (to read and process inputs) and GPU (to perform computation). Some HPC applications may not achieve high throughput on the GPU portion due to bottlenecks on the CPU core performance.
Interactive development for ML model exploration using Jupyter notebooks
Spark-based data analytics applications, where some tasks, or the smallest units of work, are run concurrently and benefit from better GPU utilization
Visualization or offline rendering applications that may be bursty in nature
Continuous integration/continuous delivery (CI/CD) pipelines that want to use any available GPUs for testing
In this post, we explore the various technologies available for sharing access to NVIDIA GPUs in a Kubernetes cluster, including how to use them and the tradeoffs to consider while choosing the right approach.
GPU concurrency mechanisms
The NVIDIA GPU hardware, in conjunction with the CUDA programming model, provides a number of different concurrency mechanisms for improving GPU utilization. The mechanisms range from programming model APIs, where the applications need code changes to take advantage of concurrency, to system software and hardware partitioning including virtualization, which are transparent to applications (Figure 1).
Figure 1. GPU concurrency mechanisms
CUDA streams
The asynchronous model of CUDA means that a single CUDA context, analogous to a host process on the GPU side, can perform a number of operations concurrently using CUDA streams.
A stream is a software abstraction that represents a sequence of commands, which may be a combination of computation kernels, memory copies, and so on that all execute in order. Work launched in two different streams can execute simultaneously, allowing for coarse-grained parallelism. The application can manage parallelism using CUDA streams and stream priorities.
CUDA streams maximize GPU utilization for inference serving, for example, by using streams to run multiple models in parallel. You either scale the same model or serve different models. For more information, see Asynchronous Concurrent Execution.
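As a small illustration of the pattern (a sketch using PyTorch, which the post does not mention; the two linear layers stand in for two served models):

import torch

# Two placeholder models sharing one GPU, each with its own CUDA stream.
model_a = torch.nn.Linear(1024, 1024).cuda().eval()
model_b = torch.nn.Linear(1024, 1024).cuda().eval()
stream_a, stream_b = torch.cuda.Stream(), torch.cuda.Stream()

x = torch.randn(64, 1024, device="cuda")
stream_a.wait_stream(torch.cuda.current_stream())  # make the input visible to both streams
stream_b.wait_stream(torch.cuda.current_stream())

with torch.no_grad():
    with torch.cuda.stream(stream_a):
        out_a = model_a(x)   # kernels enqueued on stream_a
    with torch.cuda.stream(stream_b):
        out_b = model_b(x)   # kernels enqueued on stream_b; may overlap with stream_a

torch.cuda.synchronize()     # wait for both streams before consuming the outputs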
The tradeoff with streams is that the APIs can only be used within a single application, offering limited hardware isolation and error isolation between the various streams, as all resources are shared.
Time-slicing
When dealing with multiple CUDA applications, each of which may not fully utilize the GPU’s resources, you can use a simple oversubscription strategy to leverage the GPU’s time-slicing scheduler. This is supported by compute preemption starting with the Pascal architecture. This technique, sometimes called temporal GPU sharing, does carry a cost for context-switching between the different CUDA applications, but some underutilized applications can still benefit from this strategy.
Since CUDA 11.1 (R455+ drivers), the time-slice duration for CUDA applications is configurable through the nvidia-smi utility:
$ nvidia-smi compute-policy --help
Compute Policy -- Control and list compute policies.
Usage: nvidia-smi compute-policy [options]
Options include:
[-i | --id]: GPU device ID's. Provide comma
separated values for more than one device
[-l | --list]: List all compute policies
[ | --set-timeslice]: Set timeslice config for a GPU:
0=DEFAULT, 1=SHORT, 2=MEDIUM, 3=LONG
[-h | --help]: Display help information
The tradeoffs with time-slicing are increased latency, jitter, and potential out-of-memory (OOM) conditions when many different applications are time-slicing on the GPU. This mechanism is what we focus on in the second part of this post.
CUDA Multi-Process Service
You can take the oversubscription strategy described earlier a step further with CUDA MPS. MPS enables CUDA kernels from different processes, typically MPI ranks, to be processed concurrently on the GPU when each process is too small to saturate the GPU’s compute resources. Unlike time-slicing, MPS enables the CUDA kernels from different processes to execute in parallel on the GPU.
Newer releases of CUDA (since CUDA 11.4+) have added more fine-grained resource provisioning in terms of being able to specify limits on the amount of memory allocatable (CUDA_MPS_PINNED_DEVICE_MEM_LIMIT) and the available compute to be used by MPS clients (CUDA_MPS_ACTIVE_THREAD_PERCENTAGE). For more information about the usage of these tuning knobs, see the Volta MPS Execution Resource Provisioning.
The tradeoffs with MPS are the limitations with error isolation, memory protection, and quality of service (QoS). The GPU hardware resources are still shared among all MPS clients. You can use CUDA MPS with Kubernetes today, but NVIDIA plans to improve support for MPS over the coming months.
Multi-instance GPU (MIG)
The mechanisms discussed so far rely either on changes to the application using the CUDA programming model APIs, such as CUDA streams, or CUDA system software, such as time-slicing or MPS.
With MIG, GPUs based on the NVIDIA Ampere Architecture, such as the NVIDIA A100, can be securely partitioned into up to seven separate GPU instances for CUDA applications, providing multiple applications with dedicated GPU resources. These include streaming multiprocessors (SMs) and GPU engines, such as copy engines or decoders, to provide a defined QoS with fault isolation for different clients such as processes, containers, or virtual machines (VMs).
When the GPU is partitioned, you can use the prior mechanisms of CUDA streams, CUDA MPS, and time-slicing within a single MIG instance.
Virtualization with vGPU
NVIDIA vGPU enables virtual machines with full input-output memory management unit (IOMMU) protection to have simultaneous, direct access to a single physical GPU. Apart from security, NVIDIA vGPU brings in other benefits such as VM management with live VM migration and the ability to run mixed VDI and compute workloads, as well as integration with a number of industry hypervisors.
On GPUs that support MIG, each GPU partition is exposed as a single-root I/O virtualization (SR-IOV) virtual function for a VM. All VMs can run in parallel as opposed to being time-sliced (on GPUs that do not support MIG).
Table 1 summarizes these technologies including when to consider these concurrency mechanisms.
| | Streams | MPS | Time-Slicing | MIG | vGPU |
| --- | --- | --- | --- | --- | --- |
| Partition Type | Single process | Logical | Temporal (Single process) | Physical | Temporal & Physical – VMs |
| Max Partitions | Unlimited | 48 | Unlimited | 7 | Variable |
| SM Performance Isolation | No | Yes (by percentage, not partitioning) | Yes | Yes | Yes |
| Memory Protection | No | Yes | Yes | Yes | Yes |
| Memory Bandwidth QoS | No | No | No | Yes | Yes |
| Error Isolation | No | No | Yes | Yes | Yes |
| Cross-Partition Interop | Always | IPC | Limited IPC | Limited IPC | No |
| Reconfigure | Dynamic | At process launch | N/A | When idle | N/A |
| GPU Management (telemetry) | N/A | Limited GPU metrics | N/A | Yes – GPU metrics, support for containers | Yes – live migration and other industry virtualization tools |
| Target use cases (and when to use each) | Optimize for concurrency within a single application | Run multiple applications in parallel but can deal with limited resiliency | Run multiple applications that are not latency-sensitive or can tolerate jitter | Run multiple applications in parallel but need resiliency and QoS | Support multi-tenancy on the GPU through virtualization and need VM management benefits |
Table 1. Comparison of GPU concurrency mechanisms
With this background, the rest of the post focuses on oversubscribing GPUs using the new time-slicing APIs in Kubernetes.
Time-slicing support in Kubernetes
NVIDIA GPUs are advertised as schedulable resources in Kubernetes through the device plugin framework. However, this framework only allows for devices, including GPUs (as nvidia.com/gpu) to be advertised as integer resources and thus does not allow for oversubscription. In this section, we discuss a new method for oversubscribing GPUs in Kubernetes using time-slicing.
Before we discuss the new APIs, we introduce a new mechanism for configuring the NVIDIA Kubernetes device plugin using a configuration file.
New configuration file support
The Kubernetes device plugin offers a number of options for configuration, which are set either as command-line options or environment variables, such as setting the MIG strategy, device enumeration, and so on. gpu-feature-discovery (GFD) uses similar options for generating labels to describe GPU nodes.
As configuration options become more complex, you use a configuration file to express these options to the Kubernetes device plugin and GFD, which is then deployed as a configmap object and applied to the plugin and the GFD pods during startup.
The configuration options are expressed in a YAML file. In the following example, you record the various options in a file called dp-example-config.yaml, created under /tmp.
The configuration is applied to all GPUs on all nodes by default. The Kubernetes device plugin enables multiple configuration files to be specified. You can override the configuration on a node-by-node basis by overwriting a label on the node.
The Kubernetes device plugin uses a sidecar container that detects changes in desired node configurations and reloads the device plugin so that new configurations can take effect. In the following example, you create two configurations for the device plugin: a default that is applied to all nodes and another that you can apply to A100 GPU nodes on demand.
The Kubernetes device plugin then enables dynamic changes to the configuration whenever the node label is overwritten, allowing for configuration on a per-node basis if so desired.
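The time-slicing settings themselves live under a sharing.timeSlicing section of this configuration file. The original example is not reproduced here; a sketch of its shape, with placeholder values, looks like the following:

version: v1
sharing:
  timeSlicing:
    renameByDefault: <bool>
    failRequestsGreaterThanOne: <bool>
    resources:
    - name: <resource-name>
      replicas: <num-replicas>
    ...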
That is, for each named resource under sharing.timeSlicing.resources, a number of replicas can now be specified for that resource type.
Moreover, if renameByDefault=true, then each resource is advertised under the name <resource-name>.shared instead of simply <resource-name>.
The failRequestsGreaterThanOne flag is false by default for backward compatibility. It controls whether pods can request more than one GPU resource. A request of more than one GPU does not imply that the pod gets proportionally more time slices, as the GPU scheduler currently gives an equal share of time to all processes running on the GPU.
The failRequestsGreaterThanOne flag configures the behavior of the plugin to treat a request of one GPU as an access request rather than an exclusive resource request.
As the new oversubscribed resources are created, the Kubernetes device plugin assigns these resources to the requesting jobs. When two or more jobs land on the same GPU, the jobs automatically use the GPU’s time-slicing mechanism. The plugin does not offer any other additional isolation benefits.
Labels applied by GFD
For GFD, the labels that get applied depend on whether renameByDefault=true. Regardless of the setting for renameByDefault, the following label is always applied:
nvidia.com/<resource-name>.replicas = <num-replicas>
However, when renameByDefault=false, the following suffix is also added to the nvidia.com/<resource-name>.product label:
nvidia.com/gpu.product = <product-name>-SHARED
Using these labels, you have a way of selecting a shared or non-shared GPU in the same way you would traditionally select one GPU model over another. That is, the SHARED annotation ensures that you can use a nodeSelector object to attract pods to nodes that have shared GPUs on them. Moreover, the pods can ensure that they land on a node that is dividing a GPU into their desired proportions using the new replicas label.
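For instance, a pod could target shared GPUs with a nodeSelector like the following sketch; the pod name, image, and the Tesla-T4-SHARED product value are illustrative assumptions:

apiVersion: v1
kind: Pod
metadata:
  name: shared-gpu-pod                        # hypothetical pod name
spec:
  nodeSelector:
    nvidia.com/gpu.product: Tesla-T4-SHARED   # assumes shared T4 nodes with renameByDefault=false
  containers:
  - name: app
    image: my-registry/my-gpu-app:latest      # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1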
Oversubscribing example
Here’s a complete example of oversubscribing GPU resources using the time-slicing APIs. In this example, you walk through the additional configuration settings for the Kubernetes device plugin and GFD to set up GPU oversubscription and launch a workload using the specified resources.
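The original post's configuration is not reproduced here; based on the arithmetic in the next paragraph (eight GPUs advertised as 40 resources), it would be a sketch along these lines, with five replicas per GPU:

version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 5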
If this configuration were applied to a node with eight GPUs on it, the plugin would now advertise 40 nvidia.com/gpu resources to Kubernetes instead of eight. If the renameByDefault: true option was set, then 40 nvidia.com/gpu.shared resources would be advertised instead of eight nvidia.com/gpu resources.
You enable time-slicing in the following example configuration. In this example, you oversubscribe the GPUs by 2x:
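A sketch of such a 2x configuration, assuming the default nvidia.com/gpu resource name:

version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 2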
Next, deploy two applications (in this case, an FP16 CUDA GEMM workload) with each requesting one GPU. Observe that the applications context switch on the GPU and thus only achieve approximately half the peak FP16 bandwidth on a T4.
You can now see the two containers deployed and running on a single physical GPU, which would not have been possible in Kubernetes without the new time-slicing APIs:
$ kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
default dcgmproftester-1 1/1 Running 0 45s
default dcgmproftester-2 1/1 Running 0 45s
kube-system calico-kube-controllers-6fcb5c5bcf-cl5h5 1/1 Running 3 32d
You can use nvidia-smi on the host to see that the two containers are scheduled on the same physical GPU by the plugin and context switch on the GPU:
$ nvidia-smi -L
GPU 0: Tesla T4 (UUID: GPU-491287c9-bc95-b926-a488-9503064e72a1)
$ nvidia-smi
......
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 466420 C /usr/bin/dcgmproftester11 315MiB |
| 0 N/A N/A 466421 C /usr/bin/dcgmproftester11 315MiB |
+-----------------------------------------------------------------------------+
Summary
Get started with leveraging the new GPU oversubscription support in Kubernetes today. Helm charts for the new release of the Kubernetes device plugin make it easy to start using the feature right away.
The short-term roadmap includes integration with the NVIDIA GPU Operator so that you can get access to the feature, whether that is with Red Hat’s OpenShift, VMware Tanzu, or provisioned environments such as NVIDIA Cloud Native Core on NVIDIA LaunchPad. NVIDIA is also working on improving support for CUDA MPS in the Kubernetes device plugin so that you can take advantage of other GPU concurrency mechanisms within Kubernetes.
If you have questions or comments, please leave them in the comments section. For technical questions about installation and usage, we recommend filing an issue on the NVIDIA/k8s-device-plugin GitHub repo. We appreciate your feedback!