How to integrate the NVIDIA GPU and Network Operators
NVIDIA Operators simplify GPU and SmartNIC management on Kubernetes. This post shows how to integrate NVIDIA Operators into new edge AI platforms using preinstalled drivers. This is the first post in a two-part series. The next post describes how to integrate NVIDIA Operators using custom driver containers.
AI makes sensor data actionable. Trained AI models recognize patterns and trigger responses. A trained AI model represents a company’s business intelligence. Just as crude oil becomes valuable when refined into fuel, AI transforms sensor data into insight.
That is why edge AI needs acceleration. NVIDIA GPUs and SmartNICs future-proof an edge AI platform against exponential data growth.
Edge AI is Cloud Native
This post describes how to integrate NVIDIA accelerators with Kubernetes. Why focus on Kubernetes? Because edge AI is cloud native. Most AI applications are container-based microservices. Kubernetes is the de facto standard for container orchestration.
Edge AI platforms build on Kubernetes due to its flexibility. The Kubernetes API supports declarative automation and is extensible through custom resource definitions. A robust software ecosystem supports Kubernetes day one and day two operations.
NVIDIA Fleet Command is one example of a Kubernetes-based edge AI platform. Fleet Command is a hybrid cloud service designed for security and performance. It manages the AI application lifecycle on bare metal edge nodes. Fleet Command also integrates with NGC, NVIDIA’s curated registry of more than 700 GPU-optimized applications.
While Fleet Command supports NVIDIA GPUs and SmartNICs, many edge platforms do not. For those, NVIDIA provides open-source Kubernetes operators to enable GPU and SmartNIC acceleration. There are two operators: the NVIDIA GPU Operator and the NVIDIA Network Operator.
The NVIDIA GPU Operator automates GPU deployment and management on Kubernetes. The GPU Operator Helm Chart is available on NGC. It includes several components:
Figure 1. These components make up the NVIDIA GPU Operator
The NVIDIA Network Operator automates ConnectX SmartNIC configuration for Kubernetes pods that need fast networking. It is also delivered as a Helm chart. The Network Operator adds a second network interface to a pod using the Multus CNI plug-in. It supports both Remote Direct Memory Access (RDMA) and Single Root I/O Virtualization (SRIOV).
The NVIDIA Network Operator includes the following components:
The SRIOV device plug-in attaches SRIOV Virtual Functions (VFs) to pods.
The Containernetworking CNI plug-in is a standard interface for extending Kubernetes networking capabilities.
The Whereabouts CNI plug-in automates cluster-wide IP address creation and assignment.
The MACVLAN CNI functions as a virtual switch to connect pods to network functions.
The Multus CNI plug-in enables attaching multiple network devices to a Kubernetes pod.
The Host-device CNI plug-in moves an existing device (such as an SRIOV VF) from the host to the pod’s network namespace.
Figure 2. These components make up the NVIDIA Network Operator
Both operators use Node Feature Discovery. This service identifies which cluster nodes have GPUs and SmartNICs.
The operators work together or separately. Deploying them together enables GPUDirect RDMA. This feature bypasses host buffering to increase throughput between the NIC and GPU.
The NVIDIA operators are open source software. They already support popular Kubernetes distributions running on NVIDIA Certified servers. But many edge platforms run customized Linux distributions the operators do not support. This post explains how to integrate NVIDIA operators with those platforms.
Two Paths, One Way
Figure 3. The two methods for integrating NVIDIA Operators: preinstalled drivers or custom driver containers
Portability is one of the main benefits of cloud native software. Containers bundle applications with their dependencies. This lets them run, scale, and migrate across different platforms without friction.
NVIDIA operators are container-based, cloud native applications. Most of the operator services do not need any integration to run on a new platform. But both operators include driver containers, and drivers are the exception. Drivers are kernel-dependent. Integrating NVIDIA operators with a new platform involves rebuilding the driver containers for the target kernel. The platform may be running an unsupported Linux distribution or a custom-compiled kernel.
There are two approaches to delivering custom drivers:
The first approach is to install the drivers onto the host before installing the operators. Many edge platforms deliver signed drivers in their base operating system image to support secure and measured boot. Platforms requiring signed drivers cannot use the driver containers deployed by the operators. NVIDIA Fleet Command follows this pattern. Both the Network and GPU operators support preinstalled drivers by disabling their own driver containers.
The second approach is to replace the operator’s driver containers with custom containers. Edge platforms with immutable file systems prefer this method. Edge servers often run as appliances. They use read-only file systems to increase security and prevent configuration drift. Running driver and application containers in memory instead of adding them to the immutable image reduces its size and complexity. This also allows the same image to run on nodes with different hardware profiles.
This two-part series explains how to set up both patterns. This post describes driver preinstallation. The second post describes how to build and install custom driver containers.
Apart from the driver containers, the remaining operator services generally run on new platforms without modification. NVIDIA tests both operators on leading container runtimes such as Docker Engine, CRI-O, and containerd. The GPU Operator also supports the runtime class resource for per-pod runtime selection.
Preinstalled driver integration
The rest of this post shows how to integrate NVIDIA operators with a custom edge platform using preinstalled drivers. It includes step-by-step procedures for the driver preinstallation method. The driver container method is covered in the second post of the series.
Table 1 describes the test system used to demonstrate these procedures.
Table 1. Test system description

| Component | Version | Component | Version |
|---|---|---|---|
| Linux distribution | CentOS 7.9.2009 | GPU Operator | v1.8.2 |
| Kernel version | 3.10.0-1160.45.1.el7.custom | GPU driver (operator) | 470.74 |
| Container runtime | Crio-21.3 | Network Operator | v1.0.0 |
| Kubernetes | 1.21.3-0 | MOFED (operator) | 5.4-1.0.3.0 |
| Helm | v3.3.3 | CUDA | 11.4 |
| Cluster network | Calico v3.20.2 | GPU driver (local) | 470.57.02 |
| Compiler | GCC 4.8.5 2015062 | MOFED (local) | 5.4-1.0.3.0 |
| Developer tools | Elfutils 0.176-5 | Node Feature Discovery | v0.8.0 |
| Server | NVIDIA DRIVE Constellation | GPU | A100-PCIE-40GB |
| Server BIOS | v5.12 | SmartNIC | ConnectX-6 Dx MT2892 |
| CPU | (2) Intel Xeon Gold 6148 | SmartNIC firmware | 22.31.1014 |
Neither operator supports the combination of operating system, Linux kernel, and container runtime on the test system. The Linux kernel is custom compiled, so precompiled drivers are not available. The test system also uses the CRI-O container runtime, which is less common than alternatives like containerd and Docker Engine.
Prepare the System
1. Verify that the ConnectX SmartNIC and NVIDIA GPU are visible on the test system.
$ lspci | egrep 'nox|NVI'
23:00.0 3D controller: NVIDIA Corporation Device 20f1 (rev a1)
49:00.0 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]
49:00.1 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]
5e:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
e3:00.0 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]
e3:00.1 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]
e6:00.0 3D controller: NVIDIA Corporation Device 20f1 (rev a1)
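On systems with many PCI devices, the same check can be scripted. The following is a minimal sketch, not part of the original procedure: the function name `summarize_accelerators` is hypothetical, and it assumes the vendor strings appear in `lspci` output exactly as they do in the listing above.

```python
import re


def summarize_accelerators(lspci_text):
    """Group PCI addresses from `lspci` output by accelerator vendor.

    Assumption: vendor names appear as "NVIDIA Corporation" and
    "Mellanox Technologies", as in the sample output in this post.
    """
    found = {"gpus": [], "nics": []}
    for line in lspci_text.splitlines():
        # Each lspci line starts with the PCI address, e.g. "23:00.0".
        match = re.match(r"^(\S+)\s+(.*)$", line.strip())
        if not match:
            continue
        address, description = match.groups()
        if "NVIDIA Corporation" in description:
            found["gpus"].append(address)
        elif "Mellanox Technologies" in description:
            found["nics"].append(address)
    return found


if __name__ == "__main__":
    import subprocess

    # Requires the lspci utility on the host.
    output = subprocess.run(
        ["lspci"], capture_output=True, text=True, check=True
    ).stdout
    print(summarize_accelerators(output))
```

An empty `gpus` or `nics` list would suggest that the corresponding device is not visible to the host and that the hardware or BIOS settings need attention before continuing.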
2. View the operating system and Linux kernel versions. In this example, the CentOS 7 3.10.0-1160.45.1 kernel was recompiled to 3.10.0-1160.45.1.el7.custom.x86_64.
3. View the Kubernetes version, network configuration, and cluster nodes. This output shows a single node cluster, which is a typical pattern for edge AI deployments. The node is running Kubernetes version 1.21.
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
cgx-20 Ready control-plane 23d v1.21.3
4. View the installed container runtime. This example shows the CRI-O container runtime.
5. NVIDIA delivers operators through Helm charts. View the installed Helm version.
$ helm version
version.BuildInfo{Version:"v3.3.3", GitCommit:"55e3ca022e40fe200fbc855938995f40b2a68ce0", GitTreeState:"clean", GoVersion:"go1.14.9"}
Install the Network Operator with Preinstalled Drivers
The Mellanox OpenFabrics Enterprise Distribution for Linux (MOFED) installs open source drivers and libraries for high-performance networking. The NVIDIA Network Operator optionally installs a MOFED container to load these drivers and libraries on Kubernetes. This section describes how to preinstall MOFED drivers on the host when the included driver container cannot be used.
5. After reboot, make sure that the drivers are loaded.
$ /etc/init.d/openibd status
HCA driver loaded
Configured Mellanox EN devices:
enp94s0
ens13f0
ens13f1
ens22f0
ens22f1
Currently active Mellanox devices:
enp94s0
ens13f0
ens13f1
ens22f0
ens22f1
The following OFED modules are loaded:
rdma_ucm
rdma_cm
ib_ipoib
mlx5_core
mlx5_ib
ib_uverbs
ib_umad
ib_cm
ib_core
mlxfw
Once MOFED is successfully installed and the drivers are loaded, proceed to installing the NVIDIA Network Operator.
6. Identify the secondary network device name. This will be the device or devices plumbed into the pod as a secondary network interface.
$ ibdev2netdev
mlx5_0 port 1 ==> ens13f0 (Up)
mlx5_1 port 1 ==> ens13f1 (Down)
mlx5_2 port 1 ==> enp94s0 (Up)
mlx5_3 port 1 ==> ens22f0 (Up)
mlx5_4 port 1 ==> ens22f1 (Down)
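Picking the secondary interface by eye works for a single node. For several nodes, the mapping can be parsed instead. The sketch below is an illustration only: the function name `up_interfaces` is hypothetical, and the regular expression assumes the `ibdev2netdev` line format shown above.

```python
import re

# Matches lines such as "mlx5_0 port 1 ==> ens13f0 (Up)".
LINE_RE = re.compile(r"^(\S+) port (\d+) ==> (\S+) \((Up|Down)\)$")


def up_interfaces(ibdev2netdev_text):
    """Return (rdma_device, netdev) pairs whose link state is Up."""
    pairs = []
    for line in ibdev2netdev_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match and match.group(4) == "Up":
            pairs.append((match.group(1), match.group(3)))
    return pairs


if __name__ == "__main__":
    sample = """\
mlx5_0 port 1 ==> ens13f0 (Up)
mlx5_1 port 1 ==> ens13f1 (Down)
mlx5_2 port 1 ==> enp94s0 (Up)
"""
    for rdma_device, netdev in up_interfaces(sample):
        print(rdma_device, netdev)
```

Only interfaces reported as Up are candidates for the secondary network. Which of them to use still depends on how the node is cabled.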
7. By default the Network Operator does not deploy to a Kubernetes master. Remove the master label from the node to accommodate the all-in-one cluster deployment.
Note: This is a temporary workaround that lets the Network Operator schedule pods to the master node in a single-node cluster. Future versions of the Network Operator will add toleration and nodeAffinity to avoid this workaround.
8. Add the Mellanox Helm chart repository.
$ helm repo add mellanox https://mellanox.github.io/network-operator
$ helm repo update
$ helm repo ls
NAME URL
mellanox https://mellanox.github.io/network-operator
9. Create a values.yaml to specify Network Operator configuration. This example deploys the RDMA shared device plug-in and specifies ens13f0 as the RDMA-capable interface.
11. Verify that all Network Operator pods are in Running status.
$ kubectl get pods -n nvidia-network-operator-resources
NAME READY STATUS RESTARTS AGE
cni-plugins-ds-fcrsq 1/1 Running 0 3m44s
kube-multus-ds-4n526 1/1 Running 0 3m44s
rdma-shared-dp-ds-5rq4x 1/1 Running 0 3m44s
whereabouts-9njxm 1/1 Running 0 3m44s
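In an automated install, the status check can be performed on the text that `kubectl get pods` prints. This is a minimal sketch under stated assumptions: the helper name `pods_not_ready` is hypothetical, and it assumes the default column layout with NAME and STATUS headers, as in the output above.

```python
def pods_not_ready(kubectl_output, ok_statuses=("Running", "Completed")):
    """Return (name, status) for every pod whose status is not acceptable.

    Assumption: `kubectl_output` is the default table printed by
    `kubectl get pods`, with a header row containing NAME and STATUS.
    """
    lines = [line for line in kubectl_output.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    name_idx = header.index("NAME")
    status_idx = header.index("STATUS")
    failing = []
    for line in lines[1:]:
        fields = line.split()
        if len(fields) <= max(name_idx, status_idx):
            continue  # skip malformed or truncated rows
        if fields[status_idx] not in ok_statuses:
            failing.append((fields[name_idx], fields[status_idx]))
    return failing


if __name__ == "__main__":
    sample = """\
NAME READY STATUS RESTARTS AGE
cni-plugins-ds-fcrsq 1/1 Running 0 3m44s
kube-multus-ds-4n526 1/1 Running 0 3m44s
"""
    print(pods_not_ready(sample))
```

An empty list means every pod is in an accepted state. The same helper applies to the GPU Operator namespace, where pods finish as either Running or Completed.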
Note that some versions of Calico are incompatible with certain Multus CNI versions. After the Multus daemonset starts, change the CNI version in the Multus configuration file from 0.4.0 to 0.3.1.
$ sed -i 's/0.4.0/0.3.1/' /etc/cni/net.d/00-multus.conf
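The sed command rewrites the first occurrence of the version string on each line, wherever it appears. Because CNI configuration files are JSON, a more targeted alternative is to change only the `cniVersion` fields. The following is a sketch of that idea, not the method used in this post: the function names are hypothetical, and unlike the sed command it updates every matching `cniVersion` field, including those in nested delegate configurations.

```python
import json


def downgrade_cni_version(conf, old="0.4.0", new="0.3.1"):
    """Return a copy of a CNI config with matching cniVersion values replaced."""
    if isinstance(conf, dict):
        updated = {}
        for key, value in conf.items():
            if key == "cniVersion" and value == old:
                updated[key] = new
            else:
                updated[key] = downgrade_cni_version(value, old, new)
        return updated
    if isinstance(conf, list):
        return [downgrade_cni_version(item, old, new) for item in conf]
    return conf  # strings, numbers, booleans, None are returned unchanged


def rewrite_file(path, old="0.4.0", new="0.3.1"):
    """Apply downgrade_cni_version to a JSON file in place."""
    with open(path) as handle:
        conf = json.load(handle)
    with open(path, "w") as handle:
        json.dump(downgrade_cni_version(conf, old, new), handle, indent=2)


if __name__ == "__main__":
    # Hypothetical, trimmed-down example of a Multus configuration.
    example = {"cniVersion": "0.4.0", "name": "multus-cni-network", "type": "multus"}
    print(json.dumps(downgrade_cni_version(example), indent=2))
```

Calling `rewrite_file("/etc/cni/net.d/00-multus.conf")` as root would apply the change to the file named in the sed command.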
12. The Helm chart creates a configMap that is used to label the node with the selectors defined in the values.yaml file. Verify that the node is correctly labeled by NFD and that the RDMA shared devices are created.
7. Install the GPU Operator Helm chart repository.
$ helm repo add nvidia https://nvidia.github.io/gpu-operator
$ helm repo update
$ helm repo ls
NAME URL
nvidia https://nvidia.github.io/gpu-operator
mellanox https://mellanox.github.io/network-operator
8. Install the GPU Operator Helm chart. Overriding the driver.enabled parameter to false disables driver container installation. Also specify crio as the container runtime.
$ helm install --generate-name nvidia/gpu-operator --set driver.enabled=false --set toolkit.version=1.7.1-centos7 --set operator.defaultRuntime=crio
$ helm ls
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
gpu-operator-1635194696 default 1 2021-10-25 16:44:57.237363636 -0400 EDT deployed gpu-operator-v1.8.2 v1.8.2
9. View the GPU Operator resources. All pods should be in status Running or Completed.
10. View the validation pod logs to verify validation tests completed.
$ kubectl logs -n gpu-operator-resources nvidia-device-plugin-validator-845pw
device-plugin workload validation is successful
$ kubectl logs -n gpu-operator-resources nvidia-cuda-validator-ndc78
cuda workload validation is successful
11. Run nvidia-smi from within the validator container to display the GPU, driver, and CUDA versions. This also validates that the container runtime prestart hook works as expected.
$ kubectl exec -n gpu-operator-resources -i -t nvidia-operator-validator-5ngbk --container nvidia-operator-validator -- nvidia-smi
Mon Oct 25 20:57:28 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  Off  | 00000000:23:00.0 Off |                    0 |
| N/A   26C    P0    32W / 250W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCI...  Off  | 00000000:E6:00.0 Off |                    0 |
| N/A   26C    P0    32W / 250W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Test the Preinstalled Driver Integration
Test the preinstalled driver integration by creating test pods.
1. Create a network attachment definition. A network attachment definition is a custom resource that allows pods to connect to one or more networks. This network attachment definition defines a MACVLAN network that bridges multiple pods across a secondary interface. The Whereabouts CNI automates IP address assignments for pods connected to the secondary network.
7. The GPU Operator creates pods to validate the driver, container runtime, and Kubernetes device plug-in. Create an additional GPU test pod.
$ cat
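A GPU test pod of this kind can be described in a few lines. The sketch below builds a manifest in Python and prints it as JSON, which kubectl accepts alongside YAML. Treat it as an illustration under stated assumptions: the function name `cuda_test_pod` is hypothetical, the image tag `nvidia/samples:vectoradd-cuda11.2.1` and the `OnFailure` restart policy are assumptions rather than values taken from this post, and only the pod name `cuda-vectoradd` comes from the output shown in the next step. The `nvidia.com/gpu` resource is the name advertised by the NVIDIA Kubernetes device plug-in.

```python
import json


def cuda_test_pod(
    name="cuda-vectoradd",
    image="nvidia/samples:vectoradd-cuda11.2.1",  # assumed sample image
    gpus=1,
):
    """Build a pod manifest that requests GPUs through the device plug-in."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Assumption: do not restart the pod after the test exits cleanly.
            "restartPolicy": "OnFailure",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # The scheduler only places the pod on a node that
                    # advertises enough nvidia.com/gpu resources.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(cuda_test_pod(), indent=2))
```

Saved as a script (for example, a hypothetical `cuda_pod.py`), the output can be piped to the cluster with `python3 cuda_pod.py | kubectl create -f -`.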
8. View the results.
$ kubectl get pod cuda-vectoradd
NAME READY STATUS RESTARTS AGE
cuda-vectoradd 0/1 Completed 0 34s
$ kubectl logs cuda-vectoradd
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
9. Load the nvidia-peermem driver. It provides GPUDirect RDMA for ConnectX SmartNICs. This driver is included in NVIDIA Linux GPU driver version 470 and later. It is compiled automatically during Linux driver installation if both the ib_core and NVIDIA GPU driver sources are present on the system. This means the MOFED driver should be installed before the GPU driver so that the MOFED source is available to build the nvidia-peermem driver.
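The driver is typically loaded with `modprobe nvidia-peermem`. Whether it loaded can then be confirmed from `/proc/modules`, where the kernel lists the module with underscores instead of hyphens. The following is a minimal sketch of that check. The function name `module_loaded` is hypothetical.

```python
def module_loaded(proc_modules_text, module="nvidia_peermem"):
    """Return True if `module` appears in /proc/modules-style text.

    The kernel reports module names with underscores, so hyphens in the
    requested name are normalized before comparing.
    """
    wanted = module.replace("-", "_")
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # The first field of each /proc/modules line is the module name.
        if fields and fields[0] == wanted:
            return True
    return False


if __name__ == "__main__":
    with open("/proc/modules") as handle:
        print("nvidia-peermem loaded:", module_loaded(handle.read()))
```

If the check reports False after `modprobe`, the likely cause is the build-order issue described above: the GPU driver was installed before MOFED, so nvidia-peermem was never compiled.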
Part 2 of this series will be published on 11/22. It will describe how to integrate the NVIDIA GPU and Network Operators with custom driver containers.
I am a newbie to Keras and TensorFlow. Please, I need help converting the feature maps generated by a conv layer to NumPy to do some computation, and then converting them back to a tensor to be fed to the next layer in the model.
I believe this is easy for you. Here is a dummy code sample to show the problem:
import numpy as np
import tensorflow as tf

def convert_to_numpy(tensor):
    # .numpy() only works on eager tensors, not on symbolic Keras tensors
    feature_maps_arry = tensor.numpy()  # convert tensor to array
    grab_the_new_feature_maps = []  # to grab every feature map
    # assumes channels-last layout: one feature map per index on the last axis
    for i in range(feature_maps_arry.shape[-1]):
        single_fm = feature_maps_arry[..., i]
        max_value = np.max(single_fm)  # find the maximum pixel value in fm
        min_value = np.min(single_fm)  # find the minimum pixel value in fm
        ########### do the rest of computations ##########
        grab_the_new_feature_maps.append(single_fm)
    # stack on the last axis so the output keeps the input shape
    back_to_tensor = tf.convert_to_tensor(np.stack(grab_the_new_feature_maps, axis=-1))
    return back_to_tensor
Note that the custom layer should not create a new layer. It should take the received tensor, convert it to NumPy, do the computation, and then return a tensor with the updated feature maps to the model. My custom layer is as follows: