Categories
Misc

Streamlining Kubernetes Networking in Scale-out GPU Clusters with the new NVIDIA Network Operator 1.0

NVIDIA EGX contains the NVIDIA GPU Operator and the new NVIDIA Network Operator 1.0 to standardize and automate the deployment of all the necessary components for provisioning Kubernetes clusters. Now released for use in production environments, the NVIDIA Network Operator makes Kubernetes networking simple and effortless for bringing AI to the enterprise.

The growing prevalence of GPU-accelerated computing in the cloud, enterprise, and at the edge increasingly relies on robust and powerful network infrastructures. NVIDIA ConnectX SmartNICs and NVIDIA BlueField DPUs provide high-throughput, low-latency connectivity that enables the scaling of GPU resources across a fleet of nodes. To address the demand for cloud-native AI workloads, NVIDIA delivers the GPU Operator, aimed at simplifying scale-out GPU deployment and management on Kubernetes.

Today, NVIDIA announced the 1.0 release of the NVIDIA Network Operator. An analog to the NVIDIA GPU Operator, the Network Operator simplifies scale-out network design for Kubernetes by automating aspects of network deployment and configuration that would otherwise require manual work. It loads the required drivers, libraries, device plugins, and CNIs on any cluster node with an NVIDIA network interface. 

Paired with the GPU Operator, the Network Operator enables GPUDirect RDMA, a key technology that accelerates cloud-native AI workloads by orders of magnitude. The technology provides an efficient, zero-copy data transfer between NVIDIA GPUs while leveraging the hardware engines in the SmartNICs and DPUs. Figure 1 shows GPUDirect RDMA technology between two GPU nodes. The GPU on Node 1 directly communicates with the GPU on Node 2 over the network, bypassing the CPU devices.

Two GPU nodes communicate over the network using GPUDirect RDMA technology, which allows the GPU on Node 1 to read and write data from and to the GPU memory of Node 2 while bypassing the CPU devices.
Figure 1. GPUDirect RDMA technology between two GPU nodes

Now available on NGC and GitHub, the NVIDIA Network Operator uses Kubernetes custom resource definitions (CRDs) and the Operator framework to provision the host software needed for enabling accelerated networking. This post discusses what’s inside the Network Operator, including its features and capabilities.

Kubernetes networking that’s easy to deploy and operate

The Network Operator is geared towards making Kubernetes networking simple and effortless. It’s an open-source software project under the Apache 2.0 license. The 1.0 release was validated for Kubernetes running on bare-metal server infrastructure and in Linux virtualization environments. Here are the key features of the 1.0 release:

  • Automated deployment of host software components in a bare-metal Kubernetes environment for enabling the following:
    • macvlan secondary networks
    • SR-IOV secondary networks (VF assigned to pod)
    • Host device secondary networks (PF assigned to pod)
    • GPUDirect RoCE (with the NVIDIA GPU Operator)
  • Automated deployment of host software components in a nested Kubernetes environment (Kubernetes Pods running in Linux VMs) for creating the following:
    • SR-IOV secondary networks (multiple VFs assigned to the VM and passed through to different pods)
    • Host device secondary networks (PF assigned to Pod)
    • GPUDirect RoCE (with the NVIDIA GPU Operator)
  • Platform support: 
    • Kubernetes v1.17 or later
    • Container runtime: Containerd
    • Bare-metal host OS/Linux guest OS: Ubuntu 20.04
    • Linux KVM virtualization
  • Helm chart installation (see the sketch after this list)
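
A minimal installation sketch follows, assuming the chart is fetched from the Helm repository given in the Network Operator documentation; the value names mirror the chart’s documented layout but should be checked against the release you deploy:

    # values.yaml -- illustrative subset of chart values
    ofedDriver:
      deploy: true              # precompiled NVIDIA OFED driver container
    nvPeerDriver:
      deploy: true              # NVIDIA peer memory driver for GPUDirect RDMA
    rdmaSharedDevicePlugin:
      deploy: true              # advertise shared RDMA devices to kubelet
    secondaryNetwork:
      deploy: true
      multus:
        deploy: true            # Multus meta-plugin for secondary networks
      cniPlugins:
        deploy: true            # macvlan, host-device, and other reference CNIs
      ipamPlugin:
        deploy: true            # Whereabouts IPAM
    # Install (shell), with <repo> standing in for the documented Helm repository:
    #   helm install network-operator <repo>/network-operator -n network-operator \
    #     --create-namespace -f values.yaml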

While GPU-enabled nodes are a primary use case, the Network Operator is also useful for enabling accelerated Kubernetes network environments that are independent of NVIDIA GPUs. Some examples include setting up SR-IOV networking and DPDK for accelerating telco NFV applications, establishing RDMA connectivity for fast access to NVMe storage, and more.

Inside the NVIDIA Network Operator

The Network Operator is designed from the ground up as a Kubernetes Operator that makes use of several custom resources for adding accelerated networking capabilities to a node. The 1.0 version supports several networking models suited to different Kubernetes networking environments and application requirements. Today, the Network Operator configures RoCE only for secondary networks, which means the primary pod network remains untouched. Future work may enable configuring RoCE for the primary network.
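
As an illustration of how the operator is driven, its cluster-wide configuration is expressed through a custom resource (NicClusterPolicy in the project’s published examples) that declares which host components to deploy. The sketch below follows that layout; registry and version strings are placeholders, and the field names should be verified against the release you run:

    apiVersion: mellanox.com/v1alpha1
    kind: NicClusterPolicy
    metadata:
      name: nic-cluster-policy
    spec:
      ofedDriver:                          # precompiled NVIDIA OFED driver container
        image: mofed
        repository: <registry>             # placeholder
        version: <driver-version>          # placeholder
      nvPeerDriver:                        # peer memory client for GPUDirect RDMA
        image: nv-peer-mem-driver
        repository: <registry>
        version: <driver-version>
      rdmaSharedDevicePlugin:              # shared-mode RDMA device plugin
        image: k8s-rdma-shared-dev-plugin
        repository: <registry>
        version: <plugin-version>
        config: |
          {
            "configList": [{
              "resourceName": "rdma_shared_device_a",
              "rdmaHcaMax": 63,
              "selectors": { "vendors": ["15b3"] }
            }]
          }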

The following sections describe the different components that are packaged and used by the Network Operator.

Node Feature Discovery

Node Feature Discovery (NFD) is a Kubernetes add-on for detecting hardware features and system configuration. The Network Operator uses NFD to detect nodes equipped with NVIDIA SmartNICs and GPUs and to label them as such. Based on those labels, the Network Operator schedules the appropriate software resources.
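
For example, the operator’s daemonsets can be scheduled with node selectors that key on these labels; a plausible selector (the exact label keys depend on the NFD rules shipped with the operators) looks like this:

    nodeSelector:
      feature.node.kubernetes.io/pci-15b3.present: "true"   # NVIDIA (Mellanox) NIC, PCI vendor ID 15b3
      nvidia.com/gpu.present: "true"                        # label set on nodes where a GPU is detected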

Multus CNI

The Multus CNI is a container network interface (CNI) plugin for Kubernetes that enables attaching multiple network interfaces to pods. Normally, each Kubernetes pod has only one network interface. With Multus, you can create a multihomed pod with several interfaces. Multus acts as a meta-plugin: a CNI plugin that calls other CNI plugins. The NVIDIA Network Operator installs Multus to attach to a container pod the secondary networks used for high-speed GPU-to-GPU communication.
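
A pod requests additional interfaces through the standard Multus annotation; a minimal sketch, assuming a secondary network named roce-network has already been defined in the same namespace, is:

    apiVersion: v1
    kind: Pod
    metadata:
      name: multi-homed-pod
      annotations:
        k8s.v1.cni.cncf.io/networks: roce-network   # secondary interface attached by Multus
    spec:
      containers:
      - name: app
        image: <workload-image>                     # placeholder image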

NVIDIA OFED driver

The NVIDIA OpenFabrics Enterprise Distribution (OFED) networking libraries and drivers are packaged and tested by the NVIDIA networking team. NVIDIA OFED supports Remote Direct Memory Access (RDMA) over both InfiniBand and Ethernet interconnects. The Network Operator deploys a precompiled NVIDIA OFED driver container onto each Kubernetes host using node labels. The container loads and unloads the NVIDIA OFED drivers when it is started or stopped.

NVIDIA peer memory driver

The NVIDIA peer memory driver is a client that interacts with the network drivers to enable direct RDMA data transfers to and from GPU memory (GPUDirect RDMA). The Network Operator installs the NVIDIA peer memory driver on nodes that have both a ConnectX adapter and an NVIDIA GPU. This driver is also loaded and unloaded automatically when the container is started and stopped.

RDMA shared device plugin

The Kubernetes device plugin framework advertises system hardware resources to the Kubelet agent running on a Kubernetes node. The Network Operator deploys the RDMA shared device plugin that advertises RDMA resources to Kubelet and exposes RDMA devices to Pods running on the node. It allows the Pods to perform RDMA operations. All Pods running on the node share access to the same RDMA device files.
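
A pod gains access to the shared RDMA device files by requesting the advertised resource; a sketch, assuming the plugin is configured to expose the resource as rdma/rdma_shared_device_a (the name is set in the plugin’s configuration), is:

    apiVersion: v1
    kind: Pod
    metadata:
      name: rdma-shared-pod
      annotations:
        k8s.v1.cni.cncf.io/networks: roce-network   # secondary network for RDMA traffic
    spec:
      containers:
      - name: app
        image: <rdma-capable-image>                 # placeholder image
        resources:
          limits:
            rdma/rdma_shared_device_a: 1            # RDMA resource advertised by the device plugin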

Container networking CNI plugins

The macvlan CNI and host-device CNI are generic container networking plugins hosted under the CNI project. The macvlan CNI creates a virtual sub-interface with its own MAC address on a host parent interface and hands it to the container. The host-device CNI moves an already-existing device into a container. The Network Operator uses these CNI plugins for creating macvlan networks and for assigning NIC physical functions to a container or virtual machine, respectively.

SR-IOV device plugin and CNI

SR-IOV is a technology that provides a direct interface between the virtual machine or container pod and the NIC hardware. It bypasses the host CPU and OS, frees up expensive CPU resources from I/O tasks, and greatly accelerates connectivity. The SR-IOV device plugin advertises the SR-IOV virtual functions (VFs) available on a Kubernetes node, and the SR-IOV CNI plugin attaches them to pods. Both are required by the Network Operator for creating and assigning SR-IOV VFs to the secondary networks on which GPU-to-GPU communication is handled.

SR-IOV Operator

The SR-IOV Operator helps provision and configure the SR-IOV device plugin and SR-IOV CNI plugin in the cluster. The Network Operator uses it to deploy and manage SR-IOV in the Kubernetes cluster.

Whereabouts CNI

The Whereabouts CNI is an IP address management (IPAM) CNI plugin that can assign IP addresses in a Kubernetes cluster. The Network Operator uses this CNI to assign IP addresses for secondary networks that carry GPU-to-GPU communication.

Better together: NVIDIA accelerated compute and networking

Figure 2 shows how the Network Operator works in tandem with the GPU Operator to deploy and manage the host networking software.

The Network Operator and GPU Operator are installed side by side on a Kubernetes node, powered by the NVIDIA EGX software stack and an NVIDIA-certified server hardware platform.
Figure 2. The Network Operator is installed alongside the NVIDIA GPU Operator to automate GPUDirect RDMA configuration on the EGX stack

The following sections describe the supported networking models, and corresponding host software components.

RoCE shared mode

In shared mode, a single RDMA device is shared among several container pods on the node. This networking model is optimized for enterprise and edge environments that require high-performance networking without multitenancy. The Network Operator installs the following software components:

  • Multus CNI
  • RoCE shared mode device plugin
  • macvlan CNI
  • Whereabouts IPAM CNI

The Network Operator also installs the NVIDIA OFED Driver and NVIDIA Peer Memory on GPU nodes.
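
The secondary network itself can be described with the operator’s MacvlanNetwork custom resource, which renders the macvlan CNI and Whereabouts IPAM configuration behind the scenes; a sketch with placeholder interface and address values (check the CRD fields against your release) follows:

    apiVersion: mellanox.com/v1alpha1
    kind: MacvlanNetwork
    metadata:
      name: roce-network
    spec:
      networkNamespace: default
      master: ens2f0              # placeholder: host RoCE uplink interface
      mode: bridge
      mtu: 1500
      # Whereabouts IPAM configuration (placeholder address range)
      ipam: |
        {
          "type": "whereabouts",
          "range": "192.168.100.0/24",
          "gateway": "192.168.100.1"
        }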

SR-IOV, RoCE, and DPDK networking

As mentioned earlier, SR-IOV is an acceleration technology that provides direct access to the NIC hardware. This networking model is optimized for multitenant Kubernetes environments, running on bare-metal. The Network Operator installs the following software components: 

  • Multus CNI
  • SR-IOV device plugin
  • SR-IOV CNI
  • Whereabouts IPAM CNI 

The Network Operator also installs the NVIDIA OFED Driver and NVIDIA Peer Memory on GPU nodes.
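
In this model, a pod requests a virtual function by resource name; a sketch, assuming the SR-IOV device plugin advertises the VFs as nvidia.com/sriov_rdma and a matching secondary network named sriov-network exists (both names are illustrative), is:

    apiVersion: v1
    kind: Pod
    metadata:
      name: sriov-rdma-pod
      annotations:
        k8s.v1.cni.cncf.io/networks: sriov-network  # SR-IOV secondary network (illustrative name)
    spec:
      containers:
      - name: app
        image: <dpdk-or-rdma-image>                 # placeholder image
        resources:
          limits:
            nvidia.com/sriov_rdma: 1                # VF resource advertised by the SR-IOV device plugin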

NIC PF passthrough

This networking model is suited for extremely demanding applications. The Network Operator can assign a NIC physical function (PF) to a pod, giving that pod exclusive use of the entire NIC. The Network Operator installs the following host software components:

  • Multus CNI
  • SR-IOV device plugin
  • Host-Dev CNI 
  • Whereabouts IPAM CNI 

The Network Operator also installs the NVIDIA OFED Driver and NVIDIA Peer Memory on GPU nodes.

Streamlining Kubernetes networking for scale-out GPU clusters

The NVIDIA GPU and Network Operators are both part of the NVIDIA EGX Enterprise platform, which allows GPU-accelerated computing to work alongside traditional enterprise applications on the same IT infrastructure. Taken together, the operators make the NVIDIA GPU a first-class citizen in Kubernetes. Now released for use in production environments, the Network Operator streamlines Kubernetes networking, bringing the simplicity and scalability needed for scale-out training and edge inferencing in the enterprise.

For more information, see the Network Operator documentation. You can also download the Network Operator from NGC to see it in action, and join the developer community in the network-operator GitHub repo.

Categories
Misc

NVIDIA and Palo Alto Networks Boost Cyber Defenses with DPU Acceleration

Cybercrime cost the American public more than $4 billion in reported losses over the course of 2020, according to the FBI. To stay ahead of emerging threats, Palo Alto Networks, a global cybersecurity leader, has developed the first virtual next-generation firewall (NGFW) designed to be accelerated by NVIDIA’s BlueField data processing unit (DPU).


Categories
Misc

Accelerate Academic Research and Curriculum with the NVIDIA Hardware Grant Program

The NVIDIA Hardware Grant Program helps advance AI and data science by partnering with academic institutions around the world to enable researchers and educators with industry-leading hardware and software.

Applicants can request compute support from a large portfolio of NVIDIA products. Awardees of this highly selective program will receive a hardware donation to use in their teaching or research.

The hardware granted to qualified applicants could include NVIDIA RTX workstation GPUs powered by NVIDIA Ampere Architecture, NVIDIA BlueField Data Processing Units (DPUs), Remote V100 instances in the cloud with prebuilt container images, NVIDIA Jetson developer kits, and more. Alternatively, certain projects may be awarded with cloud compute credits instead of physical hardware.

Please note: NVIDIA RTX 30 Series GPUs are not available through the Academic Hardware Grant Program.

The current application submission window will begin on July 12 and close on July 23, 2021. The next submission window will open in early 2022.


Categories
Misc

Best Tensorflow tutorial for this year?

Hi, I am looking at Tensorflow tutorials and would like some opinions on the best free Tensorflow course. I have looked at these tutorials so far:

freeCodeCamp Tensorflow complete tutorial https://youtu.be/tPYj3fFJGjk

Udacity Tensorflow tutorial https://www.udacity.com/course/intro-to-tensorflow-for-deep-learning--ud187

However, these are 1-2 years old. Do they still hold up and which of the two should I pick?

Feel free to suggest other tutorials too!

submitted by /u/Zysora

Categories
Misc

While running the tensorflow Object detection tutorial I am getting this error, everything else before that was running fine. Could someone help me with this please?

submitted by /u/Lazy_Acadia5970

Categories
Misc

How to create several layers of the same type that can be iterated over in graph mode?

Hey all!

I’m trying to implement this Transformer article. In the source code they pass a number (n_heads) representing the number of attention heads that should be created, which is used when building the model to create that number of attention heads and save them to a list. Later, when the model is called, the attention heads are iterated over as follows: attn = [self.attn_heads[i](inputs) for i in range(self.n_heads)]. When running this code in graph mode the following error is thrown: OperatorNotAllowedInGraphError: iterating over “tf.Tensor” is not allowed: AutoGraph did convert this function.

How should I go about creating an arbitrary number of the same layer in such a way that it can be iterated over in graph mode?

submitted by /u/EdvardDashD

Categories
Misc

Is there an age requirement for the tensorflow developer certificate exam?

submitted by /u/Sad_Combination9971

Categories
Misc

Does TensorFlow or CUDA need to be reinstalled for new GPU?

TensorFlow was working fine when I only had a 1660ti, but I recently installed a 3070 and the 3070 works fine for other stuff, but not TensorFlow.

It will take a good minute to just do – tf.test.gpu_device_name()

    import os
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"   # select the new card (device 0)
    import tensorflow as tf
    tf.test.gpu_device_name()

If CUDA_VISIBLE_DEVICES is changed to “1” (my old card) it will print the device name instantly.

Everything else TensorFlow related will also take super long on the new card.

Is there some box I have to check in the CUDA installation files for the new card to work, or do I need to specially install something to find the new card?

I’m using Jupyter Notebook.

Any help would be appreciated.

submitted by /u/KingGeorge12321

Categories
Misc

Is there really no way to let TensorFlow work on AMD GPU on Mac?

I was working on an independent project, and our school only has Macs with AMD GPUs (we have an NVIDIA GPU PC, but it is only a GTX 1030).

submitted by /u/Striking-Warning9533

Categories
Misc

[Looking for Teammates] Object Detection Challenge with $50k CP 🌵

Hey everyone 👋,

I have been taking part solo in an ML challenge by AIcrowd.com which has a cash prize pool of $50,000 🤑

The Machine Learning challenge is for Object Detection enthusiasts, hosted by Amazon Prime Air, called Airborne Object Tracking

The challenge revolves around predicting the future motion of flying airborne objects to avoid collision. Also the dataset they have is pretty lit (11TB in size!! 😲), one of the largest collections of flight sequences from aerial vehicles so you might want to look into that.

I am looking for someone to team up in this challenge with. Let me know in the comments if anyone is up!

submitted by /u/desiMLguy