
Has anyone had any success using Fiverr to do a tensorflow model for image recognition?

submitted by /u/NickLRealtor


Set TensorFlow2 tensor to zero based on condition

I have a tensor in TensorFlow 2.5, and the goal is to zero out the p% of ‘slices’ along the last axis whose sums of magnitudes are smallest. The sample code to set this up is as follows:

import tensorflow as tf

# Create example tensor
input_shape = (1, 4, 4, 6)
y = tf.random.normal(input_shape)

The goal is to find and remove the smallest (say) p = 20% of the 4×4 ‘slices’ along the last axis (of size 6), ranked by the sum of their magnitudes. Since floor(20% of 6) = 1, after removal 5 of the 4×4 ‘slices’ should keep their non-zero values while the remaining slice should be all zeros.

import math

# Sum magnitude values
filter_sum = tf.math.reduce_sum(input_tensor=tf.math.abs(y), axis=[1, 2], keepdims=False)
filter_sum.shape   # TensorShape([1, 6])

indices = tf.argsort(filter_sum)
indices.shape      # TensorShape([1, 6])
indices.numpy()    # array([[0, 1, 4, 2, 5, 3]])

# 20% of 6 values = 1
math.floor(0.2 * 6)  # 1

In this example, the smallest 4×4 ‘slice’ in ‘y’ is the first slice as computed by ‘indices’. How do I proceed?
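
One way to proceed (an editorial sketch continuing the snippet above, not from the original post) is to turn the selected indices into a broadcastable keep-mask and multiply it into y:

import math
import tensorflow as tf

# Number of slices to zero out: floor(20% of 6) = 1
k = math.floor(0.2 * y.shape[-1])
smallest = indices[:, :k]  # indices of the k smallest slices, shape (1, k)

# Mask that is 1 for the slices to drop and 0 elsewhere, shape (1, 6)
drop = tf.reduce_max(tf.one_hot(smallest, depth=y.shape[-1]), axis=1)
keep = 1.0 - drop
# Broadcast the keep-mask over the 4x4 spatial dimensions
y_pruned = y * keep[:, tf.newaxis, tf.newaxis, :]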

Thanks

submitted by /u/grid_world


I have a trained handwritten character recognition model, is there an OCR I can feed it to?

Sorry if my question is misleading. I have built and trained a model which can determine which character a handwritten input image contains. I’d like to handle not just one character, but, say, a word or a whole text. Do I have to write my own code to segment the input image of a text/word into characters? If so, where should I begin? Or is there a working segmentation model I could combine with my trained model so that it would do the work?

Also, the trained model recognizes old Latin handwritten characters.

I’m really new to this topic so any help would be appreciated! Thank you!
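
One common starting point (an editorial sketch, not from the original post) is classical segmentation with OpenCV: binarize the image, find the contours, and crop each bounding box before feeding the crops to the character classifier. The file path is a placeholder, and real historical documents usually need extra preprocessing (deskewing, noise removal, merging broken strokes):

import cv2

# Split an image of handwriting into candidate character crops
img = cv2.imread('page.png', cv2.IMREAD_GRAYSCALE)  # placeholder path
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

# Sort bounding boxes left to right and crop each candidate character
boxes = sorted((cv2.boundingRect(c) for c in contours), key=lambda b: b[0])
chars = [img[y:y + h, x:x + w] for x, y, w, h in boxes]
# Each crop can then be resized to the model's input size and classified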

submitted by /u/punkw_


Cannot read property ‘children’ of undefined with using Tensorflow Object Detection API model in tfjs

System information
– Have I written custom code (as opposed to using a stock example script provided in TensorFlow.js): Yes
– OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10
– Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: N/A
– TensorFlow.js installed from (npm or script link): npm install @tensorflow/tfjs
– TensorFlow.js version (use command below): 3.6.0
– Browser version: Chrome 90 on Windows 10
– TensorFlow.js Converter version: 3.6.0

Hyperlinks are capitalized to prevent confusion (the color settings and blue filter on my monitor make it hard to distinguish them from regular text).

Describe the current behavior
I used transfer learning with a pretrained model from the TensorFlow Object Detection API, which I converted to TensorFlow.js using the converter API in Python. View the ipynb notebook HERE. I then followed @hugozanini’s REPO, which has a template for using TensorFlow object detection models in JavaScript; I copied the index.json file and put it into a sandbox, replacing his model.json file with mine. The repo which contains it can be found here
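
For context, the SavedModel-to-TF.js conversion step typically looks like the following Python sketch (the paths are placeholders, and the notebook’s exact converter flags are not reproduced here):

# Sketch only: convert a SavedModel to the TF.js graph-model format
import tensorflowjs as tfjs

tfjs.converters.convert_tf_saved_model(
    'exported_model/saved_model',  # placeholder: path to the exported SavedModel
    'web_model',                   # placeholder: output dir for model.json + weight shards
)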

Describe the expected behavior
I was expecting the model to work and the program to run normally, like it did in this DEMO. Instead, I got:

index.js:1437 TypeError: Cannot read property ‘children’ of undefined
    at operation_mapper.js:409
    at Array.forEach (<anonymous>)
    at operation_mapper.js:403
    at Array.forEach (<anonymous>)
    at OperationMapper.mapFunction (operation_mapper.js:401)
    at operation_mapper.js:163
    at Array.reduce (<anonymous>)
    at OperationMapper.transformGraph (operation_mapper.js:162)
    at GraphModel.loadSync (graph_model.js:159)
    at GraphModel._callee$ (graph_model.js:119)
    at tryCatch (runtime.js:62)
    at Generator.invoke [as _invoke] (runtime.js:288)
    at Generator.prototype.<computed> [as next] (runtime.js:114)
    at asyncGeneratorStep (asyncToGenerator.js:3)
    at _next (asyncToGenerator.js:25)

Standalone code to reproduce the issue

HERE is the Colab notebook I used to train my model.
HERE is the sandbox where I tried to use the TensorFlow.js model.

Other info / logs
The error message is identical to the stack trace above.

submitted by /u/dewball345


Making art from science using AI in TensorFlow.js

submitted by /u/mwarr5225

Researchers Harness GANs for Super-Resolution of Space Simulations

Carnegie Mellon University and University of California researchers developed a deep learning model that upgrades cosmological simulations from low to high resolution, allowing scientists to create a complex simulated universe within a day.

Astrophysics researchers have long faced a tradeoff when simulating space: simulations could be either high resolution or cover a large swath of the universe. With the help of generative adversarial networks, they can now accomplish both at once.


These simulations are critical for researchers to unravel mysteries around galaxy formation, dark matter and dark energy. 

“Cosmological simulations need to cover a large volume for cosmological studies, while also requiring high resolution to resolve the small-scale galaxy formation physics, which would incur daunting computational challenges,” said Yueying Ni, a Ph.D. candidate at Carnegie Mellon. “Our technique can be used as a powerful and promising tool to match those two requirements simultaneously by modeling the small-scale galaxy formation physics in large cosmological volumes.”

The team’s GAN model can take full-scale, low-resolution models and turn them into super-resolution simulations with up to 512 times as many particles. Though it was trained on data from only small areas of space, the model was able to replicate large-scale structures seen only in massive simulations. 

Published in PNAS, the journal of the National Academy of Sciences, the project used hundreds of NVIDIA RTX GPUs on the Texas Advanced Computing Center’s Frontera system.

Using deep learning, the researchers could upscale the low-res model on the left to the super-res model on the right, capturing the same detail as a conventional high-res model (center) while using far fewer computational resources. Image credit: Y. Li et al/PNAS 2021.

While existing methods would take over three weeks on a single processing core to create a detailed simulation of 134 million particles, the GPU-accelerated deep learning approach does it in just 36 minutes. And for simulations 1,000 times as large, the new method shrunk simulation time down from months on a dedicated supercomputer to 16 hours on a single GPU.

This acceleration can help scientists run more simulations to predict how the universe would look in different scenarios. 

“With our previous simulations, we showed that we could simulate the universe to discover new and interesting physics, but only at small or low-res scales,” said Rupert Croft, physics professor at Carnegie Mellon. “By incorporating machine learning, the technology is able to catch up with our ideas.”

Since the current neural networks focused on how gravity moves dark matter around over time, other phenomena such as supernovae and black holes were left out of the simulations. The team next plans to extend their methods to capture the forces responsible for these events. 

“The universe is the biggest data set there is,” said Scott Dodelson, head of the department of physics at Carnegie Mellon and director of the National Science Foundation Planning Institute for Artificial Intelligence in Physics. And “artificial intelligence is the key to understanding the universe and revealing new physics.” 

Read the full article in PNAS >> 

Read more >> 

Main image from TNG Simulations


Accelerating k-nearest Neighbors 600x Using RAPIDS cuML


This post was originally published on the RAPIDS AI Blog.

k-Nearest Neighbors classification is a straightforward machine learning technique that predicts an unknown observation by using the k most similar known observations in the training dataset. In the second row of the example pictured above, we find the seven digits 3, 3, 3, 3, 3, 5, 5 from the training data are most similar to the unknown digit. We then use a majority vote to predict that the unknown digit is a 3.

Distance computations

The k-nearest neighbors algorithm has no training time; all computation takes place during inference. To infer (predict) one unknown observation, we must compute how similar that unknown observation is to each of the known training observations. Mathematically, an observation is just a vector, so similarity is the distance between two vectors. Different distance formulas exist, but the most popular is Euclidean distance. Given two observations x_1 ∈ R^p and x_2 ∈ R^p, the formula is:

dist(x_1, x_2) = √((x_1 − x_2) ∙ (x_1 − x_2))

Thus, if we have p features, computing the distance between two observations requires p multiplies, p subtractions, p additions, and one square root. Since we must compare all the test observations with all the training observations, the total number of computations (ignoring square roots) is 3 * p * len(train) * len(test). The time difference between RAPIDS cuML and Scikit-learn comes down to how quickly each can perform these trillions of computations.
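
As a concrete illustration of this computation, here is a minimal NumPy sketch with made-up sizes (not part of the original post):

import numpy as np

# Pairwise Euclidean distances between all test and train rows
train = np.random.rand(1000, 784).astype(np.float32)
test = np.random.rand(100, 784).astype(np.float32)

# For each (test, train) pair: p subtractions, p multiplies, p additions, one sqrt
diff = test[:, None, :] - train[None, :, :]   # shape (100, 1000, 784)
dist = np.sqrt((diff * diff).sum(axis=-1))    # shape (100, 1000)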

Speed test on Kaggle

Besides hosting competitions, datasets, discussions, and Jupyter notebooks, Kaggle offers free GPU cloud compute. Its GPU virtual machines have one NVIDIA Tesla P100 GPU and one Intel Xeon 2-core CPU. Using the famous MNIST digit image dataset, we compare how long RAPIDS cuML’s kNN takes to predict all the unknown digits in Kaggle’s MNIST digit-recognizer competition against Scikit-learn’s kNN. At Kaggle, we can install RAPIDS with the following code; change rapids.0.19.0 to the version you desire.

import sys
!cp ../input/rapids/rapids.0.19.0 /opt/conda/envs/rapids.tar.gz
!cd /opt/conda/envs/ && tar -xzvf rapids.tar.gz > /dev/null
sys.path = ["/opt/conda/envs/rapids/lib/python3.7/site-packages"] + sys.path
sys.path = ["/opt/conda/envs/rapids/lib/python3.7"] + sys.path
sys.path = ["/opt/conda/envs/rapids/lib"] + sys.path
!cp /opt/conda/envs/rapids/lib/libxgboost.so /opt/conda/lib/

Then we load the training and test data and predict using RAPIDS cuML:

# RAPIDS cuML kNN model
import cudf, cuml
from cuml.neighbors import KNeighborsClassifier as cuKNeighbors
train = cudf.read_csv('../input/digit-recognizer/train.csv')
test = cudf.read_csv('../input/digit-recognizer/test.csv')
model = cuKNeighbors(n_neighbors=7)
model.fit(train.iloc[:,1:785], train.iloc[:,0])
y_hat = model.predict(test)

The line of code model.predict(test) does all the work. It finishes in an incredibly fast 2.5 seconds on GPU. Next, we predict with Scikit-learn:

# Scikit-learn kNN model
import pandas
from sklearn.neighbors import KNeighborsClassifier as skKNeighbors
train = pandas.read_csv('../input/digit-recognizer/train.csv')
test = pandas.read_csv('../input/digit-recognizer/test.csv')
model = skKNeighbors(n_neighbors=7)
model.fit(train.iloc[:,1:785], train.iloc[:,0])
y_hat = model.predict(test)

On CPU with Scikit-learn, the function call model.predict(test) takes 25 minutes. We witness RAPIDS cuML performing 600 times faster!

On Kaggle, the training dataset has length 42,000 and the test dataset has length 28,000. The dimension p of the images is 784. Therefore to predict all test images, we must perform 2.8 trillion computations! (3 * 784 * 42000 * 28000 = 2.8e12). The Intel Xeon 2-core CPU has clock speed 2GHz and 2 cores, so we expect it to take approximately 690 seconds. (2.8e12 / 2e9 / 2 = 690). The Nvidia Tesla P100 has clock speed 1GHz and 3500 CUDA cores, so we expect it to take 0.80 seconds. (2.8e12 / 1e9 / 3500 = 0.80). Therefore we expected approximately a 600x speedup.

Do more in less time

Since RAPIDS cuML is so fast, we now have an opportunity to explore improving our model’s accuracy by performing additional tasks. Below are some ideas:

  • Hyperparameter Search
  • Feature Engineering and Selection
  • Data Augmentation
  • Ensemble with Bagging and Boosting

In this blog, we will not explore feature engineering and selection, or bagging and boosting, but we will explore hyperparameter search and data augmentation below.

Hyperparameter search

The following code demonstrates a cross-validation hyperparameter search. If we did this with Scikit-learn’s kNN, it would literally take a few days! But with RAPIDS cuML kNN, it only takes a few minutes! Also, note how easy it is to mix RAPIDS cuML with Scikit-learn’s KFold.

import numpy as np
from sklearn.model_selection import KFold

# Try odd k from 3 to 21 with 5-fold cross-validation
for k in range(3, 22, 2):
    oof = np.zeros(len(train))
    skf = KFold(n_splits=5, shuffle=True, random_state=42)
    for i, (idxT, idxV) in enumerate(skf.split(train.iloc[:, 1:], train.label)):
        model = cuKNeighbors(n_neighbors=k)
        model.fit(train.iloc[idxT, 1:], train.label[idxT])
        y_hat = model.predict(train.iloc[idxV, 1:])
        oof[idxV] = y_hat[0].to_array()
    acc = (oof == train.label.to_array()).sum() / len(train)
    print('k =', k, 'has ACC =', acc)

Validation shows that the best value for the kNN parameter k is k=3. Using this parameter, we can predict all of the Kaggle competition’s unknown test images and submit for an accuracy of 96.9%. That’s great accuracy for 2.5 seconds of work!

The Kaggle notebook with full code to achieve 96.9% is here: https://www.kaggle.com/cdeotte/rapids-gpu-knn-mnist-0-97
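
For reference, writing the submission file can look like the following sketch; it assumes y_hat is a cuDF Series of predicted labels and that the competition expects ImageId/Label columns:

import pandas as pd

# y_hat is assumed to be a cuDF Series of predicted labels
sub = pd.DataFrame({'ImageId': range(1, len(y_hat) + 1),
                    'Label': y_hat.to_array().astype(int)})
sub.to_csv('submission.csv', index=False)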

Data augmentation

Another way to improve the accuracy of a model is to provide it with more training data. A digit is still the same digit if you shift, rotate, or scale the image, so we can create more training data by randomly shifting, rotating, and scaling the given training data. Kaggle provides 42,000 training images; let’s create 2 million more. If we added this many new training images to Scikit-learn’s kNN model, predicting the test images would take days. With RAPIDS cuML, it only adds about a minute of additional inference time!

 
import numpy as np
from keras.preprocessing.image import ImageDataGenerator

# Randomly rotate, zoom, and shift the training images.
# Note: train is assumed here to be a NumPy array with the label in
# column 0 and the 784 pixel values in columns 1-784.
datagen = ImageDataGenerator(rotation_range=10, zoom_range=0.10,
                             width_shift_range=0.1, height_shift_range=0.1)
da = 50    # augmentation factor: 50x the original training data
bs = 4200  # batch size

train2 = np.zeros((train.shape[0] * da, train.shape[1]), dtype=np.float32)
for k, (X, Y) in enumerate(datagen.flow(
        train[:, 1:].reshape((-1, 28, 28, 1)),
        train[:, 0].reshape((-1, 1)), batch_size=bs)):
    train2[bs * k:bs * (k + 1), 1:] = X.reshape((-1, 784))
    train2[bs * k:bs * (k + 1), 0] = Y.reshape((-1))
    if k == train2.shape[0] // bs - 1:
        break

Next, we build our RAPIDS cuML model with this new data.

model.fit(train2[:, 1:785], train2[:, 0])
y_hat = model.predict(test)

RAPIDS cuML executes the call model.predict(test) in an incredible 14.2 seconds. There are 2 million rows in train2, therefore model.predict(test) was able to compute 131.7 trillion multiplies, subtractions, and additions in 14.2 seconds. Absolutely incredible! (3 * 2e6 * 28000 * 784 = 131.7e12). By doing more in less time, RAPIDS cuML achieved a higher accuracy of 98.5%. Woohoo!

The Kaggle notebook with full code to achieve 98.5% accuracy is here: https://www.kaggle.com/cdeotte/rapids-data-augmentation-mnist-0-985


Putting in a Good Word: GPU-Powered Crossword Solver Makes Best Showing Yet Against Humans

What’s a three-letter acronym for a “video-handling chip”? A GPU, of course. Who knew, though, that these parallel processing powerhouses could have a way with words, too? Following a long string of victories for computers in other games — chess in 1997, go in 2016 and Texas hold’em poker in 2019 — a GPU-powered AI …

Read article >

The post Putting in a Good Word: GPU-Powered Crossword Solver Makes Best Showing Yet Against Humans appeared first on The Official NVIDIA Blog.


Advice for Newbie

This may be treated as spam because there have been a lot of questions like this, but I will try anyway 😉

I was wondering what advice you have for someone new. How would you start, for example, with image recognition to make the simplest app?
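
One common starting point (an editorial sketch based on the standard TensorFlow beginner tutorial, not from the original post) is a small dense network on MNIST:

import tensorflow as tf

# Load and normalize the MNIST digit images
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# A minimal classifier: flatten, one hidden layer, softmax output
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test)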

submitted by /u/protongravity


Sharing weights between tf versions

Are there any limitations to sharing model weights between two different TensorFlow versions? I’m not expecting there to be; I just wanted to check there isn’t something hidden I don’t know about.

For example: train a model in TensorFlow version A and save the weights.

Then recreate the model in version B and load the weights.
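
In Keras terms, that workflow looks roughly like the sketch below; it assumes the recreated architecture matches the original layer for layer (the model definition here is a made-up example):

import tensorflow as tf

def build_model():
    # The same architecture must be defined identically in both TF versions
    return tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu', input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])

# In TF version A:
model_a = build_model()
model_a.save_weights('weights.h5')  # HDF5; the TF checkpoint format also works

# In TF version B:
model_b = build_model()
model_b.load_weights('weights.h5')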

submitted by /u/trickpony1357