The NVIDIA UFM Cyber-AI platform helps to minimize downtime in InfiniBand data centers by harnessing AI-powered analytics to detect security threats and operational issues, as well as predict network failures. This post outlines the advanced features that system administrators can use to quickly detect and respond to potential security threats and upcoming failures, saving costs and ensuring consistent customer service.
Today’s data centers host many users and a wide variety of applications. They have even become the key element of competitive advantage for research, technology, and global industries. With the increased complexity of scientific computing, data center operational costs also continue to rise. Security threats add the risk of operational disruption, which makes keeping a data center intact and running smoothly critical.
What’s more, malicious users may exploit data center access to misuse compute resources by running prohibited applications, resulting in unexpected downtime and higher operating costs. More than ever, data center management tools that quickly identify issues while improving efficiency are a priority for today’s IT managers and the developers who support them.
NVIDIA may be best known for stunning graphics capabilities and unmatched GPU compute performance used in nearly every area of research. However, for many years, it has also been the leader in secure and scalable data center technologies, including flexible libraries and tools to maximize world-class infrastructures.
NVIDIA recognizes that providing a full-stack solution for what might be the most critical component of today’s research and business includes more than world-class server platforms, GPUs, and the broadest software portfolio deployed throughout the data center. NVIDIA also knows that security and manageability are key pillars on which data center infrastructure is built.
NVIDIA UFM Cyber-AI revolutionizes the InfiniBand data center
The NVIDIA Unified Fabric Manager (UFM) Cyber-AI platform offers enhanced and real-time network telemetry, combined with AI-powered intelligence and advanced analytics. It enables IT managers to discover operational anomalies and even predict network failures. This improves both security and data center uptime while decreasing overall operating expenses.
The unique advantage of UFM Cyber-AI is its ability to capture rich telemetry information and employ AI techniques to identify hidden correlations between events. This enables it to detect abnormal system and application behavior, and even identify performance degradations before they lead to component or system failure. UFM Cyber-AI can even take corrective actions in real time. The platform learns the typical operational modes of the data center and detects abnormal use based on network telemetry data, including traffic patterns, temperature, and more.
Fundamentals of UFM Cyber-AI
UFM Cyber-AI contains three different layers, as shown in Figure 1.
- Input telemetry: Collects information and learns from the network in various ways:
  - Telemetry of all elements in the network
  - Network topology (connectivity and resource allocation for tenants or applications)
  - Features and capabilities of network equipment
- Processing models: Contains several models, such as an extraction, transformation, and loading (ETL) processing engine for data preparation. It also contains aggregation, data storage, and analytical models for comparison. UFM Cyber-AI uses machine learning (ML) techniques and AI models for anomaly detection and prediction to learn the lifecycle patterns of data center network components (cable, switch, port, InfiniBand adapter).
- Output dashboard: A visualization layer that exposes a central dashboard for network administrators and cloud orchestrators to see alerts and recommendations for improving network utilization and efficiency, and solving network health issues. The dashboard offers two main categories: Suspicious Behavior and Link Analysis, each including sections for alerts and predictions (Figure 2).
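To make the data-preparation step of the processing layer more concrete, here is a minimal sketch of one common ETL transformation: turning cumulative port counters into per-interval rates. It is not the UFM Cyber-AI implementation. The `Sample` schema, the `to_rates` function, and the port naming are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One raw telemetry reading (hypothetical schema, not a UFM format)."""
    port: str       # illustrative identifier such as "sw1/1"
    timestamp: int  # seconds since epoch
    rx_bytes: int   # cumulative received-bytes counter


def to_rates(samples):
    """Transform cumulative counters into per-port receive rates in bytes/s.

    Returns {port: [(timestamp, rate), ...]} ordered by timestamp.
    """
    # Extract: group raw samples by port.
    by_port = defaultdict(list)
    for sample in samples:
        by_port[sample.port].append(sample)

    # Transform: difference consecutive readings into rates.
    rates = defaultdict(list)
    for port, rows in by_port.items():
        rows.sort(key=lambda s: s.timestamp)
        for prev, cur in zip(rows, rows[1:]):
            dt = cur.timestamp - prev.timestamp
            delta = cur.rx_bytes - prev.rx_bytes
            if dt <= 0 or delta < 0:
                continue  # skip duplicate timestamps and counter resets
            rates[port].append((cur.timestamp, delta / dt))

    # Load: hand back a plain dict for downstream aggregation.
    return dict(rates)
```

Working with rates rather than raw counters means a counter reset shows up as one skipped interval instead of a large negative spike in the downstream models.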
A feature-rich, intuitive, and customizable fabric manager
UFM Cyber-AI also supports customizable network alerts and lets you view triggered anomalies over time and across different time dimensions. By using aggregated network statistics based on hour or day-of-the-week parameters, you can set thresholds and configure notifications based on measurements that might deviate from typical operational use. For example, you could use predefined thresholds to identify problematic cables.
Built-in analytics compares current telemetry information against time-based aggregated information to detect any suspicious increase or decrease in use or traffic patterns and immediately notify the system administrator. UFM Cyber-AI also provides data center tenant or application alerts through link or port telemetry information to identify low-level partition key (PKEY) associated statistics along with their associated nodes.
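The comparison against hour and day-of-the-week aggregates can be sketched as follows. This is a minimal illustration that assumes a simple mean and standard deviation per hour-of-week bucket. The post does not specify which statistics UFM Cyber-AI uses, and the function names and the `k` parameter are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone
from statistics import mean, pstdev


def hour_of_week(ts):
    """Map a Unix timestamp to a bucket 0..167 (Monday 00:00 UTC is 0)."""
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    return dt.weekday() * 24 + dt.hour


def build_baseline(history):
    """Aggregate (timestamp, value) pairs into {bucket: (mean, std)}."""
    buckets = defaultdict(list)
    for ts, value in history:
        buckets[hour_of_week(ts)].append(value)
    return {b: (mean(v), pstdev(v)) for b, v in buckets.items()}


def is_anomalous(baseline, ts, value, k=3.0, min_std=1e-9):
    """Flag a value more than k standard deviations from its bucket mean.

    k is an assumed, user-tunable threshold, not a product default.
    """
    stats = baseline.get(hour_of_week(ts))
    if stats is None:
        return False  # no history for this bucket, so nothing to compare
    mu, sigma = stats
    return abs(value - mu) > k * max(sigma, min_std)
```

Bucketing by hour of the week keeps a busy Monday morning from being judged against a quiet Sunday night, which is the point of using time-based aggregates instead of a single global threshold.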
Only UFM Cyber-AI offers features like link failure prediction, which supports predictive maintenance. By detecting performance degradation cases in the early stages, UFM Cyber-AI can predict potential link or port failures. This enables administrators to perform maintenance and eliminate data center downtime.
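As a rough illustration of catching degradation early, the sketch below fits a straight line to a link's recent error-rate samples and estimates when the trend would cross a failure threshold. This is plain linear extrapolation used as a stand-in. It is not the ML model that UFM Cyber-AI uses, and the function names, units, and threshold are assumptions.

```python
def fit_line(points):
    """Ordinary least-squares fit over (x, y) pairs; returns (slope, intercept)."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("all samples share the same timestamp")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def hours_until_threshold(points, threshold):
    """Estimate hours from the last sample until the trend reaches threshold.

    points: list of (hour, error_rate) pairs (hypothetical units).
    Returns 0.0 if the fitted value is already at or above the threshold,
    and None if the trend is flat or improving.
    """
    slope, intercept = fit_line(points)
    last_x = max(x for x, _ in points)
    current = slope * last_x + intercept
    if current >= threshold:
        return 0.0
    if slope <= 0:
        return None
    return (threshold - current) / slope
```

For example, a link whose error rate climbs by one unit per hour from 1.0 to 4.0 would be estimated to reach a threshold of 10.0 about six hours after the last sample, which gives an administrator a maintenance window before the link fails.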
Future enhancements with NVIDIA Morpheus
Bringing the most robust fabric management solution for InfiniBand requires constant innovation to keep pace with the complexity of managing today’s data centers. We plan to integrate NVIDIA Morpheus with UFM Cyber-AI (Figure 3), bringing in more telemetry information from other data center elements, such as server- or rack-based component telemetry and DPU, GPU, and application counters.
We could even provide an additional layer that can interface directly with other APIs such as Kafka, an open-source distributed event streaming platform used for high-performance data pipelines, streaming analytics, and data integration. You could use that integration for specific detection of developer-defined operational system exceptions, such as crypto-mining detection on a system dedicated for life-science research.
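A developer-defined detection of this kind could look like the sketch below, which applies a rule to JSON-encoded event messages such as those a streaming consumer might deliver. No Kafka client is used here, so the code stays self-contained. The event fields (`node`, `process`), the rule builder, and the denylist entry are all hypothetical and do not reflect any UFM Cyber-AI or Morpheus interface.

```python
import json


def make_denylist_rule(denied_processes):
    """Build a rule that fires when an event names a denied process."""
    denied = {name.lower() for name in denied_processes}

    def rule(event):
        process = str(event.get("process", "")).lower()
        if process in denied:
            node = event.get("node", "unknown")
            return f"prohibited process '{process}' on node {node}"
        return None

    return rule


def evaluate(messages, rules):
    """Decode JSON messages and yield one alert string per rule that fires."""
    for raw in messages:
        try:
            event = json.loads(raw)
        except ValueError:
            continue  # ignore malformed payloads in this sketch
        if not isinstance(event, dict):
            continue  # only JSON objects are treated as events
        for rule in rules:
            alert = rule(event)
            if alert is not None:
                yield alert
```

In a deployment, `messages` would be the payloads read from an event stream, and each rule would encode one site-specific policy, such as flagging a known mining binary on a cluster reserved for life-science workloads.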
Morpheus is an open AI application framework that provides cybersecurity developers with a highly optimized AI pipeline and pretrained AI capabilities. These capabilities enable you to inspect all network traffic instantaneously across your data center fabric. Morpheus brings a new level of security to data centers by providing the following:
- Dynamic protection
- Real-time telemetry
- Adaptive policies
- Cyber defenses for detecting and remediating cybersecurity threats
As Morpheus integrates into the UFM Cyber-AI appliance, we can offer the best and most complete solution that is also flexible and extendable for mission-critical data centers and supporting developers. With customizable anomaly detection and interfaces to other standardized APIs, UFM Cyber-AI is a flexible asset for any data center or cloud-native infrastructure supporting multitenancy.
For more information, see NVIDIA Unified Fabric Manager.