Finding Out Where Your Application and Network Intersect

Modern data centers can run thousands of services and applications. When an issue occurs, as a network administrator, you are guilty by default. You have to prove your innocence on a daily basis, as it is easy to blame the network. It is an unfair world.

Correlating application performance issues to the network is hard. You can start by checking basic connectivity with simple pings or traceroutes, consulting your SNMP-based monitoring tools, running sniffers, or even reading device counters to look for drops. In the meantime, users suffer from application slowness, poor performance, or even unavailability.

Unfortunately, all these classic network troubleshooting methods are time-consuming and don’t guarantee success, as it is sometimes nearly impossible to pinpoint problems using them.

NetQ to the rescue

To facilitate network troubleshooting, NVIDIA developed NetQ—a scalable, modern network operations toolset that provides network visibility in real time.

The NetQ team recently introduced a unique flow analysis tool that enhances this visibility further. Flow analysis allows network administrators to instantly correlate service traffic flows to the paths taken in the fabric, dramatically reducing the mean time to innocence (MTTI) or even confirming that the network is not at fault.

Flow analysis enables you to discover and visualize all paths that a specific application’s traffic flow takes between endpoints in the fabric. It monitors the fabric-wide latency and buffer utilization statistics. With EVPN and multi-tenancy becoming the standard solution in most modern data centers, the flow analysis tool was designed to sample TCP or UDP data on overlay and underlay networks within different VRFs.

Flow analysis becomes even more powerful when used with What Just Happened (WJH) ASIC telemetry. While flows are being analyzed, flow-related WJH events from all switches in the traffic paths are presented to help you discover whether drops caused the service issue. Together, these two features maximize the probability of pinpointing the actual problem affecting an application.

Figure 1. NetQ flow analysis dashboard, showing latency results and a flow graph

By the numbers

Flow analysis is supported on NVIDIA Spectrum 2 and later switches running Cumulus Linux 5.0 or later. It can also provide partial-path discovery for brownfield deployments with unsupported switches or switches running older versions of Cumulus Linux or SONiC.

Flow analysis samples traffic based on the packet’s 4-tuple or 5-tuple, including VXLAN inner and outer headers. The sampling lifetime can be set to 10, 15, 20, or 30 minutes. You can run the analysis immediately on creation or schedule it for a later time.
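
To make the tuple matching concrete, here is a minimal Python sketch of how a flow selector could be modeled. It is an illustration only, not NetQ code or its API: the `FlowFilter` class, its field names, and the sample addresses are hypothetical, and it assumes that the 4-tuple form simply leaves out the source port.

```python
from dataclasses import dataclass
from typing import Optional

# Sampling lifetimes (in minutes) listed in the text above.
ALLOWED_LIFETIMES_MIN = (10, 15, 20, 30)


@dataclass(frozen=True)
class FlowFilter:
    """Hypothetical flow selector; not part of any NetQ API."""
    src_ip: str
    dst_ip: str
    dst_port: int
    protocol: str                    # "tcp" or "udp"
    src_port: Optional[int] = None   # Assumption: None means a 4-tuple match
    lifetime_min: int = 10

    def __post_init__(self) -> None:
        if self.protocol not in ("tcp", "udp"):
            raise ValueError("protocol must be 'tcp' or 'udp'")
        if self.lifetime_min not in ALLOWED_LIFETIMES_MIN:
            raise ValueError(f"lifetime must be one of {ALLOWED_LIFETIMES_MIN}")

    def matches(self, pkt: dict) -> bool:
        """Return True if a packet's header fields belong to this flow."""
        if (pkt["src_ip"], pkt["dst_ip"], pkt["dst_port"], pkt["protocol"]) != (
            self.src_ip, self.dst_ip, self.dst_port, self.protocol
        ):
            return False
        # With a 4-tuple, any source port is accepted.
        return self.src_port is None or pkt["src_port"] == self.src_port


# Example: a 4-tuple filter for HTTPS traffic between two made-up hosts.
flow = FlowFilter("10.1.10.101", "10.1.20.105", 443, "tcp", lifetime_min=15)
packet = {
    "src_ip": "10.1.10.101", "dst_ip": "10.1.20.105",
    "src_port": 51514, "dst_port": 443, "protocol": "tcp",
}
print(flow.matches(packet))  # True
```

Leaving the source port unset is useful when the client uses ephemeral ports, because one filter then covers every connection to the same service.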

The sampling rate is also configurable: low (1 per 10,000 packets), medium (1 per 1,000), high (1 per 100), or all packets (1 per 1). The higher the sampling rate, the more accurate your analyzed data, but also the higher the CPU utilization, so I recommend setting lower sampling rates for heavy traffic flows.
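
The trade-off is easy to estimate with some quick arithmetic. The following Python sketch computes roughly how many packets would be sampled at each level; the function name and the example traffic rate are hypothetical, and it assumes a steady packet rate for the whole sampling lifetime.

```python
# Sampling levels from the text above: one packet sampled per N packets.
SAMPLE_ONE_PER = {"low": 10_000, "medium": 1_000, "high": 100, "all": 1}


def expected_samples(packets_per_second: int, lifetime_min: int, level: str) -> int:
    """Estimate sampled packets, assuming a constant packet rate (illustrative)."""
    total_packets = packets_per_second * lifetime_min * 60
    return total_packets // SAMPLE_ONE_PER[level]


# Example: a flow carrying 50,000 packets per second, analyzed for 10 minutes.
for level in SAMPLE_ONE_PER:
    print(f"{level:>6}: {expected_samples(50_000, 10, level):,} sampled packets")
```

For this example flow, the low setting yields about 3,000 samples while sampling all packets yields 30 million, which illustrates why lower rates are the safer choice for heavy flows.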

Try it yourself in NVIDIA Air

NVIDIA Air is a tool for creating data center digital twins. With Air, you can build your own Cumulus Linux virtual data center, test it, validate it with NetQ, explore features, and learn some best practices. It is entirely free to use!

Try out flow analysis by spinning up the prebuilt NVIDIA Air Infrastructure Simulation Platform demo in the Air Marketplace. Follow the guided tour and see the significant benefits that flow analysis with NetQ can bring to your organization.

For more information, see the following resources:
