
Juniper advances AI networking software with congestion control, load balancing

Juniper Networks is advancing the software for its AI-Native Networking Platform to help enterprise customers better manage and support AI in their data centers. The HPE acquisition target is also offering a new validated design for enterprise AI clusters and has opened a lab to certify enterprise AI data center projects.

Juniper’s AI-Native Networking Platform is aimed at unifying its campus, branch and data center networking products under a common AI engine. Central to the platform are the firm’s cloud-based, natural language Mist AI and Marvis virtual network assistant (VNA) technology. Juniper’s Mist AI engine analyzes data from networked access points and devices so it can detect anomalies and offer actionable resolutions. Marvis can detect and describe countless network problems, including persistently failing wired or wireless clients, bad cables, access-point coverage holes, problematic WAN links, and insufficient radio-frequency capacity.

Juniper has now extended its platform with a package of new features dubbed Operations for AI (Ops4AI). The additions enable congestion control, load-balancing and management capabilities for systems controlled by the vendor’s core Junos and Juniper Apstra data center intent-based networking software. 

Fabric autotuning for AI

For network congestion, the company added a feature called fabric autotuning for AI that gathers telemetry data from routers and switches to automatically calculate and configure optimal parameter settings for fabric congestion control, according to Amit Sanyal, Juniper’s product, solution and marketing lead, who described the enhancements in a blog post.

“Remote Direct Memory Access (RDMA) from GPUs drives massive network traffic in AI networks,” Sanyal wrote. “Despite congestion avoidance techniques like load-balancing, there are situations when there is congestion (e.g., traffic from multiple GPUs going to a single GPU at the last hop switch). When this occurs, customers use congestion control techniques such as Data Center Quantized Congestion Notification (DCQCN). DCQCN uses features like Explicit Congestion Notification (ECN) and Priority-based Flow Control (PFC) to calculate and configure parameter settings to get the best performance for each queue for each port across all switches. Setting these across thousands of queues across all switches manually is difficult and error-prone.”

To address this problem, Juniper Apstra collects telemetry information and calculates the optimal ECN and PFC parameter settings for each queue for each port. Using closed-loop automation, the optimal settings are automatically configured on all the switches in the network, Sanyal wrote.
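To make the closed loop concrete, here is a minimal Python sketch of the idea, not Juniper’s actual Apstra logic: telemetry comes in per queue per port, and tuned ECN parameters go back out. All class names, fields, and thresholds are hypothetical.

```python
# A minimal sketch of closed-loop ECN tuning -- not Juniper's actual Apstra
# logic. All class names, fields, and thresholds are hypothetical.
from dataclasses import dataclass

@dataclass
class QueueTelemetry:
    switch: str
    port: str
    queue: int
    avg_depth_kb: float   # average queue depth observed in the sample window
    pfc_pause_count: int  # PFC pause frames seen in the sample window

def tune_ecn_threshold(sample: QueueTelemetry) -> dict:
    """Derive a hypothetical ECN marking threshold from observed queue depth.

    The real system weighs far more signals; this only illustrates the
    telemetry-in, parameters-out shape of the closed loop.
    """
    # Mark earlier (lower threshold) when the queue runs deep or PFC is
    # firing, so RoCE senders back off before the queue overflows.
    base_kb = 150.0
    if sample.pfc_pause_count > 0:
        base_kb *= 0.5
    threshold_kb = min(base_kb, max(25.0, sample.avg_depth_kb * 0.8))
    return {
        "switch": sample.switch,
        "port": sample.port,
        "queue": sample.queue,
        "ecn_min_threshold_kb": round(threshold_kb, 1),
    }

# Telemetry arrives per queue, per port, across all switches in the fabric.
samples = [
    QueueTelemetry("leaf-1", "et-0/0/1", 3, avg_depth_kb=210.0, pfc_pause_count=4),
    QueueTelemetry("leaf-2", "et-0/0/5", 3, avg_depth_kb=40.0, pfc_pause_count=0),
]
for s in samples:
    print(tune_ecn_threshold(s))
```

In the real system, the computed settings would be pushed automatically to every switch in the fabric rather than printed.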

Apstra works by keeping a real-time repository of configuration, telemetry and validation information to ensure a network is doing what the organization wants it to do. Companies can use Apstra’s automation capabilities to deliver consistent network and security policies for workloads across physical and virtual infrastructures. In addition, Apstra performs regular network checks to safeguard configurations. It’s hardware agnostic, so it can be integrated to work with Juniper’s networking products as well as boxes from Cisco, Arista, Dell, Microsoft and Nvidia.
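The intent-based model Apstra follows can be illustrated with a toy validation pass: record the intended state, compare it against what telemetry reports, and flag drift. The device names, fields, and expected values below are invented for illustration.

```python
# A toy illustration of intent-based validation: keep a record of intended
# state, compare it against collected telemetry, and flag drift. Device
# names, fields, and expected values are invented for illustration.
intended = {
    "leaf-1": {"bgp_sessions": 4, "mtu": 9216},
    "leaf-2": {"bgp_sessions": 4, "mtu": 9216},
}

observed = {
    "leaf-1": {"bgp_sessions": 4, "mtu": 9216},
    "leaf-2": {"bgp_sessions": 3, "mtu": 1500},  # drifted from intent
}

def validate(intended: dict, observed: dict) -> list[str]:
    anomalies = []
    for device, want in intended.items():
        have = observed.get(device, {})
        for key, expected in want.items():
            actual = have.get(key)
            if actual != expected:
                anomalies.append(f"{device}: {key} is {actual}, expected {expected}")
    return anomalies

for issue in validate(intended, observed):
    print("DRIFT:", issue)
```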

Load balancing and visibility improvements

On the load balancing front, Juniper has added support for dynamic load balancing (DLB) that selects the optimal network path and delivers lower latency, better network utilization, and faster job completion times. From the workload perspective, this results in better AI performance and higher utilization of expensive GPUs, according to Sanyal.

“Compared to traditional static load balancing, DLB significantly enhances fabric bandwidth utilization. But one of DLB’s limitations is that it only tracks the quality of local links instead of understanding the whole path quality from ingress to egress node,” Sanyal wrote. “Let’s say we have a Clos topology and server 1 and server 2 are both trying to send data called flow-1 and flow-2, respectively. In the case of DLB, leaf-1 only knows the local links’ utilization and makes decisions based solely on the local switch quality table, where local links may be in a perfect state. But if you use global load balancing (GLB), you can understand the whole path quality, where congestion issues are present within the spine-leaf level.”
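Sanyal’s DLB-versus-GLB point reduces to which link qualities feed the path decision. The simplified Python model below (invented topology and quality scores; lower is better) shows how a local-only view can steer traffic onto a path that is congested one hop further in:

```python
# A simplified model of the DLB-vs-GLB distinction. The topology and link
# "quality" scores are invented; lower scores are better. Two paths lead
# from leaf-1 to leaf-3, each via a different spine.
paths = {
    "via-spine-1": [("leaf1->spine1", 0.1), ("spine1->leaf3", 0.9)],  # hot spine link
    "via-spine-2": [("leaf1->spine2", 0.3), ("spine2->leaf3", 0.2)],
}

def dlb_choice(paths: dict) -> str:
    # DLB sees only the first (local) hop's quality.
    return min(paths, key=lambda p: paths[p][0][1])

def glb_choice(paths: dict) -> str:
    # GLB scores a path by its worst link end to end.
    return min(paths, key=lambda p: max(q for _, q in paths[p]))

print("DLB picks:", dlb_choice(paths))  # via-spine-1: local link looks perfect
print("GLB picks:", glb_choice(paths))  # via-spine-2: avoids the congested spine link
```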

In terms of visibility, Sanyal pointed out limitations in existing network performance management technologies:

“Today, admins can find out where congestion occurs by observing only the network switches. But they don’t have any visibility into which endpoints (GPUs, in the case of AI data centers) are impacted by the congestion. This leads to challenges in identifying and resolving performance issues. In a multi-training job environment, just by looking at switch telemetry, it is impossible to find which training jobs have been slowed down due to congestion without manually checking the NIC RoCE v2 stats on all the servers, which is not practical,” Sanyal wrote.

Juniper is addressing the issue by integrating RoCE v2 streaming telemetry from the AI Server SmartNICs with Juniper Apstra and correlating existing network switch telemetry; that integration and correlation “greatly enhances the observability and debugging workflows when performance issues occur,” Sanyal wrote. “This correlation allows for a more holistic network view and a better understanding of the relationships between AI servers and network behaviors. The real-time data provides insights into network performance, traffic patterns, potential congestion points, and impacted endpoints, helping identify performance bottlenecks and anomalies.”

“This capability enhances network observability, simplifies debugging performance issues, and helps improve overall network performance by taking closed-loop actions. For instance, monitoring out-of-order packets in the SmartNICs can help tune the parameters in the smart load-balancing feature on the switch,” Sanyal wrote.
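A rough sketch of that correlation workflow, with entirely invented records and field names, might join switch congestion telemetry against a NIC inventory and the RoCE v2 counters streamed from the SmartNICs to name the impacted training jobs:

```python
# A hypothetical sketch of the correlation workflow: join switch congestion
# telemetry with a NIC inventory and the RoCE v2 counters streamed from the
# SmartNICs to name the impacted training jobs. All records are invented.
switch_congestion = [
    {"switch": "leaf-1", "port": "et-0/0/3", "ecn_marks": 12000},
]

# Which server NIC hangs off which switch port, and which job owns it.
nic_inventory = [
    {"server": "gpu-srv-07", "switch": "leaf-1", "port": "et-0/0/3", "job": "train-42"},
    {"server": "gpu-srv-08", "switch": "leaf-1", "port": "et-0/0/4", "job": "train-43"},
]

# Per-server RoCE v2 stats streamed from the SmartNICs.
nic_stats = {
    "gpu-srv-07": {"out_of_order_pkts": 930, "cnp_received": 4100},
    "gpu-srv-08": {"out_of_order_pkts": 2, "cnp_received": 0},
}

def impacted_jobs(congestion: list, inventory: list, stats: dict) -> set[str]:
    hot_ports = {(c["switch"], c["port"]) for c in congestion}
    jobs = set()
    for nic in inventory:
        if (nic["switch"], nic["port"]) in hot_ports:
            s = stats.get(nic["server"], {})
            # Congestion on the port *and* RoCE distress on the NIC.
            if s.get("cnp_received", 0) > 0 or s.get("out_of_order_pkts", 0) > 100:
                jobs.add(nic["job"])
    return jobs

print("Impacted training jobs:", impacted_jobs(switch_congestion, nic_inventory, nic_stats))
```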

Launching the Ops4AI Lab

In addition to announcing the new Ops4AI capabilities, Juniper launched the Ops4AI Lab, where customers can test and validate AI data center configurations to ensure automated switching, routing, storage and compute operations will work smoothly, the vendor stated. The Ops4AI Lab is in Juniper’s Sunnyvale, Calif., headquarters.

Also new for customers are Juniper’s Validated Designs, which are aimed at helping them architect well-tested and repeatable schemes based on specific platforms and software, according to the vendor. Juniper’s first pre-validated blueprint is specifically for AI data centers and specifies Nvidia A100 and H100 compute and storage as well as data center leaf and spine switches from Juniper’s switch/router portfolio. Juniper also supports designs that include Broadcom, Intel, WEKA, and other partners.

AI is a key driver of Hewlett Packard Enterprise’s proposed $14 billion acquisition of Juniper Networks, a deal that might not close until early 2025. Networking will become the new core business and architecture foundation for HPE’s hybrid cloud and AI solutions delivered through the company’s GreenLake hybrid cloud platform, the companies stated.

AI infrastructure is expected to be a huge business in the future. IDC predicts that by 2025, the 2,000 most prominent companies in the world will allocate over 40% of core IT spending to AI initiatives, driving a double-digit increase in the rate of product and process innovations.

These new business use cases require new infrastructure, and 2024 will see the beginning of explosive growth in AI infrastructure, particularly in data center networking. IDC estimates that the market for generative AI data center Ethernet switching will reach $9 billion in 2028, with a compound annual growth rate of 70%.

Source: Network World
