
Arista Networks has added load balancing and AI job-centric observability features to core software products in an effort to help enterprise customers grow and effectively manage AI networked environments.
A new AI cluster performance and load balancing feature is now part of Arista’s flagship Extensible Operating System (EOS) that runs across its networking portfolio. Arista has also bolstered its CloudVision management package to better troubleshoot AI jobs as they traverse the network.
The Cluster Load Balancing (CLB) feature, which is part of Arista’s EOS Smart AI Suite of tools for managing AI networks, is an Ethernet-based, remote direct memory access (RDMA)-aware package that ensures high bandwidth utilization and low latency between AI clusters and the spine and leaf networks they are connected to.
“AI jobs cannot tolerate high latency or slow flows like traditional networks can handle,” said Praful Bhaidasna, director of product management for Arista. “AI jobs rely on everything getting finished before they move on to the next step, so one slow flow can grind everything to a halt.”
CLB works by monitoring RDMA over Ethernet connections to track traffic flows, then using that status information to distribute the flows optimally across the available links, Bhaidasna said.
“CLB will guarantee that latency will be low, so you will not have any slow flows and all your links will be utilized to the maximum. You will not have some links that are super congested because of big flows, and some not, because every flow is big in the AI world,” Bhaidasna said.
The other key feature of CLB is that it is GPU- and NIC-agnostic: it ensures balanced utilization regardless of which GPUs or NICs are deployed, Bhaidasna said.
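To make the load-balancing argument concrete, here is a minimal Python sketch. It is not Arista's CLB algorithm, whose internals are not described here; it is a toy comparison under stated assumptions. Eight equal-sized flows (hypothetical names and a made-up 100 Gbps each) are placed on four uplinks, first by a static hash of the flow identifier and then by always picking the currently least-loaded link. The step time is modeled as being set by the busiest link, which mirrors the point that one slow flow holds up the whole job.

```python
import zlib

def hash_placement(flows, num_links):
    """Static placement: pin each flow to a link by hashing its identifier."""
    load = [0.0] * num_links
    for flow_id, gbps in flows:
        link = zlib.crc32(flow_id.encode()) % num_links
        load[link] += gbps
    return load

def least_loaded_placement(flows, num_links):
    """Load-aware placement: put each flow on the link carrying the least traffic."""
    load = [0.0] * num_links
    for _flow_id, gbps in flows:
        link = min(range(num_links), key=lambda i: load[i])
        load[link] += gbps
    return load

def step_time(load, link_capacity_gbps):
    """Relative completion time of a step, set by the most heavily loaded link."""
    return max(load) / link_capacity_gbps

# Hypothetical workload: 8 GPU-to-GPU flows of 100 Gbps each over 4 uplinks.
flows = [(f"gpu{i}->gpu{(i + 1) % 8}", 100.0) for i in range(8)]
hashed = hash_placement(flows, 4)
balanced = least_loaded_placement(flows, 4)

print("hash placement load per link:        ", hashed)
print("least-loaded placement load per link:", balanced)
print("relative step time (hash):        ", step_time(hashed, 400.0))
print("relative step time (least-loaded):", step_time(balanced, 400.0))
```

With a small number of large flows, a hash can easily land several of them on the same link, while the load-aware placement spreads them evenly. A real implementation would also have to react to flows starting and stopping, which this sketch ignores.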
Updates to Arista CloudVision platform
To help enterprises manage AI and networking environments, Arista added AI job-centric observability to its CloudVision Universal Network Observability (CV UNO) system to improve troubleshooting. CV UNO is a licensed component of Arista’s CloudVision as-a-service platform. It’s designed to gather network telemetry and analytical data and meld it with AI and machine learning technologies to offer real-time network flow and application performance details, risk and incident analysis, and change impact management.
CV UNO will let customers correlate network data and AI job metrics to optimize AI job performance and pinpoint bottlenecks and hardware issues affecting AI workload performance, Bhaidasna said. The system can see AI job completion times, congestion indicators and buffer/link utilization to ensure uninterrupted, high-efficiency AI workload execution.
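The idea of correlating job metrics with network data can be sketched in a few lines of Python. This is an illustration only, not CV UNO's data model or API: the `JobStep` and `CongestionEvent` types, the job name, the link names and all timestamps are hypothetical. The sketch finds the slowest step of a job and lists the congestion events that fall inside its time window.

```python
from dataclasses import dataclass

@dataclass
class JobStep:
    job: str
    start_s: float
    end_s: float

    @property
    def duration_s(self) -> float:
        return self.end_s - self.start_s

@dataclass
class CongestionEvent:
    link: str
    time_s: float

def events_during(step, events):
    """Return the congestion events whose timestamp falls inside the step's window."""
    return [e for e in events if step.start_s <= e.time_s <= step.end_s]

# Hypothetical data: two steps of one training job and three congestion events.
steps = [JobStep("train-a", 0.0, 10.0), JobStep("train-a", 12.0, 40.0)]
events = [
    CongestionEvent("leaf3:Et12", 20.0),
    CongestionEvent("leaf3:Et12", 24.5),
    CongestionEvent("spine1:Et4", 50.0),
]

slowest = max(steps, key=lambda s: s.duration_s)
suspects = events_during(slowest, events)
print(f"slowest step took {slowest.duration_s} s")
print("links congested during that step:", sorted({e.link for e in suspects}))
```

Joining on time windows like this is the simplest possible correlation. A production system would presumably also join on topology, that is, on which links a job's flows actually traverse.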
“Traditionally, when we look at network health, we look at it as a point in time. Like you will have some SNMP data returning to you at a particular time interval, telling you the interface is up/down or a switch is down, etc. But what happened in the middle? You have absolutely no visibility,” Bhaidasna said. “CV UNO eliminates the guesswork about what is actually going on there, and AI can spot issues and offer suggestions about how to fix them before they become a problem.”
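The visibility gap between polling intervals is easy to demonstrate with a toy simulation in Python. All numbers here are assumptions chosen for illustration: a 120-second window, a 5-second queue buildup starting at second 41, and a 30-second polling interval. None of them come from Arista or from any SNMP default.

```python
def queue_depth_series(duration_s, burst_start_s, burst_len_s, burst_depth):
    """Per-second queue depth: zero except during one short burst."""
    return [
        burst_depth if burst_start_s <= t < burst_start_s + burst_len_s else 0
        for t in range(duration_s)
    ]

def polled_view(series, interval_s):
    """What a poller sees when it samples once every interval_s seconds."""
    return series[::interval_s]

series = queue_depth_series(duration_s=120, burst_start_s=41, burst_len_s=5, burst_depth=900)
polled = polled_view(series, interval_s=30)  # samples at t = 0, 30, 60, 90

streamed_peak = max(series)  # a per-second stream sees every value
polled_peak = max(polled)    # the poller only sees its four samples

print("peak queue depth seen by per-second stream:", streamed_peak)
print("peak queue depth seen by 30 s polling:     ", polled_peak)
```

In this setup the burst falls entirely between two polls, so the polled view reports an idle queue while the per-second series records the full spike.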
CLB is available now on Arista’s 7260X3, 7280R3, and 7500R3 switches and the 7800R3 Etherlink platform. Support on the 7060X6 and 7060X5 Etherlink platforms is scheduled for Q2 2025. Support for the 7800R4 Etherlink 800G AI spine box is scheduled for the second half of 2025.
CV UNO is available now, and the observability enhancements for AI are in active customer trials, with general availability scheduled for Q2 2025, Arista stated.
Source: Network World