Arista Networks is developing a software-based agent to help efficiently tie together the network and server systems in large AI clusters.
As part of that development, Arista collaborated with Nvidia to use Nvidia's BlueField-3 SuperNIC, which specifically targets large-scale AI workloads and promises 400Gbps of bandwidth using remote direct memory access (RDMA) over Converged Ethernet (RoCE) to maximize throughput between GPU servers.
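For a sense of what RoCE readiness looks like from the host side, the sketch below is a minimal, hypothetical check using the standard rdma utility from Linux's iproute2 tooling. It reflects generic Linux RDMA plumbing rather than anything specific to Arista's agent or the BlueField-3, and device names will vary by system.

```python
# Hypothetical host-side check: is at least one RDMA-capable link up to carry RoCE traffic?
# Relies on the standard `rdma` utility (iproute2); nothing here is Arista- or Nvidia-specific.
import subprocess

def active_rdma_links() -> list[str]:
    """Return the `rdma link show` lines reporting an ACTIVE, LINK_UP port."""
    out = subprocess.run(["rdma", "link", "show"],
                         capture_output=True, text=True, check=True).stdout
    # Typical line: "link mlx5_0/1 state ACTIVE physical_state LINK_UP netdev eth0"
    return [line.strip() for line in out.splitlines()
            if "state ACTIVE" in line and "LINK_UP" in line]

if __name__ == "__main__":
    links = active_rdma_links()
    if links:
        for line in links:
            print("RoCE-capable link up:", line)
    else:
        print("no active RDMA links found")
```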
The AI agent is based on Extensible Operating System (EOS), Arista’s flagship network operating system that runs and manages all of its switches and routers, and it integrates network features and connected GPUs into a single manageable package.
Running on Arista switches, the EOS AI agent can be extended to directly attached NICs and servers to allow a single point of control and visibility across an AI data center, according to Arista CEO Jayshree Ullal, who wrote a blog post about the new agent. “This remote AI agent, hosted directly on an Nvidia BlueField-3 SuperNIC or running on the server and collecting telemetry from the SuperNIC, allows EOS, on the network switch, to configure, monitor, and debug network problems on the server, ensuring end-to-end network configuration and QoS consistency,” Ullal stated.
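Arista has not published the agent's internals, but the division of labor described above, an agent on the SuperNIC or server collecting NIC telemetry and reporting back to EOS on the switch, can be sketched in generic terms. The following hypothetical host-side collector polls interface counters with the standard ethtool -S command and emits timestamped JSON; the interface name and the stdout transport are placeholders, not Arista's actual agent or protocol.

```python
# Hypothetical host-side telemetry collector: poll NIC counters and emit JSON records.
# The interface name and the stdout "transport" are placeholders; a real agent would
# stream these records to the switch-side EOS over whatever protocol Arista defines.
import json
import subprocess
import time

INTERFACE = "eth0"  # placeholder; on a BlueField-3 host this would be the SuperNIC port

def nic_counters(interface: str) -> dict:
    """Parse `ethtool -S <interface>` output into a {counter_name: value} dict."""
    out = subprocess.run(["ethtool", "-S", interface],
                         capture_output=True, text=True, check=True).stdout
    counters = {}
    for line in out.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue
        try:
            counters[name.strip()] = int(value.strip())
        except ValueError:
            pass  # skip headers and non-numeric statistics
    return counters

if __name__ == "__main__":
    while True:
        record = {"ts": time.time(), "interface": INTERFACE,
                  "counters": nic_counters(INTERFACE)}
        print(json.dumps(record))
        time.sleep(10)
```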
“The remote agent deployed on the AI NIC/server transforms the switch to become the epicenter of the AI network to configure, monitor and debug problems on the AI Hosts and GPUs,” Ullal stated. “This allows a singular and uniform point of control and visibility. Leveraging the remote agent, configuration consistency including end-to-end traffic tuning can be ensured as a single homogenous entity.”
With tracking and reporting of host and network behaviors, failures can be isolated through communication between EOS running in the network and the remote agent on the host, Ullal stated. “This means that EOS can directly report the network topology, centralizing the topology discovery and leveraging familiar Arista EOS configuration and management constructs across all Arista Etherlink platforms and partners,” she added.
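As a rough illustration of what centralizing topology discovery implies, the hypothetical sketch below merges LLDP-style neighbor reports from switches and host agents into a single adjacency map. The device and port names are invented, and EOS's real discovery mechanism and data model are not described here.

```python
# Hypothetical sketch: assemble one topology view from per-device neighbor reports.
# Each report is (local_device, local_port, remote_device, remote_port); values are invented.
from collections import defaultdict

neighbor_reports = [
    ("leaf1", "Ethernet1", "gpu-host-01", "bf3-p0"),
    ("leaf1", "Ethernet2", "gpu-host-02", "bf3-p0"),
    ("spine1", "Ethernet1", "leaf1", "Ethernet49"),
]

def build_topology(reports):
    """Return {device: [(local_port, peer_device, peer_port), ...]} covering both endpoints."""
    topology = defaultdict(list)
    for local, lport, remote, rport in reports:
        topology[local].append((lport, remote, rport))
        topology[remote].append((rport, local, lport))
    return topology

if __name__ == "__main__":
    for device, links in sorted(build_topology(neighbor_reports).items()):
        for lport, peer, pport in links:
            print(f"{device}:{lport} <-> {peer}:{pport}")
```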
Arista’s Etherlink technology will be supported across a range of products, including 800G systems and line cards, and will be compatible with specifications from the Ultra Ethernet Consortium.
The need for such a technology package is being driven by the explosion of AI datasets, as well as by the challenge of coordinating the complex web of components, such as GPUs, NICs, switches, optics, and cables, in large AI clusters, according to Ullal:
“As the size of large language models (LLMs) increases for AI training, data parallelization becomes inevitable. The number of GPUs needed to train these larger models cannot keep up with the massive parameter count and the dataset size. AI parallelization, be it data, model, or pipeline, is only as effective as the network that interconnects the GPUs. GPUs must exchange and compute global gradients to adjust the model’s weights. To do so, the disparate components of the AI puzzle have to work cohesively as one single AI Center: GPUs, NICs, interconnecting accessories such as optics/cables, storage systems, and most importantly the network in the center of them all.”
The whole system rises together for optimum performance, rather than foundering in isolation as with prior network silos, Ullal stated. “The AI Center shines by eliminating silos to enable coordinated performance tuning, troubleshooting, and operations, with the central network playing a pivotal role to create and power the linked system.”
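To put the gradient-exchange point in concrete terms, here is a back-of-the-envelope sketch, not Arista's math, estimating per-GPU traffic and ideal transfer time for one data-parallel ring all-reduce. The model size, gradient precision, GPU count, and per-GPU 400Gbps link rate are illustrative assumptions.

```python
# Rough estimate of the network cost of one data-parallel gradient exchange (ring all-reduce).
# All inputs are illustrative assumptions, not measurements of any Arista or Nvidia system.

def allreduce_seconds(params: float, bytes_per_param: int, gpus: int, link_gbps: float) -> float:
    """Ideal, bandwidth-only time for one ring all-reduce of the full gradient."""
    gradient_bytes = params * bytes_per_param
    # A ring all-reduce moves roughly 2 * (N - 1) / N of the gradient through each GPU's link.
    per_gpu_bytes = 2 * (gpus - 1) / gpus * gradient_bytes
    link_bytes_per_second = link_gbps * 1e9 / 8
    return per_gpu_bytes / link_bytes_per_second

if __name__ == "__main__":
    # Example: 70B-parameter model, fp16 gradients, 1,024 GPUs, one 400Gbps RoCE link per GPU.
    t = allreduce_seconds(params=70e9, bytes_per_param=2, gpus=1024, link_gbps=400.0)
    print(f"~{t:.2f} s of pure transfer per gradient exchange")  # ignores latency and compute overlap
```

Even this idealized, bandwidth-only estimate works out to several seconds per exchange for a 70B-parameter model, which is why the interconnect, and not just the GPUs, sets the pace of large-scale training.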
Arista said it will demonstrate the AI agent technology at the 10th anniversary of its IPO at the NYSE on June 5th, with customer trials expected in the second half of 2024.
In a related AI development, Arista said it has partnered with Vast Data to help customers build high-performance infrastructure for AI development.
Launched in 2019, Vast offers an integrated storage, database and computing package aimed at managing large-scale development of AI workloads in data centers and cloud environments, according to the vendor.
Under terms of the agreement, Arista switches have been certified to work within the Vast environment, and the companies will work together to integrate security and management technologies for use in AI-based infrastructures. Customers will be able to see, for example, how AI data flows from edge to core to cloud, according to Vast.
“The VAST Data Platform optimizes organizations’ data sets using proprietary Similarity data reduction and compression capabilities that help significantly reduce power consumption and improve efficiency,” the vendor stated in a release.
Beyond Arista, Vast has a variety of partnerships, including with Nvidia, whose BlueField-3 SuperNIC it utilizes, and with Hewlett Packard Enterprise, which uses Vast’s technology in its HPE GreenLake for File Storage offering.
Source: Network World