Nvidia turns to software to speed up its data center networking hardware for AI

Nvidia wants to make long-haul GPU-to-GPU communication over Ethernet faster and more reliable, and hopes to achieve that with new Ethernet algorithms introduced on Friday.

The Spectrum-XGS algorithms are software protocols baked into Nvidia’s latest Ethernet gear. The algorithms automatically adjust long-distance networking performance so distributed GPUs in servers across multiple data centers operate as a single, unified AI supercomputer.

“It’s not a new hardware element, but it’s leveraging the Spectrum-X infrastructure, and the new algorithms effectively move more data across longer distances between sites,” Gilad Shainer, senior vice president of networking at Nvidia, told Network World in an interview.

Shainer is sharing details about the technology on Aug. 26 at the Hot Chips conference in Palo Alto, California.

Companies are spreading out data-center installations because of size and power caps, and as a result are distributing GPUs over longer distances, Shainer said.

XGS algorithms adjust long-distance network performance by analyzing real-time telemetry that includes distances between data centers, traffic patterns, congestion levels and performance metrics. The algorithms then adjust congestion control, routing, and load balancing.

Traditional Ethernet typically treats all connections the same, while XGS “auto attunes the algorithm based on the distance that you need to cover,” Shainer said.
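Nvidia has not published the XGS algorithms themselves, but the behavior Shainer describes, tuning transmission behavior to the distance being covered, can be sketched conceptually. The following Python snippet is a minimal, hypothetical illustration; the function and parameter names, link speed, and tuning factors are assumptions for the example, not Nvidia's implementation.

```python
from dataclasses import dataclass

# Hypothetical sketch only: Nvidia has not published the XGS algorithms.
# This illustrates the idea of tuning congestion control to link distance.

@dataclass
class LinkTelemetry:
    distance_km: float   # estimated distance between sites
    rtt_ms: float        # observed round-trip time on the link
    congestion: float    # 0.0 (idle) .. 1.0 (saturated)

def tune_congestion_control(t: LinkTelemetry) -> dict:
    """Pick an in-flight window and pacing factor based on distance.

    A long-haul link needs a larger in-flight window (bandwidth-delay
    product) and gentler backoff than a link inside one data center.
    """
    # Rough bandwidth-delay product, assuming a 400 Gb/s link.
    bdp_bytes = (400e9 / 8) * (t.rtt_ms / 1000)

    if t.distance_km < 1:            # inside a single data center
        profile = "intra-dc"
        window = bdp_bytes
    else:                            # scale-across, between data centers
        profile = "long-haul"
        window = bdp_bytes * 1.5     # headroom for RTT variation

    # Back off pacing as congestion rises; short links need sharper backoff
    # because their queues build and drain faster.
    backoff = 1.0 - t.congestion * (0.5 if profile == "intra-dc" else 0.25)

    return {"profile": profile,
            "in_flight_bytes": int(window),
            "pacing_factor": round(backoff, 2)}

print(tune_congestion_control(LinkTelemetry(distance_km=300, rtt_ms=3.0, congestion=0.4)))
```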

Spectrum-XGS implementations are underway for data centers that are hundreds of kilometers apart. The technology is built into Spectrum-X switches, ConnectX-8 SuperNICs, and systems with Blackwell GPUs.

“Those algorithms are different than the ones that run inside a data center,” Shainer said.

Customizing the standard

Ethernet is an industry standard, but vendors typically make their own adjustments in their Ethernet gear.

Spectrum-XGS is possibly Nvidia’s first custom Ethernet enhancement for long-distance GPU and AI communication, said Jim McGregor, principal analyst at Tirias Research.

“If you can estimate distance, it improves overall performance. It’s one thing doing it inside data centers, it’s a whole different thing estimating performance between data centers,” McGregor said.

GPUs will eventually spread out over longer distances because of power and cost constraints, McGregor said.

“This may work for modular data centers, like in shipping containers, which customers plop down and connect them with scale-across networks,” McGregor said.

The technology could help companies running multi-campus training clusters and constrained by available power in a deployment region, said Leonard Lee, executive analyst at Next Curve.

“It appears to be primarily for training at the moment… but there is little doubt that XGS will find opportunities in inference,” Lee said.

Shainer said vendor customization of Ethernet gear depends on the implementation: virtualized data centers typically focus on small packets, hyperscale providers focus on throughput, and service providers target deeper buffers for longer distances.

Nvidia’s XGS adjustments include “fine-grain adaptive routing, packet by packet,” which avoids both dropped packets and deep buffers, in which packets back up to prevent loss, Shainer said.

Typically, chunks of an AI task are distributed across GPUs, which then coordinate to produce a unified output. Adaptive routing ensures the network and GPUs over long distances stay in sync when running AI workloads, Shainer said.
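Packet-by-packet adaptive routing contrasts with classic flow-hashing schemes, which pin an entire flow to one path. The sketch below is a hypothetical illustration of the per-packet idea, sending each packet on whichever uplink currently has the least queued data; the class and method names are invented for the example and do not describe Nvidia's switch logic.

```python
import heapq

# Hypothetical illustration of per-packet adaptive routing: each packet goes
# to the least-loaded uplink instead of a flow-hashed fixed path.

class AdaptiveRouter:
    def __init__(self, num_uplinks: int):
        # (queued_bytes, port_id) kept in a min-heap for cheap selection
        self.ports = [(0, p) for p in range(num_uplinks)]
        heapq.heapify(self.ports)

    def send(self, packet_bytes: int) -> int:
        queued, port = heapq.heappop(self.ports)   # pick least-loaded uplink
        heapq.heappush(self.ports, (queued + packet_bytes, port))
        return port

router = AdaptiveRouter(num_uplinks=4)
for size in [4096, 4096, 1500, 9000, 4096]:
    print("packet ->", router.send(size))
```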

Jitter bugs

“If I retransmit the packet, I create jitter, which means one GPU out of many will be delayed and all the others have to wait for that GPU to finish,” Shainer said.
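The straggler effect Shainer describes can be shown with a toy calculation: a synchronized collective step finishes only when the slowest GPU finishes, so one retransmission delay on one link idles every other GPU in the job. The numbers below are made up purely to illustrate the point.

```python
# Toy illustration of jitter: one delayed GPU stalls a synchronized step.

step_time_ms = [10.0] * 8      # eight GPUs, nominally 10 ms per step
step_time_ms[3] += 2.5         # one GPU hit by a retransmit delay

collective_time = max(step_time_ms)              # everyone waits for the slowest
wasted = sum(collective_time - t for t in step_time_ms)

print(f"step completes in {collective_time} ms")
print(f"idle GPU-milliseconds lost to jitter: {wasted:.1f}")
```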

The congestion control improvements remove bottlenecks by balancing transmissions across switches.

Nvidia tested XGS algorithms in its server hardware and measured a 1.9x improvement in GPU-to-GPU communication compared to off-the-shelf networking technology, executives said during a briefing on the technology.

Cloud providers already have long-distance high-speed networks. For example, Google’s large-scale Jupiter network uses optical switching for fast communications between its AI chips, which are called TPUs.

It is important to separate the physical infrastructure from the software algorithms like XGS, Shainer said.

The fiber networks that span the continent already exist to connect different systems, but the evolving software protocols that run on top of those networks determine actual performance, he said.

A change from InfiniBand

Ethernet has a fifty-year history, but it hasn’t been a typical hunting ground for Nvidia, a longtime promoter of InfiniBand networking technology, for long-distance GPU communications.

But the industry is increasingly moving toward Ethernet, which is an open standard, for reasons that include cost, Tirias Research’s McGregor said.

Buying XGS technology will likely lock customers into other Nvidia products, Next Curve’s Lee said.

“Nvidia wants to provide a full-stack for its hardware but mix and match optionality with products such as NVLink Fusion,” Lee said.

Networking is becoming an important market for Nvidia, generating $5 billion in the most recent earnings quarter, which ended on April 27, up 56% from the year-ago quarter.

But the competition is also growing, with Ethernet players such as Arista, Cisco, Ciena, and Broadcom adapting their campus and regional optical networking products, Lee said.

Source: Network World