Benchmarking NVIDIA Spectrum-X for AI Network Performance, Now Available from Supermicro

NVIDIA Spectrum-X is swiftly gaining traction as the leading networking platform tailored for AI in hyperscale cloud infrastructures. Spectrum-X networking technologies help enterprise customers accelerate generative AI workloads. NVIDIA announced significant OEM adoption of the platform in a November 2023 press release, along with an update on the NVIDIA Israel-1 Supercomputer powered by Spectrum-X.

NVIDIA now announces that Supermicro has joined as an OEM partner for the Spectrum-X platform. Spectrum-X will be incorporated into Supermicro GPU SuperServers, available in 4U, 5U, and 8U form factors, and will support NVIDIA GPUs in the NVIDIA HGX H100, NVIDIA H100, and NVIDIA L40S PCIe form factors.

These Supermicro systems significantly reduce the training and inference durations of large transformer-based generative AI models by delivering exceptional network performance, ensuring multi-tenant performance isolation, and enhancing energy efficiency. These advancements are achieved while adhering to Ethernet networking standards and leveraging the NVIDIA Spectrum-4 Ethernet switch and NVIDIA BlueField-3 SuperNIC.

NVIDIA looks forward to working with Supermicro to bring enhanced value to our joint AI cloud and hyperscale infrastructure customers.

NVIDIA Spectrum-X performance benchmarks

With the ongoing development of the NVIDIA Israel-1 data center, we have performed a variety of benchmarks that highlight the performance benefits of Spectrum-X. The initial results are excellent, as detailed below.

Basic network health (RDMA)

This first benchmark showcases the basic network health of the system. AI workloads are built around utilizing GPUs, and require high-bandwidth, low-latency communication between the GPU (and its onboard memory) and the network adapter connecting the server to the fabric.

RDMA bisection is a key indicator that the network is ready for AI, and Spectrum-X excels in this category. Compared to traditional Ethernet, it delivers over 4x higher effective bandwidth and over 4x lower latency. The traditional Ethernet baseline in this comparison includes RDMA and optimizations such as congestion notification and flow control.

Figure 1. RDMA bisection cross-scalable units. NVIDIA Spectrum-X achieves 4.6x higher bandwidth and 4.5x lower latency than traditional Ethernet

AI collective performance

Beyond RDMA performance, NVIDIA also tested the performance of AI primitives from the NVIDIA Collective Communications Library (NCCL). AI workloads running across multiple systems leverage NCCL operations such as all-to-all and all-reduce to update model parameters across individual GPUs and ensure synchronization for scale-out training and inference.
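For readers less familiar with these collectives, the following is a minimal sketch of what all-reduce and all-to-all compute, written in plain Python rather than against the NCCL API. The function names and the list-of-lists representation of per-GPU buffers are assumptions made for illustration; real NCCL calls operate on device memory and overlap communication with compute.

```python
from typing import List


def all_reduce_sum(rank_buffers: List[List[float]]) -> List[List[float]]:
    """Illustrative all-reduce: every rank ends with the element-wise sum.

    rank_buffers[r] is the buffer held by rank r (for example, gradients).
    """
    total = [sum(values) for values in zip(*rank_buffers)]
    # Each rank receives its own copy of the reduced result.
    return [list(total) for _ in rank_buffers]


def all_to_all(rank_buffers: List[List[str]]) -> List[List[str]]:
    """Illustrative all-to-all: rank i sends its j-th chunk to rank j.

    After the exchange, rank j holds chunk j from every rank, in rank order.
    """
    n = len(rank_buffers)
    return [[rank_buffers[src][dst] for src in range(n)] for dst in range(n)]


if __name__ == "__main__":
    # Hypothetical per-rank gradients for three ranks.
    grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(all_reduce_sum(grads))

    # Hypothetical labeled chunks for two ranks.
    chunks = [["a0", "a1"], ["b0", "b1"]]
    print(all_to_all(chunks))
```

Both operations require every participant to exchange data with every other participant, which is why their completion time depends on the slowest path through the fabric.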

With Spectrum-X, NCCL operations showed significant gains over traditional Ethernet. They also demonstrated consistent and predictable performance in noisy AI cloud scenarios where multiple workloads were communicating across the network at the same time.

In fact, Spectrum-X demonstrated consistently high performance in both non-noisy and noisy scenarios. In comparison, traditional Ethernet showed performance variations of up to 20% from run to run.
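The article does not say how run-to-run variation was measured. One plausible definition, assumed here purely for illustration, is the gap between the slowest and fastest run relative to the fastest. The sketch below applies that definition to placeholder timings that are not measurements from Israel-1 or any other system.

```python
from typing import Sequence


def run_to_run_variation(run_times_s: Sequence[float]) -> float:
    """Return (slowest - fastest) / fastest for a set of repeated runs.

    This definition is an assumption; other metrics (for example, the
    coefficient of variation) are equally reasonable.
    """
    if not run_times_s:
        raise ValueError("need at least one run")
    fastest = min(run_times_s)
    slowest = max(run_times_s)
    if fastest <= 0:
        raise ValueError("run times must be positive")
    return (slowest - fastest) / fastest


if __name__ == "__main__":
    # Placeholder completion times in seconds, chosen only to show the math.
    placeholder_runs = [1.00, 1.10, 1.20]
    print(f"variation: {run_to_run_variation(placeholder_runs):.0%}")
```

Under this definition, a fabric with good noise isolation would report a value near zero whether or not other workloads share the network.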

Figure 2. AI cloud performance for NCCL all-to-all or all-reduce isolation. Spectrum-X provides noise isolation, ensuring almost identical performance to non-noisy scenarios

Large language model performance

While RDMA bisection and AI collective operations are important, the most important results are at the application level. Does Spectrum-X accelerate large language model (LLM) training workloads? In fact, it does. For both NVIDIA NeMo and FSDP Llama LLMs, Spectrum-X provides significant performance gains, shrinking step iteration times while providing faster time to train and faster time to insight.

Figure 3. AI cloud workload performance isolation. Spectrum-X accelerates iteration times for training the most common AI models

Network resiliency

Spectrum-X accelerates AI through network optimizations, but it’s also important to consider how resilient the network is. AI workloads are tightly coupled, and require high effective bandwidth to all nodes for optimal performance.

When network link or switch failures occur, AI training can suffer significantly. Network communication must be swiftly rerouted, or huge percentages of GPU infrastructure will sit idle, costing time and money and potentially requiring the job to restart from the previous checkpoint.

With Spectrum-X routing mechanisms, flows are diverted from downed links and efficiently allocated to healthy ones, resulting in minimal performance degradation. By contrast, traditional Ethernet is susceptible to significant and disproportionate slowdowns due to network issues, leading to inefficient use of GPU infrastructure.
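The idea of diverting flows from downed links to healthy ones can be illustrated with a toy model. The sketch below is not the Spectrum-X routing mechanism, which the article does not describe in detail. It is a simplified, assumed policy in which each flow displaced by a failure moves to the least-loaded healthy link. The link and flow names are hypothetical, and every flow is treated as having equal demand.

```python
from typing import Dict, Iterable, List


def reroute(assignment: Dict[str, str],
            links: List[str],
            failed: Iterable[str]) -> Dict[str, str]:
    """Move flows off failed links onto the least-loaded healthy links.

    assignment maps flow name -> link name; all links must appear in `links`.
    Flows on healthy links are left where they are.
    """
    failed_set = set(failed)
    healthy = [link for link in links if link not in failed_set]
    if not healthy:
        raise RuntimeError("no healthy links left")

    # Keep unaffected flows and count the load they place on each link.
    new_assignment = {f: l for f, l in assignment.items() if l not in failed_set}
    loads = {link: 0 for link in healthy}
    for link in new_assignment.values():
        loads[link] += 1

    # Place each displaced flow on the currently least-loaded healthy link.
    displaced = sorted(f for f, l in assignment.items() if l in failed_set)
    for flow in displaced:
        target = min(healthy, key=lambda link: (loads[link], link))
        new_assignment[flow] = target
        loads[target] += 1
    return new_assignment


if __name__ == "__main__":
    links = ["up0", "up1", "up2", "up3"]          # hypothetical uplinks
    flows = {f"f{i}": links[i % 4] for i in range(8)}
    after = reroute(flows, links, failed=["up0"])
    for flow in sorted(after):
        print(flow, "->", after[flow])
```

In this example the two flows that were on the failed uplink land on different healthy uplinks, so no single link absorbs the whole displaced load.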

Figure 4. Resilient adaptive routing performance. Spectrum-X uses re-routing to rebalance NCCL flows and avoid failed paths

Summary

As shown in these initial benchmarks, Spectrum-X represents a groundbreaking approach to constructing multi-tenant, hyperscale AI clouds using Ethernet. This solution enables organizations to enhance the performance and energy efficiency of AI clouds while achieving greater predictability and consistency. This, in turn, leads to accelerated time to market (TTM) and a strengthened competitive advantage.

Learn more

Want to learn more? Join us in person or virtually for NVIDIA GTC 2024 to experience the suite of NVIDIA networking platforms firsthand. Connect with industry luminaries, developers, researchers, and business strategists helping shape what’s next in AI and accelerated computing. The AI conference will feature exciting announcements, demonstrations, and educational sessions about NVIDIA networking advancements.

Check out these recommended networking sessions:

Best Practices in Networking for AI: Perspectives from Cloud Service Providers – Panel [S62447]
Getting the Storage Right for AI Applications – Panel [S62476]
Entering A New Frontier of Innovation with InfiniBand [S62293]
Enabling Enterprise Generative AI with Optimized Ethernet AI Networking [S62521]
Accelerating HPC and AI Applications with Offloading to BlueField DPUs: Strategies and Benefits [S61956]
Connect with the Experts: Choosing the Right Network for the Era of AI: The Network Defines the Data Center [CWE61202]

Source: NVIDIA