Independent IT lab The Tolly Group compared the cloud, AI, and storage performance of an NVIDIA Ethernet switch to the performance of a comparable switch built with commodity silicon.
Does the switch matter?
The network fabric is key to the performance of modern data centers. There are many requirements for data center switches, but the most basic is to provide equal amounts of bandwidth to all clients so that resources are shared evenly. Without fair networking, all workloads experience unpredictable performance due to throughput deterioration, delay, slow distributed workloads, and so on.
To answer that question, The Tolly Group benchmarked the cloud, AI, and storage workload performance of the NVIDIA Spectrum-3 12.8 Tbps switch. It then compared the results, apples to apples, with the performance of a typical (commodity) 12.8 Tbps data center switch.
The Tolly Group
The Tolly Group, a third-party, independent IT industry lab, has been conducting performance tests and hands-on evaluations of IT products for more than 30 years. It is positioned to provide evidence that products meet or exceed marketing claims, and it does not produce reports that conflict with The Tolly Group’s Fair Testing Charter. This proof of performance lets customers know they can deploy with confidence.
Distributed workload performance (AI and Spark)
Every switch has a buffer to prevent packet loss. The buffer also protects application performance by absorbing packet bursts whenever more traffic arrives at the switch than can be sent out of it. This situation is sometimes referred to as an incast traffic pattern. Distributed workloads like AI and Spark are, by their nature, prone to incast traffic patterns.
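To make the incast idea concrete, here is a minimal sketch of how much data a buffer has to hold during a burst. It is a simplified model, not the methodology used in the Tolly report: it assumes every sender bursts at the same time at line rate toward a single egress port of the same speed, and the function names and numbers are hypothetical.

```python
def peak_queue_bytes(senders: int, burst_bytes: int) -> int:
    """Peak queue depth for a simple incast model.

    Assumption: `senders` ingress ports each send one burst of
    `burst_bytes` simultaneously, at line rate, to one egress port of
    the same speed. While the bursts arrive, the egress port drains one
    burst's worth of data, so the rest has to wait in the buffer.
    """
    if senders < 1 or burst_bytes < 0:
        raise ValueError("need at least one sender and a non-negative burst")
    return (senders - 1) * burst_bytes


def absorbed_without_loss(senders: int, burst_bytes: int,
                          usable_buffer_bytes: int) -> bool:
    """True if the modeled burst fits in the buffer the port can actually use."""
    return peak_queue_bytes(senders, burst_bytes) <= usable_buffer_bytes


if __name__ == "__main__":
    # Hypothetical numbers, not taken from the Tolly report.
    print(peak_queue_bytes(senders=8, burst_bytes=1_000_000))  # 7000000
```

In this model, what matters is the buffer a congested port can actually use, not only the total printed on the datasheet.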
Both switches claimed identical buffer sizes on their datasheets. However, The Tolly Group found that NVIDIA Spectrum-3 was able to absorb 4-8x more packets than the typical data center switch. In other words, it could take up to eight commodity switches to match the packet absorption capability of a single Spectrum-3 switch.
Maximum absorption capability is important but not enough. It is crucial that the switch evenly absorbs the microburst from all senders, because slowing down one node slows down the entire cluster.
The Tolly Group found that Spectrum-3 evenly absorbed microburst traffic from all senders in all cases while the commodity switch slowed down multiple nodes, resulting in under-utilized compute resources.
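The claim that one slow node slows down the entire cluster can be illustrated with a small sketch. It assumes a synchronous distributed job in which every node must finish a step before the next step starts. The per-node times are invented for illustration and are not Tolly measurements.

```python
def step_time(node_times_s):
    """A synchronous step ends only when the slowest node has finished."""
    return max(node_times_s)


def cluster_utilization(node_times_s):
    """Fraction of total node-time spent working rather than waiting."""
    slowest = max(node_times_s)
    return sum(node_times_s) / (len(node_times_s) * slowest)


if __name__ == "__main__":
    # Hypothetical per-node times in seconds.
    even = [1.0, 1.0, 1.0, 1.0]    # bursts absorbed evenly
    uneven = [1.0, 1.0, 1.0, 4.0]  # one node's traffic is delayed
    print(step_time(even), cluster_utilization(even))      # 1.0 1.0
    print(step_time(uneven), cluster_utilization(uneven))  # 4.0 0.4375
```

In the uneven case, three of the four nodes sit idle for most of the step, which is the under-utilization described above.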
Public and private cloud performance
The noisy neighbor problem crops up in public and private cloud environments, where multiple tenants use a shared resource, like CPU cycles or network bandwidth, and a “noisy neighbor” tenant shows up and hogs those resources.
As a result, one tenant can degrade the experience of another when the switch cannot adequately isolate them. A data center switch must protect tenants from the activities of other tenants, whether those are nefarious attacks or noisy neighbors.
The Tolly Group found that the Spectrum-3 switch fully protected each tenant. The competitor switch failed to protect tenants, allowing the noisy neighbor's traffic pattern to victimize some tenants and fully starve them of bandwidth.
When scaling out multitenant environments, Spectrum-3 continued to protect each tenant. With the commodity switch, however, the scale of the noisy neighbor problem was far larger and can extend to half the total number of switch ports. In other words, up to 64 ports can be victimized and starved.
If a switch cannot protect tenants from a noisy neighbor, it does not meet a basic requirement of a cloud fabric switch.
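One common way to describe what protecting tenants means is max-min fair sharing. Under it, no tenant gets more than it asks for, and heavy tenants split whatever is left equally. The sketch below is a textbook water-filling calculation. It is offered as an illustration only and is not a description of how Spectrum-3 or any other switch is implemented. The tenant names and numbers are hypothetical.

```python
def max_min_fair_share(capacity, demands):
    """Max-min fair allocation of `capacity` among tenants.

    `demands` maps tenant name -> requested bandwidth, in the same units
    as `capacity`. Returns a dict mapping tenant name -> allocation.
    """
    alloc = {tenant: 0.0 for tenant in demands}
    remaining = dict(demands)
    cap = float(capacity)
    while remaining and cap > 1e-12:
        share = cap / len(remaining)
        satisfied = {t: d for t, d in remaining.items() if d <= share}
        if not satisfied:
            # Everyone left wants more than an equal share: split evenly.
            for tenant in remaining:
                alloc[tenant] = share
            break
        for tenant, demand in satisfied.items():
            # Small demands are met in full; leftovers go back to the pool.
            alloc[tenant] = float(demand)
            cap -= demand
            del remaining[tenant]
    return alloc


if __name__ == "__main__":
    # Hypothetical demands in Gbps on a 100 Gbps link.
    print(max_min_fair_share(100, {"noisy": 1000, "a": 20, "b": 30}))
    # {'noisy': 50.0, 'a': 20.0, 'b': 30.0}
```

In this example, the noisy tenant is limited to the bandwidth the others leave unused, and tenants "a" and "b" still get everything they asked for.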
Figure 2. Noisy neighbor isolation. With Spectrum-3, there is no effect from noisy neighbor traffic patterns. With the commodity switch, victim tenants are starved for bandwidth.
Storage performance
Today, most storage traffic in the data center runs on Ethernet. More specifically, storage typically uses 9-KB jumbo frames. As a result, this packet size has become more important than ever, and nearly every switch now supports a default packet size of 9 KB.
However, just because a typical data center switch supports 9-KB packets doesn’t mean it is optimized for storage workloads. To measure and compare the storage performance of each switch, The Tolly Group used 9-KB packets with standard network test tools from IXIA.
The Tolly Group found that Spectrum-3 provided predictable and fair performance across all storage nodes in all cases. The commodity switch showed unfair traffic sharing with 9-KB packets, forcing one storage node to run 17x slower than the other storage nodes. Such unpredictable results severely affect storage performance.
This has real-world implications. Think about the time it takes to run a storage backup. What if your planned and expected 2-hour backup time starts taking 34 hours to complete?
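The 34-hour figure follows directly from the 17x slowdown. The short sketch below shows the arithmetic, under the simplifying assumption that the whole backup is gated by the slowed storage node. The function name is illustrative.

```python
def slowed_duration_hours(planned_hours, slowdown_factor):
    """Duration of a job once its bottleneck node runs `slowdown_factor` times slower."""
    return planned_hours * slowdown_factor


if __name__ == "__main__":
    # 2-hour planned backup, one node running 17x slower.
    print(slowed_duration_hours(2, 17))  # 34
```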
Mixed application performance
Most data centers run many different applications, each with its own packet sizes. Even a single application uses a variety of packet sizes. Add in control traffic, and you will probably encounter an even greater variety of packet sizes on your fabric.
The Tolly Group found that Spectrum-3 provided fairness regardless of packet size, while the commodity switch tended to starve applications that used smaller packet sizes. Even worse, the starvation of the smaller packets grew as the difference in packet sizes increased.
For the commodity switch, mixed packet size starvation adversely affects the cloud, storage, and distributed workloads.
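Fairness of this kind can be expressed as a single number. One widely used metric is Jain's fairness index, which is 1.0 when every flow gets the same throughput and falls toward 1/n when a single flow takes everything. The sketch below is a generic illustration. It is not necessarily the metric The Tolly Group used, and the throughput values are invented.

```python
def jain_fairness_index(throughputs):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2)."""
    if not throughputs:
        raise ValueError("need at least one flow")
    total = sum(throughputs)
    sum_of_squares = sum(x * x for x in throughputs)
    if sum_of_squares == 0:
        return 1.0  # every flow got the same (zero) throughput
    return (total * total) / (len(throughputs) * sum_of_squares)


if __name__ == "__main__":
    # Hypothetical per-flow throughputs in Gbps.
    print(jain_fairness_index([10, 10, 10, 10]))  # 1.0
    print(jain_fairness_index([40, 0, 0, 0]))     # 0.25
```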
Architecture. Simple as that.
The Spectrum switches have a modern fully shared buffer architecture and flexible pipeline architecture that were designed to optimize data center application performance and security. For more information about the results, download the new Tolly Group Performance Evaluation report. It explains the architecture of Spectrum switches and commodity switches, along with their advantages and disadvantages.
Architecture is indeed a zero-sum game. However, unlike many other vendors, NVIDIA develops both the ASIC and the switch. As a result, we have managed to eliminate tradeoffs and provide the superior results that The Tolly Group has verified.
The switch matters, and it can make a huge difference, either boosting your workloads or adversely affecting them. For more information, join the Tolly Report webinar, download the Tolly Group Performance Evaluation report, or see The Tolly Group website.