AWS upgrades its 10p10u network to handle massive AI clusters

GIXnews

3 weeks ago

While virtualized compute is the foundation of cloud computing, enabling all that compute to transmit data is the job of the network. So how does the world’s largest cloud scale its network to meet the increased demands of AI?

At the AWS re:Invent conference this week, the cloud giant detailed a series of enhancements to its network infrastructure aimed at meeting those demands. The changes, which include new interconnect technologies, routing protocols and network cabling improvements, are designed to support the requirements of modern AI workloads.

These networking breakthroughs come at a crucial time, as AWS prepares to deploy Project Rainier, a massive AI training cluster containing hundreds of thousands of new Trainium2 chips that will power the next generation of Anthropic’s Claude AI models. The network bandwidth requirements of AI workloads are vastly larger than typical cloud workloads.

“A great AI network shares a lot in common with a great cloud network, although everything gets ratcheted up massively,” said Peter DeSantis, senior vice president of AWS Utility Computing, during his re:Invent keynote. “If this were a Vegas fight, it wouldn’t even be a close fight.”

AWS 10-petabyte network explained

At the heart of AWS’s networking updates is its 10p10u network fabric.

“We call it 10p10u because it enables us to provide ten petabytes of network capacity to thousands of servers with under ten microseconds of latency,” DeSantis explained.

The demands on AI networks are particularly intense. DeSantis noted that during training, every server needs to talk to every other server at exactly the same time. The 10p10u network fabric is being specifically deployed in support of AWS’s UltraServer compute technology, which is being built out to run massive AI training workloads. Each Trainium2 UltraServer has almost 13 terabits per second (Tbps) of network bandwidth, requiring a massive network fabric to prevent bottlenecks.

“The 10p10u network is massively parallel, densely interconnected, and [the] 10p10u network is elastic,” DeSantis explained. “We can scale it down to just a few racks, or we can scale it up to clusters that span several physical data center campuses.”
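To see why "every server talks to every other server" is so demanding, it helps to count the flows involved. The following is a rough back-of-the-envelope sketch, not a description of how AWS sizes its fabric; the function name and the cluster sizes are illustrative assumptions.

```python
def all_to_all_flows(num_servers: int) -> int:
    """Count directed server-to-server flows when every server
    sends to every other server at the same time."""
    if num_servers < 0:
        raise ValueError("num_servers must be non-negative")
    return num_servers * (num_servers - 1)


# Hypothetical cluster sizes, chosen only to show the quadratic growth.
for n in (8, 64, 1024):
    print(f"{n} servers -> {all_to_all_flows(n)} simultaneous flows")
```

The number of flows grows with the square of the server count, which is one reason a fabric built for this pattern needs to be densely interconnected.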

How the 10p10u network increases optical networking density

Patch panels are a common sight in many data center networks, with a stream of cables connecting into a panel. With the complexity of the 10p10u network, AWS found that its existing patch panel approach wasn’t going to be enough. So it created something new. AWS developed a proprietary trunk connector that combines 16 separate fiber optic cables into a single connector.

“What makes this game changing is that all that complex assembly work happens at the factory, not on the data center floor, and this dramatically streamlines the installation process and virtually eliminates the risk of connection errors,” DeSantis said. “Now, while this might sound modest, its impact was significant. Using trunk connectors speeds up our install time on AI racks by 54%, not to mention making things look way neater.”

AWS also developed the Firefly optical plug, which further helps to improve the 10p10u network. The Firefly optical plug acts as a miniature signal reflector that allows AWS to test and verify network connections before the rack arrives on the data center floor. “That means we don’t waste any time [debugging cabling] when our servers arrive. And that matters, because in the world of AI clusters, time is literally money,” DeSantis said.

The Firefly optical plugs also act as a protective seal, which prevents dust particles from entering the optical connections. “This might sound minor, but even tiny dust particles can significantly degrade the integrity and create network performance problems,” DeSantis said.

Scalable Intent Driven Routing (SIDR) protocol manages the 10p10u network fabric

At the scale of the 10p10u network, routing becomes complex. To manage the fabric, AWS developed Scalable Intent Driven Routing (SIDR), a new protocol that combines centralized planning with decentralized execution. The protocol enables the network to respond to failures in under one second, which is ten times faster than alternative approaches, according to AWS.

“An easy way to think about SIDR is you get the central planner doing the work to distill the network into a structure that can be pushed down to all the switches in the network, so that they can make quick, autonomous decisions when they see failure,” DeSantis explained.
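AWS has not published SIDR's internals, so the following is only a toy sketch of the general pattern DeSantis describes: a central planner precomputes an ordered list of next hops, and each switch fails over locally without asking the planner. Every name here (`plan_routes`, `Switch`, the spine and rack labels) is hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field


def plan_routes(uplinks_by_dest: dict) -> dict:
    """Central planner (toy): for each destination, produce an ordered
    list of candidate next hops, primary first. Sorting by name stands
    in for whatever real ranking a planner would apply."""
    return {dest: sorted(hops) for dest, hops in uplinks_by_dest.items()}


@dataclass
class Switch:
    """A switch that holds the pushed-down plan and decides locally."""
    name: str
    plan: dict  # destination -> ordered list of next hops
    failed_links: set = field(default_factory=set)

    def link_down(self, next_hop: str) -> None:
        # Local failure detection: no round trip to the planner.
        self.failed_links.add(next_hop)

    def next_hop(self, dest: str):
        # Pick the first precomputed hop that is still healthy.
        for hop in self.plan.get(dest, []):
            if hop not in self.failed_links:
                return hop
        return None  # no healthy path left in the plan


plan = plan_routes({"rack-b": ["spine-2", "spine-1", "spine-3"]})
leaf = Switch("leaf-a", plan)
before = leaf.next_hop("rack-b")   # primary path
leaf.link_down("spine-1")          # the switch observes a failure
after = leaf.next_hop("rack-b")    # local failover to the next planned hop
print(before, "->", after)
```

The point of the split is that the expensive reasoning happens once, centrally, while the time-critical reaction to a failure is a local lookup.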

NeuronLink: high-speed chip-to-chip communication

The 10p10u network is all about creating the high-throughput, low-latency fabric for connecting servers. With NeuronLink, AWS is going deeper, improving the interconnect between servers, specifically for its Trainium2-based AI infrastructure.

NeuronLink is a proprietary interconnect technology that enables multiple Trainium2 servers to function as a single logical server. NeuronLink provides two terabytes per second of bandwidth between servers with just one microsecond of latency. The UltraServers combine 64 Trainium2 chips to provide what DeSantis said is five times more compute capacity than any current EC2 AI server and ten times more memory.

“Unlike a traditional high-speed networking protocol, NeuronLink servers can directly access each other’s memory, enabling us to create something special, something we call an UltraServer,” DeSantis said.
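The bandwidth and latency figures quoted for NeuronLink can be combined into a rough estimate of how long a transfer between servers might take. This is a simplified latency-plus-bandwidth model, not an AWS formula; it assumes decimal terabytes (10^12 bytes), ignores protocol overhead and contention, and the constant and function names are made up for the example.

```python
# Figures taken from the article; the model itself is an assumption.
NEURONLINK_BANDWIDTH_BYTES_PER_S = 2e12  # "two terabytes per second"
NEURONLINK_LATENCY_S = 1e-6              # "one microsecond of latency"


def transfer_time_s(
    num_bytes: float,
    bandwidth: float = NEURONLINK_BANDWIDTH_BYTES_PER_S,
    latency: float = NEURONLINK_LATENCY_S,
) -> float:
    """Estimate transfer time as fixed latency plus size / bandwidth."""
    return latency + num_bytes / bandwidth


# Hypothetical message sizes: a small 4 KiB message and a 1 GB tensor.
for label, size in (("4 KiB", 4096), ("1 GB", 1e9)):
    print(f"{label}: {transfer_time_s(size) * 1e6:.3f} microseconds")
```

Under this model, small messages are dominated by the one-microsecond latency, while large transfers are dominated by bandwidth, which is why both figures matter for training traffic.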

Source: Network World