Meta taps Arista for Ethernet-based AI clusters

Arista Networks is partnering with Meta Platforms to deploy its Ethernet technology in AI clusters designed to handle large language models, high-bandwidth systems, and cloud communications.

Meta Platforms, formerly known as Facebook, operates many large data centers and carries massive volumes of data traffic over high-bandwidth network connections worldwide. Meta will deploy Arista’s 7700R4 Distributed Etherlink Switch in its Disaggregated Scheduled Fabric (DSF), a multi-tier network that supports around 100,000 DPUs, according to reports.

Arista said the 7700R4 DES was developed with input from Meta. Based on its previous experience with the Arista 7800R3, Meta knew the benefits of the R-Series architecture for AI workloads but wanted a much larger-scale solution that offered the same benefits and a smooth path to 800G, according to Martin Hull, vice president of cloud and platform product management at Arista, who announced the news in a blog post. The Arista 7800R3 is a modular data center switch whose line cards support up to 48 100GbE ports. The vendor’s R-Series switches feature a range of high-density, low-latency networking components.

“The 7700R4 behaves like a single system, with dedicated deep buffers to ensure system-wide lossless transport across the entire Ethernet-based AI network,” Hull wrote. “DES is topology agnostic, [Ultra Ethernet Consortium (UEC)] ready, optimized for both training and inference workloads, with a 100% efficient architecture, and offers the rich telemetry and smart features that the modern AI Center needs.”

The UEC was founded last year by AMD, Arista, Broadcom, Cisco, Eviden, HPE, Intel, Meta and Microsoft, and it now includes more than 75 vendors. The consortium is developing technologies aimed at increasing the scale, stability, and reliability of Ethernet networks to satisfy AI’s high-performance networking requirements. UEC specifications will define a variety of scalable Ethernet improvements, including better multi-path and packet delivery options as well as modern congestion and telemetry features.

“Network performance and availability play an important role in extracting the best performance out of our AI training clusters. It’s for that reason that we’ve continued to push for disaggregation in the backend network fabrics for our AI clusters,” according to a Meta blog.

“Over the past year we have developed a Disaggregated Scheduled Fabric (DSF) for our next-generation AI clusters to help us develop open, vendor-agnostic systems with interchangeable building blocks from vendors across the industry. DSF-based fabrics allow us to build large, non-blocking fabrics to support high-bandwidth AI clusters,” Meta stated.

The DSF-based fabrics will also include Meta’s own fabric network switch, the MiniPack 3, as well as the Cisco 8501 Series, both of which are backward compatible with previous 200G and 400G switches and will support upgrades to 400G and 800G, Meta stated.

“The Minipack3 utilizes Broadcom’s latest Tomahawk5 ASIC, while the Cisco 8501 is based on Cisco’s Silicon One G200 ASIC. These high-performance switches transmit up to 51.2 Tbps with 64x OSFP ports, and the design is optimized without the need of retimers to achieve maximum power efficiency. They also have significantly reduced power per bit compared with predecessor models,” Meta stated.
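As a quick sanity check on those figures, the 51.2 Tbps aggregate quoted for both switches works out to 800 Gbps per port across the 64 OSFP ports. A minimal arithmetic sketch (the variable names are illustrative, not from either vendor's documentation):

```python
# Illustrative check: 64 OSFP ports at 800 Gbps each
# should match the quoted 51.2 Tbps aggregate capacity.
ports = 64
gbps_per_port = 800  # 800G per OSFP port

total_tbps = ports * gbps_per_port / 1000  # convert Gbps -> Tbps
print(total_tbps)  # 51.2
```

This also shows why Meta pairs these platforms with a path to 800G: the full 51.2 Tbps is only realized when every port runs at 800 Gbps.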

Read more about Ethernet-based AI networking

  • Industry groups drive Ethernet upgrades for AI, HPC: There’s pressure to increase the scale, stability, and reliability of Ethernet, and, among other outcomes, that pressure has led to wider interest in the Ultra Ethernet Consortium.
  • Cisco pumps up data center networking with AI, large workloads in mind: Cisco is scaling up its Nexus 9000 data center switches and Series 8000 routers for AI-ready networking.
  • Everyone but Nvidia joins forces for new AI interconnect: Hyperscalers and chip makers, including AMD, Broadcom, Cisco, Google, HPE, Intel and Microsoft, are partnering to develop a high-speed chip interconnect to rival Nvidia’s NVLink technology.
  • Arista lays out AI networking plans: Arista’s Etherlink technology will be supported across a range of products, including 800G systems and line cards, and will be compatible with specifications from the Ultra Ethernet Consortium.
  • Cisco shapes its strategy for Ethernet-based AI networks: Future-proofing Ethernet for AI is a priority for Cisco, which is positioning its Nexus data center switches as core elements of AI networking infrastructure.
  • Cisco, Arista, HPE, Intel lead consortium to supersize Ethernet for AI infrastructures: Backed by the Linux Foundation, the new Ultra Ethernet Consortium aims to increase the scale, stability, and reliability of Ethernet networks to satisfy AI’s high performance networking requirements.

Source: Network World