Everyone but Nvidia joins forces for new AI interconnect

GIXnews

7 months ago

A clear sign of Nvidia’s dominance is when Intel and AMD link arms to deliver a competing product. That’s what happened this week when AMD and Intel – along with Broadcom, Cisco, Google, Hewlett Packard Enterprise, Meta and Microsoft – formed the Ultra Accelerator Link (UALink) Promoter Group to develop high-speed interconnections between AI processors.

Nvidia has NVLink, which allows its processors to talk to each other and share data at extremely high speeds. With the UALink high-speed accelerator interconnect technology, Nvidia’s competitors are teaming up to make the same thing for their own chips. Microsoft, Meta, and Google may seem unlikely partners, but they are all making custom processors for their cloud services.

The first step for UALink is to define and establish an open industry standard that will enable AI accelerators to communicate more effectively, the group says. UALink supporters believe an interconnect based on open standards will give system OEMs, IT professionals and system integrators a pathway to easier integration, greater flexibility and scalability in their AI-connected data centers.

AI and high-performance computing (HPC) require a considerable amount of data to be moved around between cores and memory. Nvidia tapped high-speed networking technology it gained in its $6.9 billion acquisition of Mellanox in 2019 to build its NVLink high-speed interconnects.

“When we look at the needs of AI systems across data centers, one of the things that’s very, very clear is the AI models continue to grow massively,” said Forrest Norrod, executive vice president and general manager of the data center solutions group at AMD, during a conference call with the media.

“…this means that for the most advanced models, many accelerators need to work together in concert for either inference or training. And being able to scale those accelerators is going to be critically important for driving the efficiency and the performance and the economics of large-scale systems going out into the future.”

The UALink group plans to develop a specification to define a high-speed, low-latency interconnect for scale-up communications between accelerators and switches in AI computing pods. The 1.0 specification will enable the connection of up to 1,024 accelerators within an AI computing pod and allow for direct loads and stores between the memory attached to accelerators, such as GPUs, in the pod, according to the group.

Norrod pointed out that the UALink members are also backers of the Ultra Ethernet Consortium, which was formed to develop technologies aimed at increasing the scale, stability, and reliability of Ethernet networks to satisfy AI’s high-performance networking requirements. The UEC was founded last year by AMD, Arista, Broadcom, Cisco, Eviden, HPE, Intel, Meta and Microsoft, and it now includes more than 50 vendors. Later this year, it plans to release official specifications that will focus on a variety of scalable Ethernet improvements, including better multi-path and packet delivery options as well as modern congestion and telemetry features.

“And so by coming together, we believe that this promoters group is filling in an important element of future … scaled out AI systems architectures with this pod-level interconnect. And in concert with Ultra Ethernet, [it] will enable systems of hundreds of thousands or millions of accelerators to efficiently work together,” Norrod said.
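To put those numbers in perspective, here is a rough back-of-the-envelope sketch in Python of how many pods a system of that size would need at the 1,024-accelerator pod limit cited for the 1.0 specification. The function name `pods_needed` is hypothetical, and the calculation assumes every pod is filled to the limit, which real deployments need not do.

```python
# Illustrative arithmetic only; not part of any UALink or UEC specification.
POD_LIMIT = 1024  # maximum accelerators per pod cited for the UALink 1.0 spec


def pods_needed(total_accelerators: int, pod_size: int = POD_LIMIT) -> int:
    """Return the minimum number of pods needed to hold the accelerators.

    Assumes pods are packed as full as possible (a simplifying assumption).
    """
    if total_accelerators < 0 or pod_size <= 0:
        raise ValueError("counts must be non-negative and pod_size positive")
    # Integer ceiling division: round up so a partial pod still counts.
    return (total_accelerators + pod_size - 1) // pod_size


if __name__ == "__main__":
    for total in (100_000, 1_000_000):
        print(f"{total:,} accelerators -> at least {pods_needed(total):,} pods")
```

Under these assumptions, 100,000 accelerators would span at least 98 pods and 1 million would span at least 977, which is why the scale-up link inside a pod has to be paired with a scale-out network between pods.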

J Metz, chair of the Ultra Ethernet Consortium, touted opportunities for collaboration among UALink and UEC backers in a statement announcing the new group’s formation: “In a very short period of time, the technology industry has embraced challenges that AI and HPC have uncovered. Interconnecting accelerators like GPUs requires a holistic perspective when seeking to improve efficiencies and performance. At UEC, we believe that UALink’s scale-up approach to solving pod cluster issues complements our own scale-out protocol, and we are looking forward to collaborating together on creating an open, ecosystem-friendly, industry-wide solution that addresses both kinds of needs in the future.”

The UALink Promoter Group expects the 1.0 specification to be available in the third quarter of this year, when it will be offered to companies that join the Ultra Accelerator Link (UALink) Consortium. Products could appear next year, with implementation potentially around 2026.

Source: Network World