Enfabrica looks to accelerate GPU communication

Networking startup Enfabrica is making the rounds at trade shows to demonstrate its new networking products, which are designed to handle the heavy data throughput that AI workloads require.

Enfabrica’s Accelerated Compute Fabric SuperNIC (ACF-S) silicon is designed to deliver higher bandwidth, greater resiliency, lower latency and greater programmatic control to data center operators running data-intensive AI and HPC workloads.

The company came out of stealth mode last year, announcing a $125 million funding round led by Atreides Management with support from Nvidia – which is also in the smartNIC business with its BlueField line – as well as several venture firms.

Shrijeet Mukherjee, who previously headed up networking platforms and architecture at Google, started the company in 2020 with CEO Rochan Sankar, previously a director of engineering at Broadcom. The two zeroed in on what they say is a core problem with networking hardware: it is built on 20-year-old designs that work well for CPU-centric traffic but are inadequate for GPU networking.

“If you look at what happened with data center networking, it sort of evolved into this kind of design where you had traffic that was coming in from one direction, and what you wanted is to be able to share it and distribute it to a whole bunch of nodes. But AI and ML systems break the mold a little bit,” said Mukherjee, chief development officer.

Enfabrica contends that traditional data center environments suffer from server networking component sprawl and stovepipe connections that limit bandwidth and fault tolerance. In an AI environment, data movement across GPUs requires multiple hops, is prone to congestion, and results in unpredictable load distribution; the failure of a single GPU link can stall an entire job.

“The design of today’s supercomputers is not very fault tolerant, and they have to really go through a lot of effort to handle failures correctly,” Mukherjee said.

Enfabrica brings fault tolerance to the networking design. Rather than a single point-to-point connection, there are multiple paths from any point to any other, so the load can be distributed across them. When a link fails, the system redistributes the load over the remaining links, as the sketch below illustrates.
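
To make the idea concrete, here is a minimal Python sketch of multipath load sharing with failover. It is purely illustrative and assumes nothing about Enfabrica's actual implementation; the class name, link names, and bandwidth figures are all hypothetical.

```python
# Illustrative sketch only -- not Enfabrica's implementation. It shows the
# general behavior described above: traffic is spread across multiple links,
# and when one link fails, the load is redistributed over the survivors
# instead of the whole job stalling.

from dataclasses import dataclass, field

@dataclass
class MultipathFabric:
    links: set[str] = field(default_factory=set)  # currently healthy links

    def share(self, total_gbps: float) -> dict[str, float]:
        """Split the offered load evenly across all healthy links."""
        if not self.links:
            raise RuntimeError("no healthy links remain")
        per_link = total_gbps / len(self.links)
        return {link: per_link for link in sorted(self.links)}

    def fail(self, link: str) -> None:
        """Mark a link as failed; later shares use only the remaining links."""
        self.links.discard(link)

fabric = MultipathFabric({"link0", "link1", "link2", "link3"})
print(fabric.share(400.0))  # 100 Gbps on each of four links
fabric.fail("link2")
print(fabric.share(400.0))  # ~133 Gbps on each of the three survivors
```

The real silicon would do this at line rate in hardware; the point of the sketch is only the degrade-gracefully behavior, where losing a link shrinks per-link headroom rather than stalling the job.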

“If you look at data centers today, it’s built around this model that a two-socket system is your working set. If things fit in that two-socket server, life is great. The moment it’s outside [those boundaries], it’s not that efficient,” said Mukherjee.

“We finally concluded that the architecture itself needs to change, and the way you solve that problem needs to be addressed,” Mukherjee said. “We said it has to be a silicon company. It has to be something that builds around this idea of what the modern system needs to look like and enables that in a fast and complete way.”

ACF-S delivers multi-terabit switching and bridging between heterogeneous compute and memory resources on a single silicon die without changing physical interfaces, protocols or software layers above device drivers. It reduces the number of devices, I/O latency hops, and power that today’s AI clusters consume in top-of-rack network switches, RDMA-over-Ethernet NICs, InfiniBand HCAs, PCIe/CXL switches, and CPU-attached DRAM.

Its CXL memory bridging delivers headless memory scaling to any accelerator, giving a single GPU rack direct, low-latency, uncontended access to local CXL.mem DDR5 DRAM with more than 50 times the capacity of the high-bandwidth memory (HBM) native to the GPUs.
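
To put that multiple in rough perspective, here is a back-of-the-envelope calculation. The 80 GB HBM figure is an assumption typical of current high-end GPUs, not a number from Enfabrica; only the 50x multiple comes from the company’s claim.

```python
# Back-of-the-envelope only. hbm_per_gpu_gb is an assumed figure for a
# current high-end GPU; the 50x multiple is Enfabrica's stated claim.
hbm_per_gpu_gb = 80          # assumption: typical HBM capacity per GPU
cxl_multiple = 50            # Enfabrica's claimed capacity advantage
cxl_capacity_tb = hbm_per_gpu_gb * cxl_multiple / 1000
print(f"~{cxl_capacity_tb:.0f} TB of CXL.mem DDR5 reachable per GPU")  # ~4 TB
```

Under that assumption, a single GPU would see roughly 4 TB of CXL-attached DDR5 rather than 80 GB of on-package HBM.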

Enfabrica displayed its technology at a number of recent tech conferences, including Hot Chips, AI Summit, AI Hardware & Edge AI Summit, and Gestalt IT AI Tech Field Day. Next up is SC24 (Supercomputing 2024), being held Nov. 17-22 in Atlanta.

Enfabrica has not said when it will ship its products.

Source: Network World