DriveNets extends AI networking fabric with multi-site capabilities for distributed GPU clusters

AI is becoming a critical driver of demand for high-performance networks, but it brings more than its fair share of challenges: connecting GPUs together isn't quite the same as connecting CPUs.

Among the many networking vendors that are trying to help solve the challenges of AI networking is DriveNets. The company, which got its start back in 2015, has steadily gained traction over the last decade for its networking fabric and now has big providers including AT&T, Comcast, Telefonica and Orange among its customers.

In addition to service provider routing technology, DriveNets has also built out an Ethernet-based AI networking fabric it calls Network Cloud-AI. That fabric is targeted at hyperscalers, neoclouds, and enterprises building GPU clusters.

This week, DriveNets announced significant enhancements to its Network Cloud-AI solution, introducing multi-tenancy and multi-site capabilities that enable GPU clusters to span geographic locations up to 80 kilometers apart—addressing power constraints that increasingly limit AI deployments.

“Because of the electricity concerns and ability to support the requirements of large data centers, we are talking to quite a few customers about building one cluster that is distributed across multiple locations,” explained Inbar Lasser-Raab, chief marketing officer at DriveNets. “We have a distributed architecture for switches over white boxes that allows us to balance between deep buffer and shallow buffers and be able to really support the high performance of the overall cluster, even if it’s distributed on multiple locations.”

Cell-based fabric architecture drives performance

At the core of DriveNets’ approach is a networking architecture fundamentally different from traditional data center networks. Rather than using standard Clos Ethernet architectures, DriveNets employs a distributed fabric with a cell-based protocol.

“We use the same physical architecture as anyone with top of rack and then leaf and spine switch,” Dudy Cohen, vice president of product marketing at DriveNets, told Network World. “But what happens between our top of rack, which is the switch that connects NICs (network interface cards) into the servers and the rest of the network is not based on Clos Ethernet architecture, rather on a very specific cell-based protocol. [It’s] the same protocol, by the way, that is used in the backplane of the chassis.”

Cohen explained that any data packet that comes into an ingress switch from the NIC is cut into evenly sized cells, sprayed across the entire fabric and then reassembled on the other side. This approach distinguishes DriveNets from other solutions that might require specialized components such as Nvidia BlueField DPUs (data processing units) at the endpoints.

“The fabric links between the top of rack and the spine are perfectly load balanced,” he said. “We do not use any hashing mechanism… and this is why we can contain all the congestion avoidance within the fabric and do not need any external assistance.”
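To make the mechanism Cohen describes concrete, here is a minimal, illustrative Python sketch of spray-and-reassemble behavior. The cell size and link count are placeholder assumptions, not DriveNets specifics; the point is that round-robin cell spraying spreads every flow evenly across all fabric links without any hash on packet headers.

```python
# Conceptual sketch (not DriveNets' implementation): a packet is cut into
# fixed-size cells, sprayed round-robin across all fabric links, and
# reassembled in order on the egress side.

CELL_SIZE = 256  # bytes; illustrative value, the real cell size is not public

def spray(packet: bytes, num_links: int):
    """Split a packet into cells and assign each cell to a fabric link."""
    cells = [packet[i:i + CELL_SIZE] for i in range(0, len(packet), CELL_SIZE)]
    # Round-robin spraying: no hashing on flow headers, so every link
    # carries an equal share of every flow.
    return [(seq, seq % num_links, cell) for seq, cell in enumerate(cells)]

def reassemble(received):
    """Egress side: reorder cells by sequence number and rebuild the packet."""
    return b"".join(cell for seq, _link, cell in sorted(received))

packet = bytes(1500)                     # a typical Ethernet payload
sprayed = spray(packet, num_links=8)
assert reassemble(sprayed) == packet
```

Because each flow is striped across every link rather than pinned to one by a hash, no single link can become a hotspot, which is the property Cohen points to when he says congestion avoidance stays inside the fabric.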

Multi-site implementation for distributed GPU clusters

The multi-site capability allows organizations to overcome power constraints in a single data center by spreading GPU clusters across locations.

This isn’t designed as a backup or failover mechanism. Lasser-Raab emphasized that it’s a single cluster in two locations that are up to 80 kilometers apart, which allows for connection to different power grids.

The physical implementation typically uses high-bandwidth connections between sites. Cohen explained that the sites are linked by either dark fiber or DWDM (Dense Wavelength Division Multiplexing) fiber-optic connectivity. Typically the connections are bundles of four 800 Gigabit Ethernet links, acting as a single 3.2 terabit-per-second connection.
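The numbers behind that inter-site link are easy to check. The short sketch below works through the aggregate bandwidth and the one-way propagation delay that 80 km of fiber adds, assuming the commonly cited ~5 microseconds per kilometer of propagation in optical fiber; the figures are back-of-the-envelope estimates, not vendor-supplied.

```python
# Back-of-the-envelope check on the inter-site link (illustrative only).
links = 4
per_link_gbps = 800
aggregate_tbps = links * per_link_gbps / 1000      # 4 x 800 GbE = 3.2 Tbps bundle

distance_km = 80
fiber_delay_us_per_km = 5                          # ~5 µs/km in optical fiber
one_way_delay_ms = distance_km * fiber_delay_us_per_km / 1000

print(f"Aggregate: {aggregate_tbps} Tbps, one-way delay: {one_way_delay_ms} ms")
# Aggregate: 3.2 Tbps, one-way delay: 0.4 ms
```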

Enhanced multi-tenancy for AI workloads

For providers offering GPU-as-a-service or enterprises running multiple AI workloads using Kubernetes, DriveNets has enhanced its traffic isolation capabilities. Kubernetes is increasingly being used by organizations of all sizes as the cloud-native technology for running AI workloads.

“If you use Kubernetes, you usually use multiple workloads on the same cluster, and perhaps even multiple tenants on the same cluster,” Cohen said. “It is very important in such environments to maintain the quality of service per workload or per tenant, even if you have a noisy neighbor.”

He explained that the cell-based fabric provides that critical isolation: no noisy neighbor can affect another workload or tenant running in a different Kubernetes container on the same infrastructure.
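The noisy-neighbor property can be illustrated with a toy scheduler. The sketch below is not DriveNets' mechanism; it simply shows the general idea of serving per-tenant queues in turn, so a tenant with a deep backlog cannot crowd out the traffic of a quieter one.

```python
# Illustrative sketch only (not DriveNets' scheduler): per-tenant queues
# served round-robin, so a "noisy" tenant with a deep backlog cannot
# crowd out the cells of a quieter tenant.
from collections import deque

queues = {
    "tenant-a": deque(f"a{i}" for i in range(1000)),  # noisy neighbor
    "tenant-b": deque(["b0", "b1", "b2"]),            # light workload
}

def serve(queues, budget):
    """Transmit up to `budget` cells, visiting each tenant in turn."""
    sent = []
    while budget and any(queues.values()):
        for tenant, q in queues.items():
            if q and budget:
                sent.append((tenant, q.popleft()))
                budget -= 1
    return sent

# tenant-b's three cells go out within the first few service turns,
# regardless of how much tenant-a has queued.
print(serve(queues, budget=8))
```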

AI-powered operations and products

While DriveNets builds networks for AI workloads, the company is also implementing AI capabilities within its own products and operations. “We usually just joke that we do both networking for AI and AI for networking in different aspects of our product,” Cohen said.

On the product side, DriveNets is integrating AI into its management and orchestration system. The company trained an AI model on a massive volume of network logs to learn about dependencies and issues, and the trained model now powers an assisted root-cause analysis capability.
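DriveNets has not published how that model works, but the general pattern of mining logs for root-cause candidates can be sketched very simply: count which components log errors in the window just before each outage and rank them. The snippet below is a toy illustration of that pattern with synthetic data, not the company's implementation.

```python
# Toy illustration of log-based root-cause ranking (not DriveNets' model):
# rank components by how often their errors appear shortly before an outage.
from collections import Counter

# (timestamp_seconds, component, severity) -- synthetic example entries
logs = [
    (100, "fabric-link-3", "error"),
    (102, "spine-1",       "warning"),
    (105, "fabric-link-3", "error"),
    (110, "leaf-7",        "error"),
]
outages = [108, 112]          # timestamps when service alarms fired
WINDOW = 10                   # look back this many seconds before each outage

candidates = Counter(
    component
    for outage in outages
    for ts, component, severity in logs
    if severity == "error" and outage - WINDOW <= ts < outage
)

# fabric-link-3 logs errors before both outages, so it ranks first.
print(candidates.most_common())
```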

Internally, the company has heavily invested in AI tools. “We’ve invested a lot of money and effort in bringing in AI tools for developing in all departments,” Lasser-Raab said. 

Source: Network World