The Open Networking User Group (ONUG) set its sights on the world of AI networking this week.
As AI usage grows, the network is the core foundation that enables both its current capabilities and its future potential. That’s a message that was echoed time and again during the ONUG AI Networking Summit, held in New York (and webcast).
Enabling the network for AI is both a technical and a process challenge. AI isn’t just another form of data traffic; it’s also a technology that can improve the operation of networks themselves, another key theme explored at the event.
“We have entered the engineered miracle economy afforded by AI to improve the human condition,” Nick Lippis, cofounder and co-chairman of ONUG, said during the opening keynote. “We are infrastructure professionals, this is our golden age, we get to improve the human experience by building and enabling these miracles to happen.”
The challenges of AI on WAN connectivity
With the immense hardware and bandwidth requirements of AI, the challenges for AI connectivity across the WAN are numerous.
In a panel session on the shift from WAN to AI-enhanced solutions, Rajarshi Purkayastha, vice president of customer strategy and presales at Tata Communications, noted that GPUs and AI workloads have extremely high bandwidth requirements, often in the range of hundreds of terabits per second. Connecting these workloads across a traditional WAN is neither feasible nor cost-effective.
Purkayastha emphasized the need for new standards and reference architectures to support the integration of GPUs into a wide range of devices, from phones to IoT.
“We are going to have GPUs on the end devices, which is your phones, your laptops, your IoT devices, and so on and so forth, which means that the network of today is going to shift dramatically to accommodate what the GPUs are going to ask in the future,” he said.
AI will also help to accelerate the deployment and provisioning of WAN capabilities.
Allwyn Sequeira, founder and CEO of Highway 9 Networks, shared how his company used AI and machine learning techniques to significantly reduce the implementation and configuration time for its private mobile cloud solution on a campus.
“We recently did a complete build out at MIT, one of the key buildings, eight stories, multiple radios interfacing with the macro towers,” Sequeira said. “After going through the initial round of configuration, we used AI/ML techniques to then do the day one configuration, and that brought down the implementation and configuration from a matter of weeks to a matter of four or five hours.”
How AI is improving NOC automation
Using AI to improve network automation was a core theme at the ONUG event. ONUG has its own AI-Driven NOC/SOC Automation Project, which detailed its progress during a panel session.
The group recently completed a study on AI automation for the NOC that has not yet been publicly released, but a few top-level findings were revealed during the session. A key use case of generative AI in the NOC is chatbots that assist users. When respondents were asked about the primary benefits of generative AI in the NOC, the top response was that it can improve the productivity and effectiveness of operations teams.
The survey findings were echoed by the panelists with some real-world examples. Parantap Lahiri, vice president of network and data center engineering at eBay, said that his organization is using AI today in a network monitoring system. That system sorts through and analyzes a large volume of log and alert data, helping human operators prioritize what needs attention.
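eBay’s system isn’t publicly documented, but the general pattern it describes, scoring and grouping a flood of alerts so operators see the worst problems first, can be sketched in a few lines of Python. Everything below (the Alert fields, the severity weights) is an illustrative assumption, not eBay’s implementation:

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical alert record; field names are illustrative, not eBay's schema.
@dataclass
class Alert:
    device: str
    severity: str  # "critical", "major", or "minor"
    message: str

SEVERITY_WEIGHT = {"critical": 100, "major": 10, "minor": 1}

def prioritize(alerts: list[Alert], top_n: int = 5) -> list[tuple[str, int]]:
    """Rank devices by total severity weight so operators triage the
    most severely affected devices first."""
    scores = Counter()
    for a in alerts:
        scores[a.device] += SEVERITY_WEIGHT.get(a.severity, 1)
    return scores.most_common(top_n)

alerts = [
    Alert("spine-1", "critical", "BGP session down"),
    Alert("leaf-7", "minor", "fan speed high"),
    Alert("spine-1", "major", "interface error rate rising"),
]
print(prioritize(alerts))  # [('spine-1', 110), ('leaf-7', 1)]
```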
Xiaobo Long, head of backbone network services at Citi, noted that her organization is using AI chatbots to help address resource constraints on the network team.
“I believe that chatbots will eliminate a lot of hours, so that our team can focus on solving more complicated problems for the customers,” she said.
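One common way to build the kind of chatbot Long describes is to ground the model’s answer in vetted runbook text, so the team’s procedures rather than the model’s guesses drive the response. The sketch below illustrates that pattern; the `ask_llm` function and the runbook entries are placeholders, not Citi’s tooling:

```python
# Sketch of a runbook-grounded NOC chatbot. `ask_llm` is a placeholder
# for whatever model endpoint a team actually uses; the runbook entries
# are invented for illustration.
RUNBOOK = {
    "bgp flap": "Check interface error counters, then review recent config pushes.",
    "high latency": "Verify link utilization and QoS queue drops first.",
}

def ask_llm(prompt: str) -> str:
    # Stand-in for a real model call.
    return f"[model answer grounded in]: {prompt}"

def noc_chatbot(question: str) -> str:
    """Prepend any matching runbook guidance so the model answers from
    vetted procedure rather than from scratch."""
    context = [text for key, text in RUNBOOK.items() if key in question.lower()]
    prompt = f"Context: {' '.join(context) or 'none'}\nQuestion: {question}"
    return ask_llm(prompt)

print(noc_chatbot("Why does BGP flap keep recurring on spine-1?"))
```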
The impact of AI on network configuration
The future of networking is moving towards autonomous, self-driving capabilities, but the path to get there is not without its challenges. The ONUG session on network configuration featured discussions on the evolution of network automation and AI integration.
Automation is nothing new in networking, but the key difference with AI is the shift from pure automation to augmentation.
Mark Berly, CTO for data center networking at Aruba, a Hewlett Packard Enterprise company, noted that technologies like zero-touch provisioning have expanded the use of automation in recent years. For known, existing processes, automation is already well established; earlier approaches were largely about codifying specific use cases and workflows. Where AI goes further is in its ability to augment those processes.
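The codified style is easy to picture in code: deterministic, template-driven provisioning that handles exactly the workflow it was written for and nothing else. The sketch below (all device names and template fields invented) shows that baseline:

```python
# Deterministic, template-driven provisioning in the "codified workflow"
# style: the same inputs always produce the same config, and anything
# outside the template simply isn't handled. Template and fields are invented.
ZTP_TEMPLATE = """hostname {hostname}
interface {uplink}
 description uplink-to-{peer}
 no shutdown
"""

def render_day0_config(hostname: str, uplink: str, peer: str) -> str:
    """Fill the fixed template; no judgment, no adaptation."""
    return ZTP_TEMPLATE.format(hostname=hostname, uplink=uplink, peer=peer)

print(render_day0_config("leaf-12", "Ethernet1/49", "spine-1"))
```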
In contrast, the panelists see AI-powered network automation as a shift towards more adaptive, autonomous capabilities that can handle unexpected situations, rather than just predefined tasks.
Fully autonomous networks, where everything is automated, may well arrive eventually, but it’s going to take time. Berly noted that he has a self-driving car and joked that it has tried to kill him at least once, which is why he now only uses the feature for parking.
“I do think we’re getting closer and closer to that autonomous state and honestly, it scares me. Like I said, my car almost killed me, I’m sure my network will try to crash itself,” Berly said.
Impact of adopting genAI on existing network capacity
As the adoption of generative AI (GenAI) continues to accelerate, the impact on existing network capacities and topologies is becoming a pressing concern.
In a panel session, Gerald de Grace, cloud architect and technical product manager at Microsoft, highlighted the immense scale of the challenge, noting that the company is looking at clusters with over 300,000 GPUs. The sheer number of components involved means that failures are inevitable, and de Grace emphasized the need for automated systems to quickly detect, isolate, and mitigate these issues.
“So we have to build in autonomous systems and AI and other things to identify the components that are failing, isolate and get rid of them from the network as quickly as possible,” he said.
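A toy version of that detect-and-isolate loop shows the shape of the idea. The bit-error threshold and the drain step below are stand-ins; production systems at this scale are far more sophisticated:

```python
# Toy detect-and-isolate loop. Thresholds, counters, and the drain
# mechanism are all assumptions for illustration.
def failing(link: dict, ber_threshold: float = 1e-6) -> bool:
    """Flag a link whose observed bit-error rate exceeds the threshold."""
    return link["errors"] / max(link["bits"], 1) > ber_threshold

def drain(link: dict) -> None:
    # Stand-in for real mitigation, e.g. raising the link's routing cost
    # so traffic moves off before the hardware is swapped out.
    link["drained"] = True
    print(f"draining {link['name']}")

links = [
    {"name": "pod3-link12", "errors": 5_000, "bits": 10**9, "drained": False},
    {"name": "pod3-link13", "errors": 2, "bits": 10**9, "drained": False},
]
for link in links:
    if failing(link):
        drain(link)  # only pod3-link12 trips the threshold
```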
The panel also looked at the tradeoffs between InfiniBand and Ethernet.
De Grace acknowledged that currently, Microsoft is not opposed to using InfiniBand, as it is the established standard for the GPU stack. However, he noted that operationally, InfiniBand presents some challenges for the company. He explained that Microsoft is seeing some initial Ethernet-based solutions that support RoCEv2 (RDMA over Converged Ethernet), and he expects these to be proven in smaller data centers over the next year or two. After that, he believes Microsoft will likely transition away from InfiniBand towards more Ethernet-based networking for AI workloads.
The key drivers for this shift, according to de Grace, are the operational simplicity and cost-effectiveness of Ethernet compared to InfiniBand. He noted that training all of their engineers on InfiniBand is “just another thing that we have to do,” and the company would prefer to focus on Ethernet-centric solutions that can be more easily integrated into its existing infrastructure.
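Part of the operational work on the Ethernet path is that RoCEv2 typically depends on lossless-Ethernet features, such as priority flow control (PFC) on the RDMA traffic class and ECN marking, being configured consistently across the fabric. The sketch below audits a hypothetical port inventory for those prerequisites; the field names and data are assumptions:

```python
# Audit a (hypothetical) switch-port inventory for the usual RoCEv2
# prerequisites: PFC enabled on the RDMA traffic class, plus ECN.
RDMA_PRIORITY = 3  # traffic class commonly used for RoCE

ports = [
    {"name": "Eth1/1", "pfc_priorities": [3], "ecn": True},
    {"name": "Eth1/2", "pfc_priorities": [], "ecn": True},
]

def rocev2_ready(port: dict) -> bool:
    return RDMA_PRIORITY in port["pfc_priorities"] and port["ecn"]

for port in ports:
    status = "ready" if rocev2_ready(port) else "needs PFC/ECN configuration"
    print(port["name"], status)
```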
Whether it’s InfiniBand, GPU connectivity, or other technical elements of AI networking at scale, Citi’s Long emphasized that she wants to see standardization of the protocols and interfaces for AI networking.
“We keep doing standardization, always keeping the simplified technology across our different environments, so that’s always the best practice to follow, no matter if we want to support AI or not,” she said.
Source: Network World