
In the modern AI era, networks are being pushed in new ways that put strain on multiple components of the infrastructure.
Back in June, the Ultra Ethernet Consortium (UEC) published its 1.0 specification to address many of the networking challenges. What remains to be solved, though, is how data storage is accessed and managed. That’s the challenge SNIA is looking to address with its Storage.AI open standards effort that is officially getting underway today.
Storage.AI is chartered to be an open standards project targeting AI-specific data infrastructure problems. Fifteen major vendors, including AMD, Cisco, Dell, IBM, Intel, NetApp, Pure Storage, Samsung, Seagate and WEKA, are joining the effort as founding members to develop vendor-neutral solutions for AI data services.
The initiative coordinates multiple technical specifications that exist separately but lack integration for AI workloads. The initial set of specifications includes:
- SDXI (Smart Data Accelerator Interface) for hardware-level memory movement
- GPU Direct Access protocols
- GPU-Initiated I/O standards
- File and Object over RDMA implementations
- Compute-near-storage frameworks
- NVM (non-volatile memory) programming models
“If I’ve got to build entire nuclear power plants to be able to run one workload, then I really want that workload to be as efficient as I possibly can,” SNIA Chair J Metz told Network World.
Ultra Ethernet is only one part of AI network improvement
In addition to his work at SNIA, Metz is also the chair of the steering committee at the Ultra Ethernet Consortium. Metz noted that while UEC is a core part of improving networks for AI workloads, those workloads still rely on data storage networks that might not be as well optimized.
“The problem is, if you’ve got a roadblock at the other end of the wire, then Ultra Ethernet isn’t efficient at all,” Metz explained. “When you start to piece together how the data moves through buffers, both in and out of a network, you start to realize that you are piling up problems if you don’t have an end-to-end solution.”
Storage.AI targets these post-network optimization points rather than competing with networking protocols. The initiative focuses on data-handling efficiency after packets reach their destinations, ensuring that advanced networking investments translate into measurable application performance improvements.
AI data typically resides on separate storage networks rather than the high-performance fabrics connecting GPU clusters. File and Object over RDMA specifications within Storage.AI would enable storage protocols to operate directly over Ultra Ethernet and similar fabrics, eliminating network traversal inefficiencies that force AI workloads across multiple network boundaries.
“Right now, the data is not on Ultra Ethernet, so we’re not using Ultra Ethernet at all to its maximum potential to be able to get the data inside of a processor,” Metz noted.
Why AI workloads break traditional storage models
AI applications challenge assumptions about data access patterns that network engineers take for granted.
Metz noted that machine learning pipelines consist of distinct phases, including ingestion, preprocessing, training, checkpointing, archiving and inference. Each of those phases requires different data structures, block sizes and access methods. Current architectures force AI data through multiple network detours.
Storage systems typically connect to separate networks from high-performance GPU clusters. Every data request travels from GPU through CPU, across management networks, through storage controllers, and back, multiplying latency and consuming bandwidth on each hop.
“Most people don’t realize that the data doesn’t actually exist inside of the networks that they think it does,” Metz explained. “They just sort of assume that it’s a server talking to a storage device and they don’t understand that it’s actually a lot of detours to get to it.”
CPU-GPU bottleneck creates architectural mismatch
The fundamental problem stems from GPU dependency on CPU-mediated I/O operations.
GPUs cannot initiate storage requests independently; they must request permission from CPUs for every data access, creating a massive computational bottleneck.
“If I’ve got a CPU with 200 cores feeding a GPU with 15,000 cores, that means that you’re off by several orders of magnitude,” Metz noted. “You want to have the GPU initiate the I/O specifically for the number of processes that it wants to be fed.”
Current systems force every I/O operation through CPU handshaking processes. Data travels from storage devices through CPU kernels, up through software abstraction layers, into CPU memory, then down to GPU memory locations. This packet-walking process repeats for every single data request.
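To make the contrast concrete, here is a minimal sketch in Python of the two data paths described above as a toy cost model. The hop names, copy counts and latency figures are illustrative assumptions, not measurements of any real system or anything defined by the Storage.AI specifications.

```python
# Illustrative sketch only: a toy model of the CPU-mediated path versus a
# direct-placement path. All hop names and per-hop costs are assumptions.

from dataclasses import dataclass

@dataclass
class Hop:
    name: str
    copies: int          # number of buffer copies this hop adds
    latency_us: float    # assumed per-hop latency (hypothetical numbers)

# Today's CPU-mediated path: storage -> kernel -> user space -> host RAM -> GPU RAM
cpu_mediated = [
    Hop("storage controller -> CPU kernel buffer", 1, 20.0),
    Hop("kernel buffer -> framework staging buffer", 1, 5.0),
    Hop("staging buffer -> pinned host memory", 1, 5.0),
    Hop("pinned host memory -> GPU memory (DMA)", 1, 10.0),
]

# The direct-placement path the Storage.AI standards aim to enable:
# the storage device DMAs straight into GPU memory, with no CPU bounce buffers.
direct_placement = [
    Hop("storage controller -> GPU memory (peer DMA)", 1, 15.0),
]

def summarize(path):
    return sum(h.copies for h in path), sum(h.latency_us for h in path)

for label, path in [("CPU-mediated", cpu_mediated), ("Direct placement", direct_placement)]:
    copies, latency = summarize(path)
    print(f"{label}: {copies} copies, ~{latency:.0f} microseconds per request (toy numbers)")
```

The point of the model is not the specific numbers but the multiplication: every extra copy and hop is paid again on every single request.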
Proprietary solutions exist to help solve the problem: Nvidia already has GPU Direct for Storage, and AMD provides Infinity Storage. But Metz noted that there aren’t open standards to enable GPU-initiated I/O or direct storage access across vendor ecosystems.
Technical standards target multiple infrastructure layers
Storage.AI coordinates existing SNIA specifications rather than creating entirely new protocols. This approach accelerates deployment by building on proven technologies that already handle production workloads.
SDXI (Smart Data Accelerator Interface) enables applications to directly command hardware for memory-to-memory data movement, bypassing multiple software abstraction layers. For AI workloads, GPUs could use their native hardware to transform data into vector mathematics formats required for processing, eliminating separate preprocessing steps that consume additional CPU and network resources.
“If I’ve got a memory mover, and I can do the mutation of the data in the hardware, I can use the hardware of a GPU in theory to mutate the data into the vector mathematics that are necessary to do the native GPU processing by using the GPU native hardware itself,” Metz explained.
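As a rough illustration of that idea, the sketch below models a descriptor-driven memory mover that can apply a transform while it copies, loosely inspired by the SDXI description above. The descriptor fields, the submit() helper and the float32 conversion are hypothetical stand-ins for this article, not SDXI’s actual descriptor format or API.

```python
# Conceptual sketch of a descriptor-driven memory mover with an in-path
# transform. The fields and the float32 conversion are illustrative
# assumptions, not the SDXI specification's layout.

import array
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class MoveDescriptor:
    src: memoryview                     # source buffer
    dst: memoryview                     # destination buffer
    length: int                         # bytes to move
    transform: Optional[Callable[[bytes], bytes]] = None  # optional in-path mutation

def submit(desc: MoveDescriptor) -> None:
    """Software stand-in for what a hardware data mover would execute."""
    chunk = bytes(desc.src[:desc.length])
    if desc.transform:
        chunk = desc.transform(chunk)   # e.g. reshape data into the layout the GPU consumes
    desc.dst[:len(chunk)] = chunk

# Hypothetical transform: turn raw uint8 samples into normalized float32 vectors
def to_float32_vectors(raw: bytes) -> bytes:
    return array.array("f", (b / 255.0 for b in raw)).tobytes()

src = bytearray(range(16))              # pretend this is data landing from storage
dst = bytearray(16 * 4)                 # pretend this is GPU-visible memory
submit(MoveDescriptor(memoryview(src), memoryview(dst), 16, to_float32_vectors))
print(array.array("f", dst)[:4])
```

The design point is that the transformation rides along with the move itself, so no separate preprocessing pass through host software is needed.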
GPU Direct Access standards would enable storage devices to transfer data directly into GPU memory without CPU involvement. Combined with GPU-Initiated I/O protocols, GPUs could request and receive data independently, matching their massive parallel processing capabilities with appropriate I/O parallelism.
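The parallelism argument can be shown with a small, hedged sketch: a single mediator servicing requests one at a time versus many independent initiators keeping requests in flight concurrently. The thread count and the simulated read latency are arbitrary assumptions for illustration only.

```python
# Toy sketch of serialized, mediated I/O versus many independent initiators.
# The fake read and all counts are illustrative assumptions.

import time
from concurrent.futures import ThreadPoolExecutor

def fake_read(request_id: int) -> int:
    time.sleep(0.001)          # stand-in for one small storage read
    return request_id

REQUESTS = 256

# CPU-mediated: a single mediator services every request one at a time.
start = time.perf_counter()
for i in range(REQUESTS):
    fake_read(i)
serial = time.perf_counter() - start

# GPU-initiated: each consumer submits its own request; many are in flight at once.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=64) as pool:
    list(pool.map(fake_read, range(REQUESTS)))
parallel = time.perf_counter() - start

print(f"single mediator: {serial:.3f}s, independent initiators: {parallel:.3f}s")
```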
Compute-near-storage frameworks position processing capabilities closer to data repositories, reducing data movement requirements for preprocessing operations that currently consume significant network bandwidth.
Implementation strategy avoids previous standards pitfalls
While Storage.AI is still very new, the goal is to get it implemented quickly, and Metz hopes a modular approach will help achieve that.
The Storage.AI effort learns from previous failed storage networking initiatives that required the full specification to be finished before any implementation could begin. The modular approach allows organizations to implement individual components as they become available.
“There’s a lot of really good meaty things that exist, but nobody knows how to put them together,” Metz observed. “It’s been basically a solution, looking for a problem and not realizing that the problem has been staring us in the face the whole time.”
Source: Network World