
Delivering Efficient, High-Performance AI Clouds with NVIDIA DOCA 2.5

Image shows the range of applications available for delivery on an NVIDIA BlueField networking platform with the NVIDIA DOCA SDK and acceleration framework.

As a comprehensive software framework for data center infrastructure developers, NVIDIA DOCA has been adopted by leading AI, cloud, enterprise, and ISV innovators. The release of DOCA 2.5 marks its third anniversary, and thanks to the stability and robustness of the code base, combined with several networking and platform upgrades, DOCA 2.5 is the first NVIDIA BlueField-3 long-term support (LTS) release for AI cloud deployments.

Alongside NVIDIA switches, BlueField DPUs, and SuperNICs, DOCA 2.5 is an essential element of a co-designed platform created to support the most demanding AI workloads. As part of the NVIDIA full-stack architecture, NVIDIA networking components deliver optimal application performance, security, and data center efficiency. When deployed alongside the NVIDIA computing platform and software tools, they offer additional benefits and synergies.

This post looks at some of the newest networking offerings from NVIDIA and explains how DOCA 2.5 is an integral part of AI infrastructure.

Backbone of AI infrastructure

It’s now widely understood that a high-performance network is the backbone of efficient AI infrastructure. To achieve optimal AI performance, significant consideration must be given to its capabilities, implementation, and deployment for both generative AI and foundational models.

Due to their distinct properties and significant computational demands, modern AI workloads require specialized network infrastructure to operate at peak efficiency. Leading the way in AI and accelerated computing, we created the NVIDIA Spectrum-X Ethernet networking platform to meet this requirement and improve the effectiveness and performance of AI clouds.

The Spectrum-4 Ethernet switch and BlueField-3 SuperNIC from NVIDIA form the basis of the Spectrum-X platform and the foundation of our accelerated computing fabric for artificial intelligence. The BlueField-3 SuperNIC offers numerous technology benefits for a wide range of industries. When deployed in our flagship AI systems, BlueField-3 SuperNICs not only enhance performance but also provide deterministic and isolated performance for tenant jobs.

Figure 1. NVIDIA Spectrum-X and BlueField-3 hardware (NVIDIA Spectrum-4 switch and NVIDIA BlueField-3 DPU)

NVIDIA synergy

The Spectrum-X platform combines co-designed, best-in-class hardware to deliver unparalleled performance synergies and an unmatched customer experience. Integral to the design, BlueField-3 SuperNICs take Ethernet networking to new heights for AI systems running on a cluster of GPU-based servers. 

In contrast, conventional network interface cards lack the features that AI workloads require. BlueField SuperNICs ensure that the processes needed to execute cloud-based AI workloads effectively are delivered with efficiency and speed.

When combined with an NVIDIA GPU, this marriage of technologies (available for most enterprise-class servers) creates an optimized solution for AI cloud computing, delivering matchless levels of efficiency, performance, and flexibility.

Validated across the full stack of NVIDIA hardware and software, Spectrum-X and NVIDIA GPUs create a truly peerless Ethernet solution for AI clouds. With such broad integration across the stack, the platform can be fine-tuned to a near-custom degree, producing unique solutions dedicated to the delivery of precision workloads.

As a component of the full stack, DOCA is a critical piece of the AI puzzle and ties together compute, networking, storage, and security.

Figure 2. NVIDIA hardware and software stack (the diagram includes SONiC, Cumulus, NetQ, DOCA services, NVIDIA Air, SAI/SPSDK, DOCA, and Magnum IO)

New features for AI clouds and data center infrastructure

DOCA helps enable the most advanced GPU-accelerated AI workloads today. For systems that include a GPU and NVIDIA BlueField-3 DPUs or BlueField-3 SuperNICs, there are further advantages for developers, as summarized in Table 1.

 
BlueField-3 DPU

Mission:
> Cloud infrastructure processor
> Offload, accelerate, and isolate data center infrastructure
> Optimized for N-S in GPU-class systems

Unique capabilities:
> Powerful computing
> Secure, zero-trust management
> Data storage acceleration
> Elastic infrastructure provisioning
> 1-2 DPUs per system

BlueField-3 SuperNIC

Mission:
> Accelerated networking for AI computing
> Best-in-class RoCE networking
> Optimized for E-W in GPU-class systems

Unique capabilities:
> Powerful networking
> AI networking feature set
> Full-stack NVIDIA AI optimization
> Power-efficient, low-profile design
> Up to 8 SuperNICs per system

Shared capabilities (both):
> VPC network acceleration
> Network encryption acceleration
> Programmable network pipeline
> Precision timing
> Platform security

Table 1. NVIDIA BlueField-3 DPU and SuperNIC comparison

Specifically, DOCA capitalizes on the numerous NVIDIA-led development, integration, and testing programs that enable and optimize the entire range of AI application frameworks. The convergence of NVIDIA technologies fuels data center innovation and rapid AI application deployment.

Released in December 2023, DOCA 2.5 offers several enhancements that boost performance within the data center. Both the number of virtual functions and the volume of ‘east-west’ network traffic continue to increase. In response, DOCA and BlueField-3 SuperNICs are essential for optimizing the network and establishing it as the backbone of modern AI infrastructure.

Figure 4. DOCA 2.5 architecture (the diagram shows an application layer with networking, security, and storage; DOCA services including orchestration, telemetry, and Firefly; libraries including Crypto, App Shield, and Rivermax; and drivers including UCX, UCC, and RDMA)

DOCA PCC now generally available

Within multi-tenant AI cloud environments where multiple AI jobs run simultaneously, there is a potential for network congestion to arise. 

The DOCA PCC library, now generally available, provides a high-level programming interface that enables partners to implement customized congestion control (CC) algorithms. The library uses NVIDIA BlueField-3 SuperNIC acceleration for CC management and provides an API that abstracts hardware complexity to simplify programming. Partners can focus on the functionality of their CC algorithms and implement them quickly with BlueField hardware acceleration.

DOCA PCC also gives you the flexibility to develop an optimal solution to handle congestion in your clusters. Customized congestion control is critical for AI workflows, enabling performance isolation, improving fairness, and preventing packet drop on lossy networks. 
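As an illustration of the kind of logic such an algorithm expresses, the following is a minimal sketch of a per-flow rate update: multiplicative decrease when congestion is signaled (for example, by ECN-marked traffic or an RTT spike) and additive increase otherwise. The types, names, and constants are hypothetical and are not the DOCA PCC API; in a real deployment, this logic would be written against the DOCA PCC callbacks and run with BlueField-3 hardware acceleration.

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative per-flow congestion control state (not the DOCA PCC API). */
struct cc_flow_state {
    uint64_t rate_mbps;      /* current sending rate */
    uint64_t target_mbps;    /* rate before the last decrease */
};

#define CC_MIN_RATE_MBPS     100ULL
#define CC_LINE_RATE_MBPS    400000ULL   /* 400 Gb/s BlueField-3 port */
#define CC_ADDITIVE_STEP     50ULL       /* Mb/s added per quiet interval */

/* React to one measurement interval: congestion_seen would be derived from
 * events such as ECN-marked packets (CNPs) or an RTT probe exceeding its
 * threshold. */
void cc_update(struct cc_flow_state *fs, bool congestion_seen)
{
    if (congestion_seen) {
        /* Multiplicative decrease: back off quickly to drain queues. */
        fs->target_mbps = fs->rate_mbps;
        fs->rate_mbps   = fs->rate_mbps / 2;
        if (fs->rate_mbps < CC_MIN_RATE_MBPS)
            fs->rate_mbps = CC_MIN_RATE_MBPS;
    } else {
        /* Additive increase: probe for bandwidth without overshooting. */
        fs->rate_mbps += CC_ADDITIVE_STEP;
        if (fs->rate_mbps > CC_LINE_RATE_MBPS)
            fs->rate_mbps = CC_LINE_RATE_MBPS;
    }
}
```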

NVIDIA Spectrum-X is a breakthrough Ethernet networking solution for building multi-tenant, hyperscale AI clouds. It uses DOCA PCC to implement congestion control.

DOCA Flow: New and enhanced features for cloud deployments

DOCA Flow is an essential programming tool used to develop DOCA services. DOCA 2.5 adds further support for the development of NVIDIA OVS-DOCA, a high-performance virtual switch native to NVIDIA NICs and DPUs, and of NVIDIA DOCA HBN services.

With NVIDIA DOCA Flow, you can define and control the flow of network traffic, implement network policies, and manage network resources programmatically. It offers network virtualization, telemetry, load balancing, security enforcement, and traffic monitoring. 

These capabilities are beneficial for processing high-packet-rate workloads with low latency, conserving CPU resources, and reducing power usage. Fundamentally, DOCA Flow is a key enabler for multiple use cases in cloud networking. Used for the development of custom software-defined networking (SDN), it is a key building block for cloud service providers (CSPs) designing the networks of tomorrow.
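To make the programming model concrete, here is a minimal sketch of the match-action idea that DOCA Flow exposes: rules that match packet header fields and apply an action such as forward or drop. The structures and names below are hypothetical illustrations, not DOCA Flow API definitions; real pipelines are created and populated through the DOCA Flow library and executed with hardware offload.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical match-action types; NOT DOCA Flow API definitions. */
struct flow_match {
    uint32_t dst_ip;     /* destination IPv4 address (host byte order), 0 = wildcard */
    uint16_t dst_port;   /* destination L4 port, 0 = wildcard */
    uint8_t  ip_proto;   /* 6 = TCP, 17 = UDP, 0 = wildcard */
};

enum flow_action { ACTION_FWD_TO_PORT, ACTION_DROP, ACTION_COUNT_ONLY };

struct flow_rule {
    struct flow_match match;
    enum flow_action  action;
    uint16_t          out_port;
};

/* First-match classification, the way a hardware pipeline resolves a packet
 * against its programmed entries. */
const struct flow_rule *classify(const struct flow_rule *rules, size_t n_rules,
                                 const struct flow_match *pkt)
{
    for (size_t i = 0; i < n_rules; i++) {
        const struct flow_match *m = &rules[i].match;

        if (m->dst_ip && m->dst_ip != pkt->dst_ip)
            continue;
        if (m->dst_port && m->dst_port != pkt->dst_port)
            continue;
        if (m->ip_proto && m->ip_proto != pkt->ip_proto)
            continue;
        return &rules[i];
    }
    return NULL; /* miss: hand the packet to a slower path */
}

int main(void)
{
    /* Example policy: forward RoCEv2-style UDP/4791 traffic for 10.0.0.42
     * to uplink port 1, drop all other traffic destined to that host. */
    const struct flow_rule rules[] = {
        { { 0x0a00002a, 4791, 17 }, ACTION_FWD_TO_PORT, 1 },
        { { 0x0a00002a, 0,    0  }, ACTION_DROP,        0 },
    };
    const struct flow_match pkt = { 0x0a00002a, 4791, 17 };
    const struct flow_rule *hit = classify(rules, 2, &pkt);

    printf("action: %d\n", hit ? (int)hit->action : -1);
    return 0;
}
```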

DOCA services

The following are some examples of DOCA services that have been upgraded in the DOCA 2.5 release:

Host-based networking

Upgraded in DOCA 2.5, host-based networking (HBN) is a DOCA service that enables network architects to design networks based purely on L3 protocols, with routing running on the servers of the network. On BlueField, the HBN solution packages a set of network functions inside a container that is deployed as a service pod running on the DPU.

DOCA HBN gives network architects the ability to create controller-less virtual private clouds (VPCs). This is ideal for CSPs, telcos, and enterprise customers deploying bare-metal as a service (BMaaS) infrastructures.

Compared to conventional networking solutions, DOCA HBN offers a number of benefits. In addition to improving the scalability and efficiency of deployment, it provides enhanced security options, a simplified underlay network fabric, and reduced OPEX. When used in conjunction with third-party switches, DOCA HBN shifts several top-of-rack (ToR) switch functions to the BlueField-3 DPU or SuperNIC, reducing third-party license costs.
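As a sketch of what routing on the server can look like, the following is a minimal FRR-style BGP unnumbered configuration of the kind run by the routing stack packaged with HBN. The ASN, router ID, and interface names are hypothetical, and an actual HBN deployment is configured through HBN's own configuration interface rather than by editing the routing daemon directly.

```
router bgp 65101
 bgp router-id 10.10.10.1
 ! BGP unnumbered: peer over the uplinks without per-link IP addressing
 neighbor p0 interface remote-as external
 neighbor p1 interface remote-as external
 !
 address-family ipv4 unicast
  ! Advertise locally connected prefixes into the L3 fabric
  redistribute connected
 exit-address-family
```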

For more information about the new HBN functions, including RoCE support and routing and ACL enhancements, see the DOCA 2.5 release notes.

DOCA Firefly

DOCA Firefly provides Precision Time Protocol (PTP)-based time synchronization services that use the hardware acceleration of NVIDIA DPUs and SuperNICs.

New to DOCA 2.5, DOCA Firefly includes industry-specific profiles that improve the user experience and simplify deployment for common PTP use cases. The current profiles, Media and Telco, are preconfigured with industry-specific functions and performance parameters.
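The arithmetic at the heart of PTP synchronization is simple: from the four timestamps of a Sync/Delay_Req exchange, the client derives its clock offset and the mean path delay, assuming symmetric forward and reverse path delays. A minimal sketch in C, with timestamps in nanoseconds:

```c
#include <stdint.h>

/* PTP delay-request/response exchange timestamps, in nanoseconds:
 *   t1: master sends Sync          t2: slave receives Sync
 *   t3: slave sends Delay_Req      t4: master receives Delay_Req */
struct ptp_timestamps {
    int64_t t1, t2, t3, t4;
};

/* Clock offset of the slave relative to the master, assuming the forward
 * and reverse path delays are symmetric. */
int64_t ptp_offset_ns(const struct ptp_timestamps *ts)
{
    return ((ts->t2 - ts->t1) - (ts->t4 - ts->t3)) / 2;
}

/* One-way mean path delay under the same symmetry assumption. */
int64_t ptp_mean_path_delay_ns(const struct ptp_timestamps *ts)
{
    return ((ts->t2 - ts->t1) + (ts->t4 - ts->t3)) / 2;
}
```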

Storage SNAPv4

The DOCA SNAPv4 service on BlueField-3 adds inline AES-XTS, the standard cryptographic algorithm for protecting the confidentiality of data at rest on storage devices. SNAP now accelerates AES-XTS encryption in hardware, improving the encryption process while reducing CPU overhead.
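For context, here is a minimal host-side sketch of AES-256-XTS encryption of a single logical block using OpenSSL's EVP interface. It illustrates the algorithm that SNAP offloads, not the SNAP API itself; in XTS mode the 16-byte tweak is conventionally derived from the logical block address, and the key material is two concatenated 256-bit keys.

```c
#include <openssl/evp.h>

/* Encrypt one logical block with AES-256-XTS (the data-at-rest algorithm
 * that SNAPv4 offloads to BlueField-3 hardware). Returns 0 on success.
 * key: 64 bytes (two 256-bit keys); tweak: 16 bytes, usually the LBA;
 * len must be at least 16 bytes for XTS mode. */
int xts_encrypt_block(const unsigned char key[64],
                      const unsigned char tweak[16],
                      const unsigned char *plain, int len,
                      unsigned char *cipher)
{
    EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
    int outl = 0, finl = 0, ok = 0;

    if (!ctx)
        return -1;

    if (EVP_EncryptInit_ex(ctx, EVP_aes_256_xts(), NULL, key, tweak) == 1 &&
        EVP_EncryptUpdate(ctx, cipher, &outl, plain, len) == 1 &&
        EVP_EncryptFinal_ex(ctx, cipher + outl, &finl) == 1)
        ok = 1;

    EVP_CIPHER_CTX_free(ctx);
    return ok ? 0 : -1;
}
```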

The SNAPv4 service for virtio-blk now offers recovery, hot-upgrade, and live migration (LM) without force-in-order traffic. This improves support for these functions and removes the need to operate with force-in-order traffic, making the tool more practical in real-world settings: customers such as CSPs can now offer improved uptime and uninterrupted performance for end users performing vital storage tasks.

More updates

For more information about additional updates and features in this release, see the DOCA 2.5 release notes.

Conclusion

Modern AI workloads require sophisticated network solutions to operate at peak efficiency. Today, organizations across the globe face a significant challenge when trying to embed AI into their existing operational and technical infrastructure.

To meet this requirement, NVIDIA, as the leader in AI and accelerated computing, has created an optimized networking platform to drive the performance of AI cloud computing. Central to the effectiveness of this platform are the synergies gained from complementary technologies employed by the various NVIDIA-branded hardware and software solutions. 

In its full-stack architecture, NVIDIA has made several design decisions to increase operational effectiveness across the various platforms. When combined with NVIDIA GPUs, Spectrum-X, a solution comprising NVIDIA Ethernet switches and BlueField SuperNICs, creates a truly peerless Ethernet platform for AI clouds. With the latest release of the NVIDIA DOCA SDK, NVIDIA has made additional strides toward enabling the most advanced GPU-accelerated AI workloads today.

To begin your development journey with all the benefits DOCA has to offer, download NVIDIA DOCA today.

Source: NVIDIA
