New MLPerf Inference Network Division Showcases NVIDIA InfiniBand and GPUDirect RDMA Capabilities

In MLPerf Inference v3.0, NVIDIA made its first submissions to the newly introduced Network division, which is now part of the MLPerf Inference Datacenter suite. The Network division is designed to simulate a real data center setup and strives to include the effect of networking—including both hardware and software—in end-to-end inference performance.

The Network division defines two types of nodes: Frontend nodes, which generate the queries, and Accelerator nodes, which perform inference on them. Queries travel from the Frontend nodes to the Accelerator nodes over standard network fabrics such as Ethernet or InfiniBand.

Figure 1. Closed division vs. Network division nodes

Figure 1 shows that the Closed division runs entirely on a single node. In the Network division, queries are generated on the Frontend node and transferred to the Accelerator node for inferencing.

In the Network division, the Accelerator nodes incorporate the inference accelerators as well as all networking components. This includes the network interface controller (NIC), network switch, and network fabric. So, while the Network division seeks to measure the performance of the Accelerator node and network, it excludes the impact of the Frontend node as the latter plays a limited role in the benchmarking.

NVIDIA performance in the MLPerf Inference v3.0 Network division

In MLPerf Inference v3.0, NVIDIA made Network division submissions on both the ResNet-50 and BERT workloads. The NVIDIA submissions achieved 100% of single-node performance on ResNet-50 by using the extremely high network bandwidth and low latency of GPUDirect RDMA technology on NVIDIA ConnectX-6 InfiniBand smart adapter cards.

Benchmark                   Scenario   Network vs. Closed division performance (DGX A100, 8x A100 80GB)
ResNet-50 (Low Accuracy)    Offline    100%
ResNet-50 (Low Accuracy)    Server     100%
BERT (Low Accuracy)         Offline    94%
BERT (Low Accuracy)         Server     96%
BERT (High Accuracy)        Offline    90%
BERT (High Accuracy)        Server     96%

Table 1. ResNet-50 and BERT performance achieved in the Network division compared to the Closed division

The NVIDIA platform also showcased great performance on the BERT workload, with only a slight reduction relative to the corresponding Closed division submission due to host-side batching overhead.

Technologies used in the NVIDIA Network division submission

A host of full-stack technologies enabled the strong NVIDIA Network division performance, including NVIDIA ConnectX-6 InfiniBand networking, GPUDirect RDMA, and the NVIDIA TensorRT and CUDA software stack, all detailed in the following sections.

Network division implementation details

Here’s a closer look at the implementation details of the MLPerf Inference Network division:

InfiniBand for high-throughput, low-latency communication

The Network division requires the submitter to implement a query dispatch library (QDL) that takes the queries from the load generator and dispatches them to the Accelerator nodes in a way suited to the submitter’s setup.
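
The QDL interface itself is left to each submitter. As a rough, hypothetical sketch (the class and method names below are illustrative and not taken from the NVIDIA code), it can be thought of as a thin layer between LoadGen and the network connections:

```cpp
// Illustrative QDL interface sketch; names are hypothetical.
#include <cstddef>
#include <cstdint>
#include <vector>

struct QuerySample {
    uint64_t id;        // LoadGen-assigned sample ID
    const void* data;   // input tensor bytes
    size_t size;
};

class QueryDispatchLibrary {
public:
    // Called for every query LoadGen generates: pick a connection
    // (for example, round-robin arbitration) and post an RDMA transfer
    // of the input tensors to the Accelerator node.
    virtual void IssueQuery(const std::vector<QuerySample>& samples) = 0;

    // Invoked when a response lands in the local RDMA buffer, so the
    // harness can report the completion back to LoadGen.
    virtual void OnResponse(uint64_t sample_id, const void* output, size_t size) = 0;

    virtual ~QueryDispatchLibrary() = default;
};
```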

Figure 2. InfiniBand data exchange component inside the QDL. IBDevice represents a SmartNIC discovered in the system; IBCfgs provides user configuration for the NIC device to use and its parameters, such as buffer size, queue size, and NUMA information; IBResources maintains local and remote buffers, as well as memory regions registered for RDMA, for the associated IBDevice; IBConnection establishes InfiniBand queue pairs and provides communication hooks.

Figure 2 shows the data exchange component built in the QDL with InfiniBand technology.
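
For illustration, the four components in Figure 2 can be modeled with data structures along the following lines; the field layout is an assumption based on the figure description and builds on the standard ibverbs types:

```cpp
// Sketch of the Figure 2 components; field names are assumptions.
#include <infiniband/verbs.h>
#include <cstdint>
#include <string>

struct IBCfgs {                // user configuration for one NIC
    std::string device_name;   // e.g., "mlx5_0"
    size_t buffer_size;        // RDMA buffer size
    int    queue_size;         // QP/CQ depth
    int    numa_node;          // NUMA affinity of the NIC
};

struct IBDevice {              // a SmartNIC discovered in the system
    ibv_context* ctx;          // opened device context
    ibv_pd*      pd;           // protection domain
    IBCfgs       cfg;
};

struct IBResources {           // buffers and registered memory regions
    void*    local_buf;        // local RDMA buffer
    ibv_mr*  local_mr;         // registered with ibv_reg_mr
    uint64_t remote_addr;      // peer buffer address (exchanged out of band)
    uint32_t remote_rkey;      // peer memory-region key
};

struct IBConnection {          // one queue pair to the peer node's NIC
    ibv_qp*     qp;
    IBDevice*   dev;
    IBResources res;
};
```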

Figure 3. Example connection established between the Frontend and Accelerator nodes through their respective network interface cards (NICs)

Figure 3 shows how connections are established between two nodes using this data exchange component.

InfiniBand queue pairs (QPs) are the basic connection points between the nodes. The NVIDIA implementation uses the lossless reliable connection (RC) transport mode, which is similar to TCP, while relying on an InfiniBand HDR optical fabric to sustain up to 200 Gbits/sec of throughput.
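
As a minimal ibverbs sketch of this piece (queue depths are illustrative and error handling is omitted), creating an RC queue pair looks roughly like this:

```cpp
// Minimal sketch: create a reliable-connection (RC) queue pair.
#include <infiniband/verbs.h>

ibv_qp* create_rc_qp(ibv_pd* pd, ibv_cq* cq) {
    ibv_qp_init_attr attr = {};
    attr.send_cq = cq;            // completions for RDMA Writes
    attr.recv_cq = cq;
    attr.qp_type = IBV_QPT_RC;    // reliable connection, comparable to TCP
    attr.cap.max_send_wr = 8192;  // illustrative depth for in-flight work requests
    attr.cap.max_recv_wr = 1024;
    attr.cap.max_send_sge = 1;
    attr.cap.max_recv_sge = 1;
    return ibv_create_qp(pd, &attr);
}
```

Before use, the QP is transitioned through the INIT, RTR, and RTS states with ibv_modify_qp, using the peer's QP number and port address exchanged out of band.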

When the benchmarking begins, the QDL initialization discovers the InfiniBand NICs available in the system. Following the configuration stored in IBCfgs, the NICs designated for use are populated as IBDevice instances. During this population, memory regions for RDMA transfers are allocated, pinned, and registered as RDMA buffers and kept in IBResources, together with proper handles.
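
A minimal sketch of this initialization path with the ibverbs API (error handling and teardown of the device context omitted) might look like the following:

```cpp
// Sketch: discover InfiniBand NICs and register a pinned host RDMA buffer.
#include <infiniband/verbs.h>
#include <cstdlib>

int main() {
    int num = 0;
    ibv_device** devs = ibv_get_device_list(&num);   // enumerate available NICs
    ibv_context* ctx  = ibv_open_device(devs[0]);    // open the NIC chosen via IBCfgs
    ibv_pd* pd        = ibv_alloc_pd(ctx);           // protection domain

    const size_t len = 1 << 20;                      // example buffer size
    void* buf = std::aligned_alloc(4096, len);       // host-side RDMA buffer
    ibv_mr* mr = ibv_reg_mr(pd, buf, len,            // pins and registers the memory
                            IBV_ACCESS_LOCAL_WRITE |
                            IBV_ACCESS_REMOTE_WRITE);
    // mr->lkey and mr->rkey are the handles kept in IBResources.

    ibv_dereg_mr(mr);
    std::free(buf);
    ibv_free_device_list(devs);
    return 0;
}
```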

RDMA buffers on the Accelerator node reside in GPU memory to leverage GPUDirect RDMA. The RDMA buffer information, together with the corresponding protection keys, is then communicated to the peer node through a TCP socket over Ethernet. This creates the IBConnection instance for the QDL.
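
On the Accelerator node, this amounts to registering GPU memory with the NIC. The sketch below assumes the GPUDirect RDMA kernel module (nvidia-peermem) is loaded; RemoteBufferInfo is a hypothetical stand-in for the metadata exchanged over the TCP socket:

```cpp
// Sketch: place the RDMA buffer in GPU memory so the NIC writes into it
// directly (GPUDirect RDMA). Error handling omitted.
#include <infiniband/verbs.h>
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdint>

struct RemoteBufferInfo {   // exchanged with the peer over a TCP socket
    uint64_t addr;          // buffer virtual address
    uint32_t rkey;          // remote protection key
};

RemoteBufferInfo register_gpu_buffer(ibv_pd* pd, size_t len) {
    void* gpu_buf = nullptr;
    cudaMalloc(&gpu_buf, len);                    // buffer lives in GPU memory
    ibv_mr* mr = ibv_reg_mr(pd, gpu_buf, len,     // NIC registers the GPU memory
                            IBV_ACCESS_LOCAL_WRITE |
                            IBV_ACCESS_REMOTE_WRITE);
    return {reinterpret_cast<uint64_t>(gpu_buf), mr->rkey};
}
```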

The QDL implementation is NUMA-aware and maps the closest NUMA host memory, CPUs, and GPUs to each NIC. Each NIC uses the IBConnection to talk to a peer’s NIC.
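
One way to implement this NUMA awareness, sketched with sysfs and libnuma (the sysfs path and pairing policy are illustrative rather than the actual NVIDIA code):

```cpp
// Sketch: find a NIC's NUMA node and allocate host staging memory near it.
#include <numa.h>       // libnuma; link with -lnuma
#include <cstddef>
#include <fstream>
#include <string>

int nic_numa_node(const std::string& ib_dev) {    // e.g., "mlx5_0"
    std::ifstream f("/sys/class/infiniband/" + ib_dev + "/device/numa_node");
    int node = -1;
    f >> node;
    return node < 0 ? 0 : node;                   // -1 means unknown affinity
}

void* alloc_near_nic(const std::string& ib_dev, size_t len) {
    int node = nic_numa_node(ib_dev);
    return numa_alloc_onnode(len, node);          // host memory on the NIC's node
}
```

GPUs can be paired with NICs in a similar way, by comparing each GPU's PCIe/NUMA affinity, available through its sysfs entry, with the NIC's node.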

Network division inference flow

Figure 4. Inference request flow from the Frontend node to the Accelerator node using GPUDirect RDMA

Figure 4 shows how the inference request is sent from the Frontend node and processed at the Accelerator node:

  1. LoadGen generates a query (inference request), which contains input tensors.
  2. The QDL redirects this query to the appropriate IBConnection based on the arbitration scheme.
  3. The query sample library (QSL) may already be registered within the RDMA buffer. If not, the QDL stages (copies) the query into the RDMA buffer.
  4. The QDL initiates the RDMA transfer on the associated QP.
  5. The InfiniBand network transfer happens through the network switch.
  6. The query arrives at the peer’s QP.
  7. The query is then transferred to the destination RDMA buffer through direct memory access.
  8. RDMA completion is acknowledged in the Accelerator node’s QDL.
  9. The Accelerator node’s QDL batches this query and tags the batch of queries to be issued to one of the node’s accelerators.
  10. The accelerator performs inference using CUDA and TensorRT and produces a response in the RDMA buffer.
  11. Once inference completes in step 10, the output tensors are populated in the RDMA buffer, and the Accelerator node transfers the response tensors back to the Frontend node in a similar fashion, but in the opposite direction.

Performance optimizations

The NVIDIA implementation uses InfiniBand RDMA Write, which offers the lowest latency. For an RDMA Write to succeed, the sender must explicitly manage the target memory buffers.
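
A minimal sketch of posting such an RDMA Write with ibverbs, where remote_addr and rkey identify the explicitly managed target buffer on the peer:

```cpp
// Sketch: post an RDMA Write that lands directly in the peer's registered buffer.
#include <infiniband/verbs.h>
#include <cstddef>
#include <cstdint>

void post_rdma_write(ibv_qp* qp, ibv_mr* mr, void* local, size_t len,
                     uint64_t remote_addr, uint32_t rkey, uint64_t wr_id) {
    ibv_sge sge = {};
    sge.addr   = reinterpret_cast<uint64_t>(local);
    sge.length = static_cast<uint32_t>(len);
    sge.lkey   = mr->lkey;

    ibv_send_wr wr = {};
    wr.wr_id               = wr_id;              // used to track the buffer slot
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;  // request a completion entry
    wr.wr.rdma.remote_addr = remote_addr;        // explicitly managed target buffer
    wr.wr.rdma.rkey        = rkey;

    ibv_send_wr* bad = nullptr;
    ibv_post_send(qp, &wr, &bad);
}
```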

Both the Frontend and Accelerator nodes manage buffer trackers to make sure that each query and response is kept in memory until consumed. For example, ResNet-50 requires that up to 8K transactions be managed per connection (QP) to sustain performance.
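
Such a buffer tracker can be as simple as a per-connection free-list of RDMA buffer slots; the sketch below is purely illustrative:

```cpp
// Illustrative per-connection tracker of in-flight RDMA buffer slots.
#include <cstdint>
#include <mutex>
#include <vector>

class SlotTracker {
public:
    explicit SlotTracker(uint32_t num_slots) {        // e.g., 8K for ResNet-50
        for (uint32_t i = 0; i < num_slots; ++i) free_.push_back(i);
    }
    bool acquire(uint32_t* slot) {                    // reserve a slot for a query
        std::lock_guard<std::mutex> lk(m_);
        if (free_.empty()) return false;              // back-pressure: too many in flight
        *slot = free_.back();
        free_.pop_back();
        return true;
    }
    void release(uint32_t slot) {                     // called once the response is consumed
        std::lock_guard<std::mutex> lk(m_);
        free_.push_back(slot);
    }
private:
    std::mutex m_;
    std::vector<uint32_t> free_;
};
```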

The NVIDIA implementation includes several key optimizations that improve scalability, InfiniBand resource efficiency, and memory system efficiency.

Each vendor’s QP implementation defines the maximum number of completion queue entries (CQEs) supported, as well as the maximum QP entry sizes. It is important to scale the number of QPs per NIC to cover the latency while keeping enough transactions in flight to achieve maximum throughput.
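
These limits can be read at startup with ibv_query_device and used to size the QP pool per NIC; the sizing formula below is only an illustration:

```cpp
// Sketch: derive how many QPs per NIC are needed for a target number of
// in-flight transactions, given the device limits.
#include <infiniband/verbs.h>

int qps_needed(ibv_context* ctx, int target_inflight) {
    ibv_device_attr attr = {};
    ibv_query_device(ctx, &attr);
    int per_qp = attr.max_qp_wr < target_inflight ? attr.max_qp_wr
                                                  : target_inflight;
    // attr.max_cqe similarly bounds how deep each completion queue can be.
    return (target_inflight + per_qp - 1) / per_qp;   // ceiling division
}
```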

Host CPUs can also be stressed significantly if an extremely large number of transactions must be handled from the completion queue (CQ) by polling in a short time. Event-based CQ management, together with a reduction in the number of notifications, helps greatly in this case. Memory access efficiency is maximized by aggregating data in contiguous space as much as possible and, where possible, within the RDMA-registered memory space; this is crucial to achieving maximum performance.
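
A sketch of event-based CQ handling with ibverbs, assuming the CQ was created with a completion channel (the fourth argument to ibv_create_cq):

```cpp
// Sketch: sleep until the completion channel signals, then drain the CQ.
#include <infiniband/verbs.h>

void wait_and_drain(ibv_comp_channel* channel) {
    ibv_cq* cq = nullptr;
    void* cq_ctx = nullptr;
    ibv_get_cq_event(channel, &cq, &cq_ctx);   // blocks until a completion event arrives
    ibv_ack_cq_events(cq, 1);                  // acknowledge the notification
    ibv_req_notify_cq(cq, 0);                  // re-arm before draining

    ibv_wc wc[64];
    int n;
    while ((n = ibv_poll_cq(cq, 64, wc)) > 0) {
        // Process n completions: release buffer slots, report results to LoadGen, and so on.
    }
}
```

Acknowledging several events in a single ibv_ack_cq_events call further reduces notification overhead.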

Summary

The NVIDIA platform delivered exceptional performance in its inaugural Network division submission, on top of our continued performance leadership in the MLPerf Inference: Datacenter Closed division. These results were achieved using many NVIDIA platform capabilities, including NVIDIA A100 Tensor Core GPUs, ConnectX-6 InfiniBand networking, GPUDirect RDMA, and the NVIDIA TensorRT and CUDA software stack.

The results provide a further demonstration of the performance and versatility of the NVIDIA AI platform in real data center deployments on industry-standard, peer-reviewed benchmarks.

Source: NVIDIA
