
Nvidia on Wednesday released a set of MLPerf Inference V5.0 benchmark results for its Blackwell GPU, the successor to Hopper, saying that its GB200 NVL72 system, a rack-scale offering designed for AI reasoning, set a series of performance records.
The benchmarks included the latest updates to MLPerf Inference, including the addition of Llama 3.1 405B, which Dave Salvator, director of accelerated computing products at Nvidia, called “one of the largest and most challenging-to-run open-weight models”, and the new Llama 2 70B Interactive benchmark, which features much stricter latency requirements compared with the original Llama 2 70B benchmark, more closely modeling how chatbots work.
According to Salvator, the system, which connected 72 Blackwell GPUs to act as a single GPU, delivered up to 30 times higher throughput on the Llama 3.1 405B workload compared to the company’s H200 NVL8, which is based on the Hopper architecture.
That isn’t to say that Hopper is lagging. In the latest round of benchmarks, the three-year-old architecture still managed a 60% performance boost over last year on the Llama 2 70B workload. “What it shows,” he said, “is that the Hopper architecture, despite being in market for some time now, still has some headroom in terms of performance.”
Salvator noted that the company has almost tripled its real-time large language model (LLM) inference throughput on the Llama 2 70B Interactive workload, delivering almost 60,000 tokens per second (TPS) on a per-server basis on the Blackwell platform.
In addition, he said, Blackwell achieved a nearly 5x performance gain on DeepSeek R1 inference in just one month, delivering more than 250 TPS per user at minimum latency and more than 30,000 TPS of maximum per-server throughput on the DGX B200.
Fifteen partners, including Cisco, Fujitsu, Hewlett Packard Enterprise, Dell Technologies, Oracle, and Google Cloud, were involved in the latest round of MLPerf testing, which, he noted, was the largest number of Nvidia partners submitting to the benchmark in any given round.
When asked for his overall impression of the latest results, Jim McGregor, principal analyst at Tirias Research, said, “the first thing is that MLPerf continues to adjust to the demands of the market and now reaches to the level of genAI. It still doesn’t reach agentic AI, but it will get there in future releases.”
The second, he said, is “everyone benefits from continued optimizations. We see this on both CPUs and GPUs, especially with the move to FP4 by Nvidia. Note that they are the only ones supporting FP4 at this time. And third, Nvidia keeps pushing the limits with each generation while improving the performance, through continued optimizations, of older generations.”
“What DeepSeek did in terms of changing the game was show the world what was possible,” he said. “This is another big innovative step forward. Some, frankly incorrectly, surmised that because of some of their innovations that you wouldn’t need nearly as much infrastructure to deliver AI, but that is not really a correct understanding of what DeepSeek did. Yes, they delivered algorithmic advances which can reduce the amount of time required to do very complex operations. But you know, if you’ve seen the trend line in AI over the years, AI does not take a step back.”
AI moves forward, Salvator agreed, “and what does that enable? It enables us to do more on the infrastructure we have today, and then even more on the infrastructure we’ll have tomorrow. That has been the typical sort of trend line you see when algorithmic innovations come, they enable the next round of activity, the next round of sophistication to come to AI. In this case, it’s agentic AI.”
Although it wasn’t ready for the current round of benchmarking, Salvator pointed out that the company’s Dynamo open-source inference software, introduced at GTC 2025, will up the ante even further.
McGregor described Dynamo as significant: “Think of an OS for an entire data center. This gets to [Nvidia CEO Jensen Huang’s] view that the new unit of compute is the data center, or to put it another way, you have to think of the entire data center as a single server.”
Everything being done in AI hardware and software today, he said, “is about improving performance efficiency, and one of the key areas to do that is to make the entire solution, or in this case, the data center, work more efficiently. It even leverages a software cache structure, KV cache, to improve performance efficiency. So, this is huge.”
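The KV cache McGregor refers to is a standard LLM-serving optimization: attention keys and values for tokens already processed are stored so that each new decoding step only computes the newest token’s key/value pair instead of recomputing the whole prefix. A minimal sketch of the idea (illustrative only — `fake_kv` is a hypothetical stand-in for the per-token projection, and this is not Nvidia’s Dynamo implementation):

```python
# Illustration of why a KV cache helps autoregressive decoding:
# without one, step t recomputes keys/values for all t prefix tokens;
# with one, each step computes only the newest token's pair.

def fake_kv(token):
    # Hypothetical stand-in for the key/value projection of one token.
    return (hash(("k", token)), hash(("v", token)))

def decode_with_cache(tokens):
    cache = []   # grows by one (key, value) entry per token
    work = 0     # number of key/value computations performed
    for tok in tokens:
        cache.append(fake_kv(tok))  # only the new token is projected
        work += 1
    return work

def decode_without_cache(tokens):
    work = 0
    for t in range(1, len(tokens) + 1):
        work += t  # reproject the entire prefix at every step
    return work

seq = list(range(100))
print(decode_with_cache(seq))     # 100 computations (linear)
print(decode_without_cache(seq))  # 5050 computations (quadratic)
```

The linear-versus-quadratic gap is why cache management is treated as a first-class scheduling problem in data-center-scale serving.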
As for the touted “up to 40X” AI factory productivity hike that Nvidia said it achieved with Blackwell, he said this was accomplished using FP4 optimization and the latest Blackwell solutions. “But remember that even a 2X increase in performance is significant,” he pointed out. “So, if you get 10X or more, it’s a huge jump in performance. With the push toward more cost-efficient processing, AI providers are going to be looking to squeeze as much performance efficiency out of these data centers as possible. So, even that 40x number may be low as other improvements in models, optimizing, and processing are introduced.”
As for the tricky issue of how best to balance the need for more computing power against the constrained supply of the electricity needed to run it, McGregor described it as an “ongoing problem. We have to make everything operate more efficiently to minimize power consumption and reduce costs, and we have to find better power solutions like small modular reactors.”
Power generation, he said, “will continue to be an issue for the foreseeable future. Another issue is the data center infrastructure, because the power and cooling solutions for these high-end data centers are expensive and custom. They need to be more modular to allow for future modifications and to reduce the time to operation.”
An Nvidia spokesperson said the company approaches power efficiency in a couple of ways: “The Blackwell architecture is a more efficient architecture than the previous generation, meaning we get more performance within a given power budget with the Blackwell-based GPUs. In addition, reduced precisions like FP8 and now FP4 with Blackwell bring performance increases that allow more work to be done using less infrastructure, also increasing overall efficiency.”
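The efficiency gain from reduced precision is easy to see with back-of-envelope arithmetic: each halving of bits per weight halves the memory needed just to hold a model’s parameters. A hedged sketch for a 405-billion-parameter model such as Llama 3.1 405B (weight storage only — real deployments also need memory for the KV cache, activations, and framework overhead, and these figures are not Nvidia’s published numbers):

```python
# Approximate weight-memory footprint of a 405B-parameter model
# at different numeric precisions (parameters-only estimate).

PARAMS = 405e9                     # 405 billion parameters
BITS = {"FP16": 16, "FP8": 8, "FP4": 4}

for fmt, bits in BITS.items():
    bytes_total = PARAMS * bits / 8
    gib = bytes_total / 2**30      # convert bytes to GiB
    print(f"{fmt}: ~{gib:,.0f} GiB")
```

Moving from FP16 to FP4 cuts the parameter footprint by 4x, which is part of why the same model can be served with less hardware, provided accuracy holds up after quantization.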
Source: Network World