New MLCommons benchmarks to test AI infrastructure performance

MLCommons has released a new set of MLPerf Inference benchmarks, offering a closer look at how current-generation data center and edge hardware performs under increasingly demanding AI workloads.

The updated MLPerf Inference v5.0 comes as infrastructure teams grapple with surging demand from generative AI applications like chatbots and code assistants, which require rapid processing of large and complex queries.

The benchmarks provide standardized ways to compare the speed and responsiveness of hardware platforms powering these tools.

One of the new tests is built around Meta’s Llama 3.1 405-billion-parameter model, used to evaluate a system’s ability to execute intensive tasks like math, question answering, and code generation. Another benchmark focuses on low-latency inference, simulating real-time interaction scenarios using Meta’s Llama 2 70B model.

MLCommons has published MLPerf Inference v5.0 benchmarking results submitted by 23 organizations, including AMD, Broadcom, Cisco, CoreWeave, Dell, Fujitsu, Google, Hewlett Packard Enterprise, Intel, Nvidia, Oracle, and Supermicro.

Nvidia had already shared its results for the updated benchmarks, highlighting its new Blackwell GPU as a major leap over the previous Hopper architecture.

The latest release also broadens its scope beyond chatbot benchmarks. A new graph neural network (GNN) test targets datacenter-class hardware and is designed for workloads like fraud detection, recommendation engines, and knowledge graphs. It uses the RGAT model based on a graph dataset containing over 547 million nodes and 5.8 billion edges.

Judging performance

Analysts suggest that these benchmarks will make it easier to judge the performance of various hardware chips and clusters based on documented models.

“As every chipmaker seeks to prove that its hardware is good enough to support AI, we now have a standard benchmark that shows the quality of question support, math, and coding skills associated with hardware,” said Hyoun Park, CEO and Chief Analyst at Amalgam Insights. 

Chipmakers can now compete not just on traditional speeds and feeds, but on mathematical skill and informational accuracy. This benchmark provides a rare opportunity to add new performance standards across cross-vendor hardware, Park added.

“The latency in terms of how quickly tokens are delivered and the time for the user to see the response is the deciding factor,” said Neil Shah, partner and co-founder at Counterpoint Research. “This is where players such as NVIDIA, AMD, and Intel have to get the software right to help developers optimize the models and bring out the best compute performance.”

Benchmarking and buying decisions

Independent benchmarks like those from MLCommons play a key role in helping buyers evaluate system performance, but relying on them alone may not provide the full picture.

“These benchmarks still fall short of benchmarking real-world work in detail,” Park said. “For instance, the question-answering benchmark may help to benchmark a specific aspect of customer service, but it does not replace a customer service performance analysis.”

This means that while the new benchmarks mark a significant step forward, they are likely to be just one of several factors guiding enterprise hardware procurement decisions.

“When it comes to enterprise hardware procurement decisions, they will be shaped by a multitude of factors, of which one will be compute power,” said Abhishek Sengupta, practice director at Everest Group. “Hardware manufacturers that may not be top of the line on some performance benchmarks may offer better commercial terms to offset this, offering a better price-to-performance trade-off.”

The performance of an AI use case depends on a range of factors across the technology stack, as well as human input. Narrow benchmarks targeting isolated components may fall short of capturing the full impact on real-world outcomes, Sengupta added.

Source: Network World