Nvidia Blackwell chips face serious heating issues

Nvidia’s next-generation Blackwell data center processors have significant problems with overheating when installed in high-capacity server racks, forcing redesigns of the racks themselves, according to a report by The Information.

These issues have reportedly led to design changes that are delaying shipments, raising concern that Nvidia’s biggest customers, including Google, Meta, and Microsoft, will not be able to deploy Blackwell servers according to their schedules.

According to insiders familiar with the situation who spoke with The Information, Nvidia’s Blackwell GPUs overheat in ultra-dense servers with 72 processors. Each Blackwell processor draws more than 1000 W of power, so 72 of them put upward of 72 kW of power and heat into a relatively small space.
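As a rough illustration of that arithmetic, the short Python sketch below multiplies the two figures from the report (72 processors, more than 1000 W each) to get a lower bound on the GPU power in one such system. The function and constant names are made up for this example, and the result counts only the GPUs; CPUs, networking, and cooling overhead are not included, so the real figure would be higher.

```python
# Figures from the report; names are illustrative, not Nvidia's.
GPUS_PER_SYSTEM = 72        # processors in the ultra-dense configuration
MIN_WATTS_PER_GPU = 1000    # each GPU draws "more than 1000 W"


def min_gpu_power_kw(gpus: int = GPUS_PER_SYSTEM,
                     watts_per_gpu: int = MIN_WATTS_PER_GPU) -> float:
    """Return a lower bound on total GPU power draw, in kilowatts."""
    return gpus * watts_per_gpu / 1000


if __name__ == "__main__":
    print(f"At least {min_gpu_power_kw():.0f} kW from GPUs alone")
```

Nearly all of that electrical power ends up as heat that the rack’s cooling has to remove.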

Nvidia is said to be working closely with suppliers and partners to develop revisions and make design changes to address the overheating issues. Such redesigns are not uncommon, but in this case they are pushing back the expected ship date, which was supposed to fall in this quarter.

These are not the first rumors to plague Blackwell. In August, word came out that Nvidia and its manufacturing partner TSMC were dealing with yield issues due to the packaging design of the processor. That problem was quickly addressed, however, and more or less dismissed on the quarterly earnings call.

Nvidia reports earnings on Wednesday, November 20, after the close of trading on the stock market. For now, a company spokesperson said this:

“Nvidia GB200 systems are the most advanced computers ever created. Integrating them into a diverse range of data center environments requires co-engineering with our customers. Our engineering iterations are in line with expectations. Some of our partners including Dell Technologies and CoreWeave are promoting new Nvidia GB200 NVL72 designs here at SC and on social media.”

Anshel Sag, principal analyst with Moor Insights & Strategy, isn’t completely sold on the claims. “I think it’s too early to tell if this is a widespread issue or a configuration problem. I can’t imagine that Nvidia would ship a part that overheats, especially with the amount of cooling that’s already necessary,” he said, adding that the timing of this news is suspect. The Supercomputing 24 conference is taking place, and he wouldn’t put it past an Nvidia competitor to try to kneecap the company.

“Supercomputing is when everyone who’s anyone in the HPC world is meeting up and talking rumors and shop and today would be the day to drop a big rumor like this to get it to spread across the industry like wildfire,” he said. “If it were more organic, it would’ve spread after the show as people talked privately and gossiped. This almost feels like a leak the competition would spread to get more eyeballs on the competitive platforms.”

Source: Network World