
Most enterprises deploying AI technology have been turning to hyperscalers such as Amazon, Google and Microsoft to provide all the necessary infrastructure.
According to IDC data released this month, cloud and shared environments accounted for most AI server spending in the first half of 2024, taking a 72% share, because enterprises have lagged in adopting AI infrastructure on premises.
On-premises AI does offer some benefits. Companies can keep their data local, for example, or reduce lag by putting their computing capacity close to where it is needed. And if companies know they’re going to need a predictable amount of computing every month, on-premises deployments can even reduce costs.
But many companies are still in the experimental phase, and don’t know how much compute they’ll need.
And there’s another problem: traditional infrastructure just doesn’t cut it when it comes to AI. These applications require AI-optimized servers, storage, and networking, and all of those components need to be configured to work well together. It’s a brand-new skill set.
“The technology stack here is completely alien,” said Neil MacDonald, EVP and GM for high performance computing and AI at HPE, in a presentation late last year. “It doesn’t look anything like the technology stack that even well-experienced enterprises have.”
“One of the challenges for AI — for any brand new technology — is putting the right combination of infrastructure together to make the technology work,” says Zeus Kerravala, founder and principal analyst at ZK Research. “If one of those components isn’t on par with the other two, you’re going to be wasting your money.”
Time is taking care of the first problem. More and more enterprises are moving from pilot projects to production, and getting a better idea of how much AI capacity they actually need.
And vendors are stepping up to handle the second problem with packaged AI offerings that integrate servers, storage, and networking into a single stack, ready to deploy on-prem or in a colocation facility.
All the major vendors, including Cisco, HPE, and Dell, are getting in on the action, and Nvidia is rapidly striking deals to get its AI-capable GPUs into as many of these deployments as possible.
For example, Cisco and Nvidia just expanded their partnership to bolster AI in the data center. The vendors said Nvidia will couple Cisco Silicon One technology with Nvidia SuperNICs as part of its Spectrum-X Ethernet networking platform, and Cisco will build systems that combine Nvidia Spectrum silicon with Cisco OS software.
That offering is only the latest in a long string of announcements by the two companies. For example, Cisco unveiled its AI Pods in October, which leverage Nvidia GPUs in servers purpose-built for large-scale AI training, as well as the networking and storage required.
Other vendors are following suit.
HPE, for example, announced the shipment of its rack-scale system, also using Nvidia GPUs, in mid-February. What sets HPE’s offering apart is that it uses direct liquid cooling, which makes it suitable for very large, complex AI clusters.
Dell also announced a new integrated rack for AI, offering both air and liquid cooling options, in late 2024.
These and other packaged AI solutions will make it easier for more enterprises to deploy AI, experts say, though Cisco and HPE have the strongest offerings and the most robust ecosystems.
“I’d take the HPE validated solution and the Cisco validated solution and test them against each other,” says ZK Research’s Kerravala.
According to Gartner, enterprises spent $28 billion on AI-optimized servers in 2024. This year, that number will jump to $34 billion, predicts Gartner analyst Tony Harvey, and will grow to $44 billion by 2028.
The timing is critical. According to a Cisco survey of nearly 8,000 senior business leaders with AI responsibility, 98% report increased urgency to deliver on AI, but only 21% say they have enough GPUs.
And not all enterprises want to run all their AI workloads in public clouds.
“There are enterprises that want to implement AI today but don’t want their data to go into a public environment,” says Harvey. “The data is their lifeblood.”
Similarly, they might not trust the big AI vendors with their data, either, he says. “We know they’ve violated copyrights and done all sorts of things.”
Then there’s the cost issue.
“You get to a certain point where it’s cheaper to run it on your own hardware instead of running it in the cloud,” Harvey says.
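Where that crossover lands depends heavily on utilization. As a rough illustration only, with placeholder figures rather than any vendor’s pricing, a break-even estimate can be sketched like this:

```python
# Rough break-even sketch for cloud vs. on-prem GPU spend.
# All figures below are hypothetical placeholders, not vendor pricing.

CLOUD_COST_PER_GPU_HOUR = 3.00      # assumed on-demand rate, USD
GPUS_NEEDED = 8                     # assumed steady-state cluster size
UTILIZATION_HOURS_PER_MONTH = 500   # assumed busy hours per GPU per month

ONPREM_CAPEX = 300_000              # assumed purchase price of an 8-GPU server
ONPREM_OPEX_PER_MONTH = 4_000       # assumed power, cooling, colo, support

monthly_cloud = CLOUD_COST_PER_GPU_HOUR * GPUS_NEEDED * UTILIZATION_HOURS_PER_MONTH
monthly_saving = monthly_cloud - ONPREM_OPEX_PER_MONTH

if monthly_saving > 0:
    breakeven_months = ONPREM_CAPEX / monthly_saving
    print(f"Cloud spend: ${monthly_cloud:,.0f}/month")
    print(f"Break-even after ~{breakeven_months:.1f} months on-prem")
else:
    print("At this utilization, cloud remains cheaper.")
```

With these particular placeholder numbers the hardware pays for itself in roughly three years; at lower, bursty utilization the cloud keeps the advantage, which is why the calculation only starts to favor on-prem once workloads become predictable.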
Specialized infrastructure environments
So, what makes AI deployment different?
The most obvious difference, of course, is that AI workloads require specialized processors, most commonly Nvidia GPUs.
Then there’s the data: AI training requires a lot of it, and fine-tuning and RAG embedding have data needs of their own, needs that don’t match what enterprises typically have deployed.
“It’s going to be a different type of storage — probably object storage rather than file block storage,” says Harvey.
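As a minimal, hypothetical sketch of what that looks like in practice, a training pipeline might stream shards from an S3-compatible object store (on-prem or hosted) through boto3; the endpoint, bucket, and prefix names below are placeholders, not anything tied to a particular product:

```python
# Minimal sketch: streaming training shards from an S3-compatible
# object store rather than a POSIX file system. Endpoint, bucket,
# and prefix are hypothetical placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://objects.example.internal",  # assumed on-prem endpoint
)

bucket = "training-data"          # hypothetical bucket
prefix = "datasets/finetune-v1/"  # hypothetical prefix holding data shards

# List the shards under the prefix, then stream each object's bytes.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"]
        shard_bytes = body.read()
        # ...hand shard_bytes to the data-loading pipeline...
```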
Networking is different, too: all the GPUs have to talk to each other, and AI creates what Harvey calls “elephant flows” between GPUs that can overwhelm standard networks.
“The networking in AI is very, very different to the networking in a standard environment,” says Harvey. “If I drop my AI cluster on a core backbone network, my core backbone network would stop working.”
The packaged AI solutions address this problem by creating a separate network dedicated to AI traffic.
“In a large scale GPU cluster, you have a front-end network which is how you connect into the cluster, and the back end network which is how all the GPUs in the cluster connect to each other,” says Kevin Wollenweber, SVP & GM for Cisco Networking. “It’s designed to be congestion-free and lossless.”
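From a workload’s point of view, that separation shows up as a choice of which fabric the GPU traffic rides on. The hedged sketch below shows how a PyTorch training job could be steered onto a dedicated back-end fabric through NCCL’s environment variables; the interface and adapter names are placeholders, not anything prescribed by Cisco or Nvidia:

```python
# Sketch: pointing a PyTorch/NCCL training job at a dedicated
# back-end fabric instead of the general-purpose front-end network.
# The interface and adapter names are placeholders for whatever the
# cluster's back-end network actually exposes.
import os
import torch.distributed as dist

# Steer NCCL's bootstrap and RDMA traffic onto the back-end NICs.
os.environ["NCCL_SOCKET_IFNAME"] = "eth1"        # assumed back-end interface
os.environ["NCCL_IB_HCA"] = "mlx5_0,mlx5_1"      # assumed InfiniBand/RoCE adapters

# Rank, world size, and the rendezvous address are supplied by the
# job launcher (e.g. torchrun); GPU-to-GPU collectives then run over
# the back-end fabric.
dist.init_process_group(backend="nccl")
```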
The final difference is that, when it comes to the most intensive AI use cases, such as training large language models, traditional air cooling might not be enough.
“Large, powerful AI systems need direct liquid cooling as liquid removes more than 3,000 times more heat based on volume compared to air,” says Trish Damkroger, SVP and GM for high performance computing and AI infrastructure solutions at HPE.
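That figure lines up with a textbook comparison of volumetric heat capacity. The back-of-envelope check below, using approximate room-temperature properties of water and air, is an illustration rather than anything from HPE:

```python
# Back-of-envelope check of the liquid-vs-air figure using approximate
# room-temperature properties.
water_density = 997.0        # kg/m^3
water_specific_heat = 4186   # J/(kg*K)

air_density = 1.2            # kg/m^3
air_specific_heat = 1005     # J/(kg*K)

# Volumetric heat capacity: heat absorbed per cubic metre per kelvin.
water_volumetric = water_density * water_specific_heat  # ~4.2e6 J/(m^3*K)
air_volumetric = air_density * air_specific_heat         # ~1.2e3 J/(m^3*K)

print(f"Water/air ratio: {water_volumetric / air_volumetric:,.0f}x")
# ~3,460x, consistent with "more than 3,000 times" by volume.
```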
Is too much Nvidia ever enough?
One common thread of these packaged AI solutions is the reliance on Nvidia GPUs.
That can raise concerns about vendor lock-in and supply chain resilience.
“They all require the same core Nvidia building blocks,” says John Sheehy, SVP of research and strategy at IOActive, a cybersecurity firm. That creates a monoculture almost entirely reliant on one company for a critical node in the supply chain for AI and “exposes everyone in society to an unacceptable concentration of risk.”
“TSMC is a similarly critical node in the supply chain for the manufacture of the Nvidia chips,” he adds. “This dependence on a single chip foundry just offshore of an expansionist, revisionist power should worry everyone.”
When it comes to AI, other chip makers have been lagging far behind Nvidia in capability, scale, or both.
“They are the market leader, and will continue to be,” says Jason Carolan, chief innovation officer at colocation provider Flexential. “Nvidia is a core player, and will continue to build some of the most sophisticated and capable platforms for years to come.”
That’s particularly true for training large language models.
“When inference is of higher importance, it opens up others to participate,” he says.
But even in AI training, there might be options.
“Looking at DeepSeek, it opens new innovations that will impact the market, and continue to open up new ways of optimization,” he says.
DeepSeek is a Chinese startup whose open source AI model, released last month, shook the world with its leading-edge reasoning capabilities and lower computing requirements. Nvidia immediately lost nearly $600 billion in market value, the largest single-day loss of market value for any company in history.
Source: Network World