
Whether it is a hyperscaler or enterprise endeavor, powering a data center is proving to be much more challenging than building one.
For hyperscalers like AWS and Google, the primary challenge is simply getting enough power to the data center. It’s gotten so bad that some organizations are looking to build small nuclear power plants right next to their data centers so as not to be dependent on the local power grid.
But enterprises, such as the Fortune 500, face a different challenge: managing the power. Whereas a few years ago the typical 42U server rack drew about 15 kW, today, in the age of AI, organizations are looking at racks with 1 MW of power consumption.
The explosion in power density comes from three sources: first, server CPU power draws have increased significantly, from around 200 watts a few years ago to 500 watts more recently; second, GPU accelerators for AI processing add up to eight new cards per server, each drawing roughly 1 kW of power; and third, an overall increase in density to reduce latency.
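To make those figures concrete, here is a minimal back-of-envelope sketch in Python. It uses the numbers cited above; the server counts and the overhead factor are assumptions for illustration, not figures from the article:

```python
# Back-of-envelope rack power budget using the figures cited above.
# Server counts and the overhead factor are illustrative assumptions.

CPU_WATTS = 500        # modern server CPU (up from ~200 W a few years ago)
GPU_WATTS = 1_000      # one AI accelerator card, roughly 1 kW
CPUS_PER_SERVER = 2    # assumed dual-socket server
GPUS_PER_SERVER = 8    # "up to 8 new cards" per server
OVERHEAD_FACTOR = 1.3  # assumed: fans, memory, NICs, power-conversion losses

server_watts = (CPUS_PER_SERVER * CPU_WATTS
                + GPUS_PER_SERVER * GPU_WATTS) * OVERHEAD_FACTOR

for servers in (8, 16, 32):
    rack_kw = servers * server_watts / 1_000
    print(f"{servers:>2} such servers per rack -> ~{rack_kw:.0f} kW")
```

Even a rack packed with GPU servers on these assumptions lands in the hundreds of kilowatts, which is why megawatt-class racks represent such a step change in design.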
AI workloads are highly sensitive to latency. That’s why NVIDIA developed NVLink to keep its GPUs fed with data. With LLM training runs taking weeks, customers take every opportunity to reduce processing time. That means squeezing all the equipment together, so data doesn’t have to travel over a wire any farther than necessary.
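The physics behind that squeeze is straightforward: every meter of cable adds propagation delay, and at GPU-interconnect timescales those nanoseconds accumulate. A minimal illustrative sketch (the 0.7c velocity factor is a typical assumption, not a figure from the article):

```python
# Why distance matters: rough propagation delay per meter of cable.
# Assumes signals travel at ~0.7c, a typical value for copper or fiber.

C = 3.0e8              # speed of light in a vacuum, m/s
VELOCITY_FACTOR = 0.7  # assumed fraction of c for signal velocity in cable

def round_trip_ns(distance_m: float) -> float:
    """Round-trip propagation delay, in nanoseconds, over a cable run."""
    return 2 * distance_m / (C * VELOCITY_FACTOR) * 1e9

for meters in (1, 10, 100):
    print(f"{meters:>3} m of cable adds ~{round_trip_ns(meters):.0f} ns round trip")
```

At roughly 10 ns of round-trip delay per meter, a run across the data hall is already on the order of the latencies fast GPU interconnects are designed to minimize.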
“There were some IT guys ten years ago that said, ‘Hey, let’s do a 50 kW rack.’ And the infrastructure manager says, ‘Hey, thank you for your suggestion. I’m going to do ten racks of 5 kW and that’s going to meet your needs the same way,’ because people did not see being spread out through more space as a problem at all,” said Alex Cordovil, research director for data center physical infrastructure and liquid cooling at the Dell’Oro Group.
“Now, with AI, GPUs need data to do a lot of compute and send that back to another GPU. That connection needs to be close together, and that is what’s pushing the density. The chips are more powerful and so on, but the necessity of everything being close together is what’s driving this big revolution,” he said.
That revolution is producing new data center designs. Cordovil said that instead of putting the power shelves within the rack, system administrators are installing a sidecar next to the racks and loading it with the power system, which then serves two to four racks. This frees space for more compute per rack and lowers latency, since the data doesn’t have to travel as far.
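A minimal sketch of the space argument behind the sidecar approach. The unit counts below are assumptions for illustration; the article doesn’t specify how many rack units power shelves occupy:

```python
# Illustrative space math for moving power shelves into a sidecar.
# All unit counts are assumptions, not figures from the article.

RACK_UNITS = 42        # standard 42U rack
POWER_SHELF_UNITS = 8  # assumed U consumed by in-rack power shelves
SERVER_UNITS = 2       # assumed 2U GPU server

def servers_per_rack(power_in_rack: bool) -> int:
    """Servers that fit depending on whether power shelves take up rack space."""
    usable = RACK_UNITS - (POWER_SHELF_UNITS if power_in_rack else 0)
    return usable // SERVER_UNITS

print("power shelves in rack: ", servers_per_rack(True), "servers")
print("power moved to sidecar:", servers_per_rack(False), "servers")
```

On these assumptions, a sidecar buys a handful of extra servers per rack, and the gain compounds because one sidecar serves several racks.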
The problem is that 1 MW racks are uncharted territory, and no one yet knows how to manage that much power. “There’s no user manual that says, hey, just follow this and everything’s going to be all right. You really need to push the boundaries of understanding how to work. You need to start designing something somehow, so that is a challenge to data center designers,” he said.
And this brings up another issue: many corporate data centers have power plugs more or less like the ones you have at home, so technicians didn’t need advanced electrical certifications. “We’re not playing with that power anymore. You need to be very aware of how to connect something. Some of the technicians are going to need to be certified electricians, which is a skills gap in the market that we see in most markets out there,” said Cordovil.
A CompTIA A+ certification will teach you the basics of power, but not the advanced skills needed for these increasingly dense racks. Cordovil admits the issue has yet to be fully addressed.
“I don’t think the industry has converged in a direction yet. Data center operators have struggled with a shortage of well-trained and experienced technical workers for years, and this is only adding pressure to an already existing structural challenge,” he said.
In the end, he believes compromise will need to come from all directions: IT technicians will need to skill up, physical infrastructure vendors will need to make their designs easier to maintain and operate and follow clearer standards, and systems and controls will need to get smarter to relieve the burden on personnel.
Source: Network World