Network bloat: AI-driven data movements cause cloud overspend

Enterprise cloud spending is spiraling out of control, with over half of organizations estimating that more than 40% of their cloud budget is wasted on preventable mistakes and inefficient processes.

And the accelerating growth in AI adoption isn’t helping. Cloud networking is often an overlooked area, and the growing demands of AI-driven data movement are compounding the challenge. Meanwhile, the rise of agentic AI could increase traffic exponentially—and make it far less predictable.

Unexpected cloud costs

More than half of all enterprise workloads are now running in public clouds, and cloud spending is projected to increase by 28% this year, according to Flexera’s State of the Cloud Report, which was released in late March. The company’s survey of more than 700 cloud decision-makers found that 40% of enterprises spend more than $12 million per year on public cloud, up from 36% last year.

The report also shows that 27% of cloud spending is wasted. That’s down from a high of 32% four years ago, but it’s still an unreasonable amount, and cloud networking costs are a big part of the picture. “The hidden costs of cloud aren’t the compute, but the networking and storage,” says Kevin Mortimer, head of operations at the University of Reading in England’s Berkshire county. “It’s very easy to create very large networks in public cloud.”

In theory, cloud has an advantage over traditional deployments because it can scale up and down as needed, he says. “But even when you do turn it off, it is using resources behind the scenes and driving the cloud bills,” Mortimer explains. The university itself has had cases in which very large networks were created, which leads to network bloat and security concerns, he says. Then there are the costs of moving data into and out of the cloud. “Nobody talks about the costs of egress.”

“Cloud networking is a silent budget killer,” says Matt Biringer, CEO and co-founder at North, a cloud spending optimization company. “You think you’re spending on compute, then discover half your bill is cross-region data transfer. It’s the kind of cost that flies under the radar until it’s too late.”

According to a survey released last October by Stacklet, a cloud governance company, half of the companies polled say that 40% or more of their cloud spending is wasted—and the larger the organization, the more wasteful the cloud spending.

There are a lot of reasons for the wasted spending, says Scott Wheeler, cloud practice lead at Asperitas, a cloud consultancy. Many companies, even after all these years, still haven’t adapted to the switch from capital expenses to operating expenses, he says. For example, they might not have a specific individual or group tasked with managing cloud costs.

“Or sometimes the person or group that’s responsible doesn’t have it as their primary responsibility,” Wheeler explains. “They’ll have several other things they’re responsible for, and, oh, by the way, cost is there.”

For instance, a business unit might allocate a certain budget for a project and then, once it’s up and running, nobody ever goes back and checks that the full allocation is still needed. “It’s a hassle to go back and look at it,” Wheeler says.

There are plenty of tools available to monitor costs, data retention policies, misconfigurations, and other cloud issues. “There are tools native to Azure, Google, and AWS,” Wheeler says. “And there are third-party tools for cost management, but sometimes people don’t set these things up.”
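To make that concrete, the Python sketch below shows the kind of check such tooling automates: it uses the AWS Cost Explorer API (via boto3) to break a month’s spend down by service and flag anything over a review threshold. The date range and threshold are placeholder values, and Azure Cost Management and Google Cloud Billing offer comparable APIs.

# Sketch: break one month of AWS spend down by service with the Cost Explorer API.
# Assumes boto3 credentials are already configured; the dates and the review
# threshold are illustrative placeholders, not recommendations.
import boto3

ce = boto3.client("ce")

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2025-04-01", "End": "2025-05-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)

THRESHOLD_USD = 10_000  # placeholder review threshold
for period in response["ResultsByTime"]:
    for group in period["Groups"]:
        service = group["Keys"][0]
        amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
        if amount >= THRESHOLD_USD:
            print(f"{service}: ${amount:,.2f} -- flag for review")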

One client recently declined to spend $200,000 on a project that would have saved $2 million per year in cloud costs by reducing the amount of logging data that was stored, Wheeler says. “It just wasn’t a priority. Leadership has other things they have to get done.”

Finally, people might be reluctant to cut back on cloud spending because there’s a risk to doing so.

“Scaling down can be a tricky thing,” says Erik Peterson, founder and CTO at CloudZero, a cloud cost optimization company. “Sometimes people say, I don’t want to put my career on the line to bring systems down if we don’t need them. Am I going to be fired for having a system deny user access, or be fired for spending a couple of extra bucks?”

But that argument only holds true when the economy is going well, he adds. During a downturn, this kind of waste can’t be tolerated. A Censuswide survey of 300 CIOs, released in late March and conducted on behalf of Azul, a Java platform company, found that 83% are spending more than they expected on cloud.

Still, cloud infrastructure costs less than the alternative. According to the survey, 80% of CIOs said they see net cost savings from moving to the cloud, and they plan to move even more workloads there. Today, 68% of infrastructure and application workloads are hosted in public, private, or hybrid clouds. Within five years, that percentage will increase to 75%, the CIOs said, which creates even more opportunities for cloud waste.

Meanwhile, if the economy continues to get worse, there will be increased pressure on technology managers to make sure that none of this spending is wasted. Cost efficiency is only the second-most important business driver for cloud migration. The top one? AI and data analytics. When it comes to AI, spending overruns have the potential to be even more dramatic—and even harder to predict.

The trouble with AI

AI will have a cumulative impact of $22 trillion on the global economy by 2030, according to an April report by IDC.

“Organizations around the world are signaling a growing commitment to AI investment,” said IDC analyst Carla La Croce in the report.

Generative AI spending is expected to reach $644 billion in 2025, an increase of 76% over last year, according to Gartner. And in an April Wakefield Research survey of more than 1,000 IT executives, 63% said their organizations have “fully integrated” generative AI, while another 24% have deployed it—and 62% said they’ve seen more than 100% ROI on their investments.

When it comes to agentic AI, the survey results are even more dramatic. Nearly all respondents—94%—say they expect to adopt agentic AI even faster than genAI, with an average expected return of 171%. Cloud accounts for 72% of AI infrastructure spending, according to IDC. But AI is a data-hungry technology, and moving large amounts of data around can get very expensive, very quickly.

“Prior to the AI world, data had gravity and pulled everything towards it,” CloudZero’s Peterson says. “But the equation has flipped, and the AI now has a stronger gravitational force. It’s pulling all the data towards it—and that has implications for network design.”

Many companies don’t realize the impact this will have, he says. Enterprise AI experiments are often run by different teams, pulling in data from everywhere they can.

“You can access it by API and start training systems,” says Peterson. “And you might not realize that they will trigger petabytes of data moving across the network. It’s one thing when it’s moving within one cloud provider, but a lot of teams are using frameworks that live outside the provider.”

For example, some business units might be experimenting with OpenAI, some with Google’s AI services, some with Anthropic. “Now I have data moving across the Internet outside my cloud provider, and I’m getting hit by egress traffic costs,” he says. “That’s driving a lot of surprises.”
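The arithmetic behind those surprises is simple. Here is a rough Python sketch; the per-GB rate and data volumes are hypothetical placeholders (real internet egress rates vary by provider, region, and volume tier), but it shows how quickly cross-provider copies add up.

# Back-of-the-envelope egress estimate for training data that leaves the
# cloud provider. All figures are hypothetical assumptions.
def egress_cost_usd(data_gb: float, rate_per_gb: float = 0.09) -> float:
    """Estimated charge for moving data_gb out to the internet."""
    return data_gb * rate_per_gb

# Example: three teams each pull a copy of a 50 TB corpus out to a
# different external AI service.
corpus_tb = 50
teams = 3
total_gb = corpus_tb * 1000 * teams

print(f"Estimated egress: ${egress_cost_usd(total_gb):,.0f}")
# With these placeholder numbers, one round of copies is about $13,500,
# before any repeated pulls for retraining or fine-tuning.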

With traditional genAI, data moves around when the AI is trained or fine-tuned, and then again when RAG (retrieval-augmented generation) embedding is used to add context to genAI queries. As the context windows get bigger, so does the amount of information that can be passed along in each question-and-answer interaction with a large language model.
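A quick estimate shows how per-interaction payloads grow with the context window. Everything in the sketch below is an assumption chosen for illustration, not a measurement.

# Back-of-the-envelope: data moved per day by context-heavy (RAG-style) prompting.
# All inputs are illustrative assumptions.
TOKENS_PER_REQUEST = 100_000   # assumed context actually filled per query
BYTES_PER_TOKEN = 4            # rough average for English text
REQUESTS_PER_DAY = 50_000      # assumed enterprise-wide query volume

daily_bytes = TOKENS_PER_REQUEST * BYTES_PER_TOKEN * REQUESTS_PER_DAY
print(f"~{daily_bytes / 1e9:.0f} GB of prompt payload per day")  # ~20 GB/day here
# If those requests leave the cloud provider, that payload is also billable egress.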

“The super-massive AI black hole just gets bigger,” Peterson says. “The more it ingests, the stronger it becomes.”

Agentic AI increases this by an order of magnitude. Instead of a single question-and-answer interaction, agentic AI systems use armies of agents working in cooperation with one another to accomplish business tasks in a non-deterministic way. Some steps in a process could be repeated until they’re done correctly, and other systems pulled in to help when needed. Then, to make sure that the agents don’t go all Skynet on us, there’s an entire layer of guardrail infrastructure, which is, itself, often powered by genAI.

“It’s going to get very complicated very fast,” Peterson says.

How to manage cloud networking costs

At the University of Reading, Mortimer can control networking costs by buying fixed capacity, deduplicating data, and tuning workflows.

The university serves more than 19,000 students at sites in the U.K. and Malaysia. Plus, researchers generate terabytes of new data every month from simulations, modeling, imaging, and other activities. All this work is supported by on-premises computing and by Azure cloud deployments, with some workloads in AWS and Oracle Cloud.

“We try to channel most of our Azure cloud services to come back to campus via an ExpressRoute so we reduce egress costs,” Mortimer says. “It’s a fixed charge for fixed bandwidth. With a VPN, you pay for consumption, and it can go up and down. But with an ExpressRoute you have SLA-based performance around your throughput, and a fixed price as well.”

Azure ExpressRoute is a service that lets organizations create private connections between their on-prem infrastructure and Microsoft data centers. The university has also cut its cloud storage needs by 70% by using Rubrik Cloud Vault as its backup solution and eliminating duplicated files—which also reduces network traffic costs.
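As a rough illustration of the fixed-versus-metered trade-off Mortimer describes, the sketch below computes the monthly egress volume at which a flat-rate circuit becomes cheaper than per-GB charges. The prices are placeholder assumptions, not actual ExpressRoute or VPN rates.

# Break-even sketch: fixed-capacity private circuit vs. metered egress.
# Both prices are placeholder assumptions, not published rates.
FIXED_CIRCUIT_MONTHLY_USD = 2_500   # assumed flat charge for fixed bandwidth
METERED_EGRESS_PER_GB_USD = 0.08    # assumed per-GB charge on a VPN/internet path

def cheaper_option(monthly_egress_gb: float) -> str:
    metered = monthly_egress_gb * METERED_EGRESS_PER_GB_USD
    return "fixed circuit" if metered > FIXED_CIRCUIT_MONTHLY_USD else "metered egress"

break_even_gb = FIXED_CIRCUIT_MONTHLY_USD / METERED_EGRESS_PER_GB_USD
print(f"Break-even at ~{break_even_gb:,.0f} GB/month")  # 31,250 GB with these numbers
print(cheaper_option(50_000))                           # above break-even -> fixed circuit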

And then there are the fundamentals, Mortimer says. “What we see in ourselves—our peers across the sector—is that they’re good at using DevOps to create things but forget the basics about making sure the data servers are optimized,” he says. “All the good things you’ve done on-prem, you have to make sure you’re doing it in Azure as well. And by asking those questions, you’ll find that you’ll be reducing costs, deleting data you don’t need anymore—all those housekeeping practices.”

He describes it as tuning. “It’s not a silver bullet,” he says. “We’re trying to find the most cost-effective way, that critical balance.”

When it comes to cloud networking costs in particular, there are several key levers that enterprises can use, says Nikhil Roychowdhury, principal and cloud FinOps leader at Deloitte Consulting. They include intelligent routing, tiered storage solutions, and regular audits. But network architecture—deciding where the data actually is—is key, he says.

“Most cloud providers offer no-cost data uploads,” he says. “However, transferring data across services to another provider or back to an on-premises data center incurs costs.”

Data storage should be aligned with where the data will be processed, he says. “This ensures that data is processed and stored near its point of use.” That lowers egress costs and, as a bonus, improves performance as well.
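One lightweight way to act on that advice is a periodic audit that flags storage sitting in a different region from the compute that reads it. The sketch below assumes a hand-maintained inventory; the dataset names and regions are illustrative, and in practice the inventory would come from tagging or the provider’s inventory and billing APIs.

# Audit sketch: flag datasets stored in a different region than the workload
# that consumes them. The inventory below is illustrative.
datasets = {
    "clickstream-archive": "us-east-1",
    "training-corpus": "eu-west-1",
    "feature-store": "us-east-1",
}
workloads = {
    "recommendation-training": {"region": "us-east-1", "reads": ["training-corpus", "feature-store"]},
    "bi-dashboards": {"region": "us-east-1", "reads": ["clickstream-archive"]},
}

for name, wl in workloads.items():
    for ds in wl["reads"]:
        if datasets[ds] != wl["region"]:
            print(f"{name} in {wl['region']} reads {ds} from {datasets[ds]} "
                  f"-- likely cross-region transfer charges")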

This will be even more critical for AI, North’s Biringer says. “The age of agentic AI will push data movement into overdrive,” he says. “Apps will be smarter, more autonomous—and far chattier. That means more unexpected network traffic, higher storage churn, and even more blurred lines between infrastructure layers.”

Enterprises should start looking at building intelligence into the infrastructure layer itself, he says, so that they can have visibility even when the behavior of their systems gets harder and harder to predict.

Source: Network World