
Power-related outages remain the leading cause of major data center downtime, accounting for more than half (54%) of all cases tracked in Uptime Institute’s 7th annual outage analysis. Network-related and IT systems issues accounted for 12% and 11% of impactful data center outages, respectively, with networking/connectivity accounting for 30% of end-to-end IT service outages, the report found.
Uptime conducted analysis using the latest data from several of its reports and surveys in both 2024 and 2025, and the research shows that the frequency of outages is decreasing. In a 2024 survey, 53% of operators reported an outage in the past three years. In 2023, 55% reported an outage. In 2022, 60% reported downtime, and 69% said the same in 2021. In 2020, 78% of operators reported an outage over the past three years.
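To put the trend in concrete terms, the short Python sketch below computes the year-over-year change from the survey figures quoted above. The variable names are illustrative only, and the calculation assumes the five yearly figures are directly comparable.

```python
# Share of operators (%) reporting an outage in the prior three years,
# keyed by survey year, as quoted in the article.
outage_share = {2020: 78, 2021: 69, 2022: 60, 2023: 55, 2024: 53}

years = sorted(outage_share)

# Year-over-year change in percentage points; negative values mean
# fewer operators reported an outage than in the previous survey.
yoy_change = {
    later: outage_share[later] - outage_share[earlier]
    for earlier, later in zip(years, years[1:])
}

# Change across the whole period covered by the figures.
total_change = outage_share[years[-1]] - outage_share[years[0]]

for year, delta in yoy_change.items():
    print(f"{year}: {delta:+d} percentage points")
print(f"{years[0]} to {years[-1]}: {total_change:+d} percentage points")
```

On these figures, the share of operators reporting an outage fell 25 percentage points between the 2020 and 2024 surveys, with the yearly decline narrowing from 9 points to 2.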
“Outages overall have slowed down,” said Andy Lawrence, founding member and executive director of Uptime Intelligence, in a statement. “Data center operators are facing a growing number of external risks beyond their control, including power grid constraints, extreme weather, network provider failures, and third-party software issues. And despite a more volatile risk landscape, improvements are occurring.”
While outages are becoming less frequent, power remains the leading cause of impactful outages. The most commonly cited causes of power-related outages were:
- UPS failure: 42%
- Transfer switch failure: 36%
- Generator failure: 28%
- Transfer switch between paths (A/B) failure: 23%
- Controls failure: 15%
- Single corded IT device failure: 11%
- Power distribution unit failure: 11%
“Power has been the leading cause. Power is going to be the leading cause for the foreseeable future. And one should expect it because every piece of equipment in the data center, whether it’s a facilities piece of equipment or an IT piece of equipment, it needs power to operate. Power is pretty unforgiving,” said Chris Brown, chief technical officer at Uptime Institute, during a webinar sharing the report findings. “It’s fairly binary. From a practical standpoint of being able to respond, it’s pretty much on or off.”
Also, on a positive note, there are signs that outage severity is decreasing, with only 9% of reported incidents in 2024 being classified as serious or severe, according to Uptime. “Just over half said that they have had an outage in the last three years, and for those who said yes … overall about three-quarters of those are not significant,” Lawrence explained.
Still, IT and networking issues increased in 2024, according to Uptime Institute. The analysis attributed the rise to increased IT and network complexity, specifically change management and misconfigurations.
“Particularly with distributed services, cloud services, we find that cascading failures often occur when networking equipment is replicated across an entire network,” Lawrence explained. “Sometimes the failure of one forces traffic to move in one direction, overloading capacity at another data center.”
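The cascading pattern Lawrence describes can be illustrated with a toy model. The Python sketch below is not from the Uptime report: the site names, load figures, and the rule that a failed site's traffic is split evenly among the surviving sites are all assumptions made for illustration.

```python
def simulate_cascade(load, capacity, first_failure):
    """Return the order in which sites fail once `first_failure` goes down.

    Simplifying assumption (not from the report): a failed site's traffic
    is split evenly across every site that is still up.
    """
    load = dict(load)            # copy so the caller's data is untouched
    failed = []
    pending = [first_failure]    # sites that are about to go down
    while pending:
        site = pending.pop(0)
        if site in failed:
            continue
        failed.append(site)
        survivors = [s for s in load if s not in failed]
        if not survivors:
            break
        share = load[site] / len(survivors)
        load[site] = 0.0
        for s in survivors:
            load[s] += share
            # A survivor pushed past its capacity fails in a later round.
            if load[s] > capacity[s] and s not in pending:
                pending.append(s)
    return failed


# Hypothetical three-site deployment; all numbers are made up.
load = {"A": 60, "B": 80, "C": 50}
tight = {"A": 100, "B": 100, "C": 100}
roomy = {"A": 150, "B": 150, "C": 150}

print(simulate_cascade(load, tight, "A"))
print(simulate_cascade(load, roomy, "A"))
```

With tight capacity, losing site A overloads B, and B's redirected traffic then overloads C. With more headroom at each site, the same initial failure stays contained.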
The most common causes of major network-related outages were cited as:
- Configuration/change management failure: 50%
- Third-party network provider failure: 34%
- Hardware failure: 31%
- Firmware/software error: 26%
- Line breakages: 17%
- Malicious cyberattack: 17%
- Network overload/congestion failure: 13%
- Corrupted firewall/routing tables issues: 8%
- Weather-related incident: 7%
Configuration/change management issues also topped the most common causes of major IT system-/software-related outages, at 62%. Change-related disruptions are consistently responsible for software-related outages.
Human error continues to be one of the “most persistent challenges in data center operations,” according to Uptime’s analysis. The report found that the biggest cause of these failures is data center staff failing to follow established procedures, which has increased by about 10 percentage points compared to 2023.
“These are things that were 100% under our control. I mean, we can’t control when the UPS module fails because it was either poorly manufactured, it had a flaw, or something else. This is 100% under our control,” Brown said. The most common causes of major human error-related outages were reported as:
- Data center staff failing to follow procedure: 58%
- Incorrect staff processes/procedures: 45%
- Installation issues: 24%
- In-service issues: 19%
- Insufficient staff: 18%
- Preventative maintenance frequency issues: 16%
- Data center design or omissions: 14%
Brown explained that data center operators are having trouble creating processes and providing adequate training, especially with the speed at which data centers are expanding and the limited experience of new staff. Uptime analysts suggested that human error would be the easiest and most cost-effective area to improve in preventing outages.
“This is probably the low-hanging fruit. This is probably the cheapest way to reduce the likelihood of outages, for instance, better training, better processes, better communication of those processes,” Lawrence added.
Source: Network World