Google Cloud outage disrupts more than 50 services globally for over 7 hours

Google Cloud suffered a major outage on Thursday, affecting a wide range of apps and services relied on by enterprises and consumers alike.

The incident, which lasted more than 7 hours, started at 2:51 PM UTC and was resolved by 10:18 PM UTC, according to the Google Cloud Service Health report. This global outage disrupted 54 Google Cloud Platform products, including API Gateway, Agent Assist, Cloud Data Fusion, Cloud Workstations, Contact Center AI Platform, Database Migration Service, Google App Engine, Google Cloud Console, and Vertex Gemini API.

“Following a disruption to a number of Google Cloud services, all products have now been fully restored,” said a Google spokesperson. Google Cloud CEO Thomas Kurian said in a post on X: “We have been hard at work on the outage today and we are now fully restored across all regions and products. We regret the disruption this caused our customers.”

According to Google Cloud’s mini incident report, the issue occurred because an invalid automated quota update to the API management system was distributed globally, causing external API requests to be rejected.

To recover, Google Cloud bypassed the offending quota check, which allowed most regions to recover within two hours. However, the quota policy database in us-central1 became overloaded, resulting in a much longer recovery in that region.

The company also acknowledged that the incident should not have happened. Going forward, it is taking measures that include preventing its API management platform from failing when it receives invalid or corrupt data, preventing metadata from propagating globally without appropriate protection, testing, and monitoring in place, and improving system error handling and comprehensive testing for invalid data.
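Google has not published the code behind these safeguards, but the underlying pattern of validating quota metadata before acting on it, and failing open when a record is unreadable, is a common one. The sketch below is a hypothetical illustration in Python; the field names, the allow_request helper, and the fail-open policy are assumptions for illustration, not Google's implementation.

# Hypothetical sketch of defensive quota handling; field names and the
# fail-open policy are assumptions, not Google's actual implementation.
def is_valid_quota(record: dict) -> bool:
    """Reject quota records with missing fields or nonsensical limits."""
    try:
        return (
            isinstance(record["project"], str) and record["project"] != ""
            and isinstance(record["limit"], int) and record["limit"] >= 0
        )
    except (KeyError, TypeError):
        return False

def allow_request(record: dict | None, used: int) -> bool:
    """Decide whether an incoming API request should be admitted."""
    if record is None or not is_valid_quota(record):
        # Fail open: a corrupt or missing quota record should not cause
        # every external API request to be rejected.
        return True
    return used < record["limit"]

# A corrupt record (no limit field) no longer blocks traffic,
# while a valid record is still enforced.
assert allow_request({"project": "demo"}, used=10) is True
assert allow_request({"project": "demo", "limit": 5}, used=10) is False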

“Resilience isn’t a feature you layer on. It’s an architectural commitment. Performance under adversity — not in perfect conditions — is the real benchmark now. If your system can’t absorb failure without taking your customers down with it, you’re not production-ready in 2025 — especially not in the AI era,” said Spencer Kimball, CEO & Co-Founder, Cockroach Labs.

Fragile dependencies across the internet infrastructure

A global outage that originated within Google Cloud’s internal infrastructure paralyzed services across major platforms, highlighting a systemic vulnerability in the dependencies that underpin hyperscaler ecosystems.

“The domino effect from Google Cloud’s internal IAM failure was felt across dependent platforms like Cloudflare, Spotify, Snapchat, and Discord—not due to hardware failure, but because control-plane dependencies paralyzed core administrative functions,” said Sanchit Vir Gogia, chief analyst and CEO at Greyhound Research.

Cloudflare acknowledged suffering a significant service outage that affected a large set of its critical services, including Workers KV, WARP, Access, Gateway, Images, Stream, Workers AI, Turnstile and Challenges, AutoRAG, Zaraz, and parts of the Cloudflare Dashboard.

According to the company’s blog post, the outage lasted 2 hours and 28 minutes and impacted all Cloudflare customers using the affected services globally. Part of the infrastructure behind the Cloudflare Workers KV service is backed by a third-party cloud provider, and that provider’s outage directly impacted the availability of the KV service, the company said.

“This incident underscores the deep interdependence of today’s internet infrastructure. While cloud providers appear technically independent, they often share critical elements such as routing protocols, DNS services, and edge delivery systems. These shared components create systemic risks, where a failure in one area can ripple across multiple platforms,” said Manish Rawat, analyst, TechInsights. The event exposes the fragility of cloud redundancy, particularly when core internet protocols suffer misconfigurations or outages.

Downdetector recorded increased user reports hinting at potential service issues on AWS and Microsoft Azure around the same time. Yet, according to the official health status pages of AWS and Azure, no outages or disruptions were acknowledged.

In response to Network World’s query, AWS stated that it “didn’t experience an outage, and we continue to operate normally,” adding that “the only accurate source of data on the status of AWS services is the AWS Health Dashboard.”

Contrary to early reports, AWS and Microsoft Azure remained operational; however, internet-wide disruptions and CDN-layer spillovers created the illusion of a multi-cloud failure, explained Gogia. This event confirms that cloud resilience today is not simply about infrastructure uptime — it’s about the structural integrity of orchestration systems that underpin modern workloads.

Similar service disruptions were observed recently when IBM Cloud encountered three significant outages within two weeks.

Enterprises must build for failure

Cloud outages can severely impact enterprises heavily dependent on cloud infrastructure, particularly those lacking multi-cloud or edge-layer failovers, resulting in revenue loss, brand damage, operational disruptions, and data integrity issues. For sectors like fintech, healthcare, and e-commerce, the effects are immediate and critical, highlighting the urgent need for enterprises to strengthen resilience.

“To build true digital resilience, enterprises must move beyond single-cloud or same-vendor failover strategies and adopt multi-cloud, hybrid-cloud, and distributed service models. Key steps include distributing critical workloads across multiple providers, implementing geo-distributed failovers, and using independent third-party services for DNS, authentication, and monitoring,” added Rawat.
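What that looks like in code depends heavily on the stack, but one recurring building block is a client that tries equivalent endpoints hosted with independent providers and moves on when one stops responding. The following is a minimal, hypothetical sketch in Python; the URLs are placeholders and the failover policy is deliberately simplistic.

# Minimal failover sketch; the endpoint URLs are placeholders and the
# ordering/health-check policy is illustrative, not a production design.
import urllib.request

ENDPOINTS = [
    "https://api.primary-cloud.example.com/v1/orders",    # primary provider
    "https://api.secondary-cloud.example.com/v1/orders",  # independent provider
]

def fetch_with_failover(timeout: float = 2.0) -> bytes:
    """Try each provider in order; raise only if all of them fail."""
    last_error = None
    for url in ENDPOINTS:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status == 200:
                    return resp.read()
        except OSError as err:  # timeouts, DNS failures, refused connections, HTTP errors
            last_error = err
    raise RuntimeError(f"all providers unavailable: {last_error}")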

Enterprises should also embrace chaos engineering and conduct regular resiliency drills to test real-world failure scenarios. Additionally, leveraging edge computing and local caching can reduce dependency on centralized cloud infrastructure.
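Local caching is the easier of the two to sketch. Keeping a last-known-good copy of a remote dependency lets a service degrade gracefully rather than fail outright when its cloud provider is unreachable. In the hypothetical Python example below, get_remote_config stands in for any centrally hosted configuration or feature-flag service; the file path and function names are assumptions for illustration.

# Rough last-known-good cache; get_remote_config is a stand-in for any
# call to a centralized cloud dependency such as a feature-flag service.
import json
import pathlib

CACHE_FILE = pathlib.Path("feature_flags.cache.json")

def get_remote_config() -> dict:
    """Placeholder for a call to a cloud-hosted dependency."""
    raise ConnectionError("upstream unavailable")  # simulate the outage

def load_config() -> dict:
    try:
        config = get_remote_config()
        CACHE_FILE.write_text(json.dumps(config))  # refresh the local copy
        return config
    except (ConnectionError, TimeoutError):
        if CACHE_FILE.exists():
            # Upstream is down: serve the last successfully fetched copy.
            return json.loads(CACHE_FILE.read_text())
        raise  # no cached copy yet; surface the failure

CACHE_FILE.write_text(json.dumps({"new_checkout": False}))  # pretend a prior fetch succeeded
print(load_config())  # served from the cache despite the simulated outage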

Source: Network World