IBM Cloud stumbles again: second major outage in two weeks

IBM Cloud suffered its second major outage in two weeks on Monday, leaving users around the world unable to log in, manage resources, or access essential services.

The incident disrupted 41 services, including IBM Cloud, AI Assistant, DNS Services, Watson AI, Global Search Service, Hyper Protect Crypto Services, Databases, and the Security and Compliance Center.

The incident, which lasted roughly 14 hours, began at 9:05 AM UTC and was resolved by 11:10 PM UTC on Monday. According to IBM’s status update report, users were unable to log into IBM Cloud via the console, CLI, or API. During this period, they were also unable to manage or provision cloud resources. In addition, IAM authentication failures occurred, access to the support portal was disrupted, and the data paths for customer applications may have been affected.

IBM began its investigation and initiated preliminary mitigation efforts, and at 07:42 PM UTC on June 2, the company commenced a controlled recovery process to restore the system. By 11:12 PM UTC, IBM had completed its core recovery actions, allowing users to perform health checks on their applications.

Classified as a Severity One (Sev-1) event, the incident prompted IBM to notify customers by email of failed IAM authentication, the inability to access the support portal for support cases, and potential impacts on customer application data paths.

IBM did not immediately respond to a request for comment.

More than an authentication bug?

“Cloud login disruptions—even if short-lived—delay access to key applications, slow internal coordination, and interfere with automated workflows. Cloud outages that affect user login or platform access don’t always trigger immediate chaos—but they introduce friction that compounds quickly,” said Sanchit Vir Gogia, chief analyst and CEO at Greyhound Research.
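One way teams blunt that friction is to treat authentication as a retryable dependency rather than a hard failure. The sketch below is a minimal illustration, assuming the publicly documented IBM Cloud IAM token endpoint and the Python requests library; a real pipeline would also cache the last valid token until it expires rather than requesting a new one for every call.

```python
import time
import requests  # third-party HTTP client

# IBM Cloud IAM token endpoint (shown for illustration)
IAM_TOKEN_URL = "https://iam.cloud.ibm.com/identity/token"

def get_iam_token(api_key: str, max_attempts: int = 5) -> str:
    """Exchange an API key for a bearer token, retrying with exponential backoff."""
    delay = 2.0
    for attempt in range(1, max_attempts + 1):
        try:
            resp = requests.post(
                IAM_TOKEN_URL,
                data={
                    "grant_type": "urn:ibm:params:oauth:grant-type:apikey",
                    "apikey": api_key,
                },
                timeout=10,
            )
            resp.raise_for_status()
            return resp.json()["access_token"]
        except (requests.RequestException, KeyError):
            if attempt == max_attempts:
                raise  # surface the failure to the calling workflow
            time.sleep(delay)
            delay *= 2  # back off: 2s, 4s, 8s, ...
    raise RuntimeError("unreachable")
```

With this pattern, a short-lived IAM disruption stalls an automated job for a few minutes instead of failing it outright; a multi-hour outage like Monday’s will still surface as an error, but with a clear point of failure.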

Gogia said that a multi-region impact suggests more than an authentication bug—it typically points to a shared backend component like a global DNS resolution layer, orchestration controller, or telemetry service. “Unlike compute or storage failures that tend to be localised, control plane weaknesses ripple across zones, making the outage harder to contain and more disruptive to enterprise teams managing distributed workloads. The lack of regional decoupling in core platform functions remains a concern for CIOs navigating compliance, performance, and isolation trade-offs,” Gogia said.

A similar incident occurred just a fortnight earlier, on May 20, lasting two hours and ten minutes. It affected 14 services, including IBM Cloud, Client VPN for VPC, Code Engine, and Kubernetes Service, among others. During this global cloud platform outage, users faced failures when attempting to log in via the user interface (UI), the CLI, and even API key–based authentication.

When login or IAM services fail, mission-critical workloads can grind to a halt, triggering cascading disruptions across services and regions, said Prabhu Ram, VP for Industry Research Group at CMR. 

Such recurring disruptions have broader implications for enterprise IT strategy, often pushing enterprises to look beyond vendor contracts and strengthen their own cloud resilience.

“To attain true resilience, organizations must prioritize robust technical safeguards—such as multi-cloud strategies and geo-distributed architectures, as well as strong contractual protections, including comprehensive SLAs. While a single outage may not immediately drive change, repeated failures or inadequate incident response can compel enterprises to diversify their cloud providers,” Ram said.

Gogia pointed out that building resilience today goes well beyond backup storage and secondary data centres. “Enterprises are now investing in multi-layer observability, cross-platform orchestration tooling, and secondary access routes that remain available even during vendor platform disruption. This could mean hosting lightweight admin portals outside the primary provider, deploying mirrored telemetry in a separate region, or using independent DNS management.”
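Gogia’s point about secondary access routes lends itself to a small, concrete illustration: a health probe that runs on infrastructure outside the primary provider and checks both the primary’s control plane and a mirrored admin route. The sketch below assumes the Python requests library; the probe names and URLs, including the secondary admin portal, are hypothetical placeholders rather than an official health API.

```python
import requests

# Endpoints to probe from infrastructure hosted outside the primary provider.
# All URLs are illustrative placeholders, not an official health API.
PROBES = {
    "primary-console": "https://cloud.ibm.com",
    "primary-iam": "https://iam.cloud.ibm.com/identity/token",
    "secondary-admin-portal": "https://admin.example-secondary.net/healthz",
}

def probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with a non-5xx status within the timeout."""
    try:
        return requests.get(url, timeout=timeout).status_code < 500
    except requests.RequestException:
        return False

def run_checks() -> None:
    results = {name: probe(url) for name, url in PROBES.items()}
    for name, ok in results.items():
        print(f"{name}: {'up' if ok else 'DOWN'}")
    # If the primary control plane looks down but the secondary route is up,
    # operators (or automation) can shift admin traffic or fail over DNS.
    if not results["primary-iam"] and results["secondary-admin-portal"]:
        print("primary IAM unreachable; consider switching to the secondary access route")

if __name__ == "__main__":
    run_checks()
```

The design choice here is independence: because the probe, its scheduling, and any DNS failover it triggers live outside the affected platform, they keep working even when the provider’s own console, CLI, and status tooling are part of the outage.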

These recent cloud outages, while not catastrophic, serve as useful stress tests that help identify soft spots in architecture and policy.

Source: Network World