Top 8 outages of 2024

Network outages continued to plague organizations of all sizes in 2024, with major incidents exposing vulnerabilities in digital infrastructure and the growing complexity of IT environments, according to ThousandEyes’ recent Internet outage report.

ThousandEyes’ recap of the most impactful and memorable outages highlighted key trends, including the frequency of major outages that stemmed from backend configuration changes and automated system failures. The report also notes that the ratio of cloud service provider (CSP) outages to ISP outages shifted to a significant degree in 2024: CSP outages climbed from 17% to 27% of all outages, while ISP outages decreased from 83% to 73%.

“Outages are going to continue happening to everyone, regardless of how mature your operations practices are. The reality is that visibility and proactive monitoring are more critical than ever to quickly identify and resolve these inevitable disruptions,” said Kemal Sanjta, principal Internet analyst at ThousandEyes, during a webinar sharing the details of the report.

Notable 2024 outages included CrowdStrike’s boot loop, Cloudflare’s HTTP timeouts, and OpenAI’s telemetry service deployment. Here we share the details of the top eight outages in 2024, according to ThousandEyes.

Microsoft Teams service disruption: January 26

  • Duration: 7+ hours
  • Symptoms: The disruption began with frozen apps, login errors, and users left hanging in meeting waiting rooms.
  • Cause: A problem inside Microsoft’s own network affected the collaboration service.
  • Takeaways: “ThousandEyes’ own observations during the incident indicated that the failure was consistent with issues in Microsoft’s own network. Failover didn’t appear to relieve the issue for many users, although further ‘network and backend service optimization efforts’ did eventually restore service.”
  • For more information: Check out the ThousandEyes outage analysis.

Meta outage: March 5

  • Duration: 2+ hours
  • Symptoms: Services such as Facebook, Instagram, Messenger, and Threads were inaccessible, and users were unable to get past the login or authentication process.
  • Cause: Likely a backend failure in one of the dependencies that the login system relies on.
  • Takeaways: “There are certain scenarios in which users might encounter problems with an application that could cause the service to appear unresponsive even though it is still accessible. In this case, it appeared that some Facebook users received login rejections, while some Instagram users were unable to refresh their feeds. Both cases appear to be related to authentication problems.”
  • For more information: Check out the ThousandEyes outage analysis.

Atlassian Confluence disruption: March 26

  • Duration: 1+ hours
  • Symptoms: Customers had problems accessing the service and received HTTP 502 Bad Gateway errors.
  • Cause: The root cause of this outage was traced back to an application-level issue on a server, not a networking or infrastructure-related issue.
  • Takeaways: “ThousandEyes’ analysis revealed it affected users all over the globe. By tracing the network paths to the application’s front-end web servers, hosted in AWS, it was clear that this was a backend issue rather than network connectivity itself.” A simplified sketch of that kind of backend-versus-network triage follows below.
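
To make that kind of triage concrete, here is a minimal sketch (not part of the ThousandEyes report; the hostname and URL are placeholders) of how a script might separate a dead network path from a failing application backend: if a TCP connection to the front end succeeds but the HTTP request comes back with a 5xx response, the problem most likely sits behind the web tier rather than in the network.

```python
import socket
import urllib.error
import urllib.request


def classify_failure(host: str, url: str, timeout: float = 5.0) -> str:
    """Rough triage: is the problem the network path or the application backend?"""
    # Step 1: can we reach the front-end web server over TCP at all?
    try:
        with socket.create_connection((host, 443), timeout=timeout):
            pass
    except OSError:
        return "network/connectivity issue (TCP connect to the front end failed)"

    # Step 2: the front end is reachable -- does the application answer sensibly?
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return f"service healthy (HTTP {resp.status})"
    except urllib.error.HTTPError as err:
        if err.code in (500, 502, 503, 504):
            return f"backend/application issue (front end reachable, HTTP {err.code})"
        return f"request rejected (HTTP {err.code})"
    except urllib.error.URLError as err:
        return f"request failed after connect: {err.reason}"


# Placeholder host and URL; substitute the service you are checking.
print(classify_failure("example.atlassian.net", "https://example.atlassian.net/wiki"))
```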

Google.com outage: May 1

  • Duration: 1 hour
  • Symptoms: Users encountered HTTP 502 error messages instead of the expected search results.
  • Cause: Likely a problem with the connectivity or linkage between the webpage and the search engine, rather than an issue with the search engine itself. Google has not provided an official explanation.
  • Takeaways: “ThousandEyes’ analysis revealed a ‘lights on/lights off’ scenario, where service suddenly dropped, suggesting a problem with backend name resolution or something connected to policy/security verification, rather than an issue with the search engine itself.”
  • For more information: Check out the ThousandEyes outage analysis.

CrowdStrike sensor update incident: July 19

  • Duration: 1+ hours
  • Symptoms: Windows hosts entered a ‘vicious boot loop’ ending in the blue screen of death (BSOD); machines crashed at an alarming rate, and critical services, including Microsoft 365, were knocked offline.
  • Cause: A faulty Falcon Sensor configuration update that specifically affected Windows systems.
  • Takeaways: “The scale and breadth of the outages that flowed from this single shared dependency have potential impacts for digital economies worldwide. … Recovery was not a simple task, requiring IT staff to physically attend machines to get them functional. At one point, Microsoft reported up to 15 reboots per machine may be needed.”
  • For more information: Check out the ThousandEyes outage analysis.

Cloudflare disruption: September 16

  • Duration: 2 hours
  • Symptoms: Performance problems that caused reachability issues for applications that rely on Cloudflare’s CDN and networking services, such as HubSpot, RingCentral, and Zoom.
  • Cause: Cloudflare acknowledged that the outage was caused by a new telemetry service deployment that had overwhelmed its Kubernetes control plane.
  • Takeaways: “The ThousandEyes platform showed the impact on these third-party applications clearly, with agents in the U.S., Canada, and India all failing to connect to the various applications during the outage. This is a good example of how you can avert the ‘is it just me?’ problem. By tracking the entire service delivery process of your applications, you can follow the network paths taken by your apps—and the suppliers they are connected to.” A simplified, single-vantage version of that kind of check is sketched after this list.
  • For more information: Check out the ThousandEyes outage analysis.
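
As a rough illustration of that idea, the sketch below (not from the report; the endpoints are illustrative stand-ins) polls a handful of externally hosted applications and records HTTP status and latency. Running the same probe from machines in several regions, the way distributed monitoring agents do, is what turns “is it just me?” into an answerable question.

```python
"""Poll a set of third-party endpoints and record HTTP status and latency.
Run the same probe from machines in several regions (as distributed
monitoring agents do) to separate a local problem from a provider-wide one."""
import time
import urllib.error
import urllib.request

# Illustrative endpoints only; substitute the applications your users depend on.
ENDPOINTS = [
    "https://app.hubspot.com",
    "https://app.ringcentral.com",
    "https://zoom.us",
]


def probe(url: str, timeout: float = 10.0) -> dict:
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status               # server answered normally
    except urllib.error.HTTPError as err:
        status = err.code                      # server answered with an error status
    except (urllib.error.URLError, OSError) as err:
        status = f"unreachable ({err})"        # never got an HTTP response
    return {"url": url, "status": status, "latency_s": round(time.monotonic() - start, 3)}


for endpoint in ENDPOINTS:
    print(probe(endpoint))
```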

Microsoft outage: November 25

  • Duration: Intermittent, with 24+ hours of technical delays
  • Symptoms: The outage manifested as timeout errors, resolution failures, and HTTP 503 status codes, indicating that the backend service or system was unavailable; affected services included Outlook Online.
  • Cause: Microsoft later confirmed that the problems stemmed from a change that caused an influx of retry requests routed through servers, impacting service availability. (A generic sketch of retry backoff, the standard guard against that kind of retry pile-up, follows this list.)
  • Takeaways: “While the services were not responsive, errors on the receiving side, along with server-side status codes, indicated that while the service front end was reachable, subsequent requests for components, objects, or other services were not consistently available. The intermittent nature of the problem also meant that it was not always obvious to end users, often presenting as slow or lagging responses.”
  • For more information: Check out the ThousandEyes outage analysis.
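
An influx of retries is a common way a brief backend wobble turns into a prolonged availability problem. The sketch below is a generic illustration, not Microsoft’s implementation: exponential backoff with jitter retries transient 429/503 responses while spreading the retry load out over time. The commented-out URL is purely hypothetical.

```python
import random
import time
import urllib.error
import urllib.request


def fetch_with_backoff(url: str, max_attempts: int = 5) -> bytes:
    """Retry transient failures (HTTP 429/503) with exponential backoff plus jitter,
    so a crowd of clients retrying at once does not amplify the original outage."""
    for attempt in range(max_attempts):
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            if err.code not in (429, 503):     # only retry throttling/unavailable errors
                raise
        except urllib.error.URLError:
            pass                               # transient network error: also retry
        # Back off exponentially (1s, 2s, 4s, ...) and add jitter to spread retries out.
        time.sleep((2 ** attempt) + random.uniform(0, 1))
    raise RuntimeError(f"gave up after {max_attempts} attempts: {url}")


# Hypothetical usage:
# body = fetch_with_backoff("https://status.example.com/health")
```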

OpenAI outage: December 11

  • Duration: 4+ hours
  • Symptoms: Users reported difficulties logging in to the platform and saw partial page loads, with requests for further information returning HTTP 403 errors; API calls were also returning errors, and the disruption ultimately expanded until all of OpenAI’s services were unavailable.
  • Cause: The issue “stemmed from a new telemetry service deployment that unintentionally overwhelmed the Kubernetes control plane, causing cascading failures across critical systems,” according to an OpenAI post-mortem update.
  • Takeaways: “Comprehensive, component-level monitoring of web applications is essential, beyond just looking at top-level metrics like overall page load time. This granular visibility can make the difference in quickly pinpointing the root cause of complex application outages.” A minimal sketch of such a component-level check follows this list.
  • For more information: Check out the ThousandEyes outage analysis.
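
Here is a minimal illustration of component-level checking, using placeholder example.com URLs rather than OpenAI’s real endpoints: instead of timing one top-level page load, each dependency is probed separately so the failing component (login page, API, static assets) stands out on its own line.

```python
"""Check each component an application depends on, not just the top-level page,
so the failing piece (login, API, static assets) is obvious during an incident."""
import urllib.error
import urllib.request

# Placeholder components for a hypothetical web application.
COMPONENTS = {
    "login page":    "https://example.com/login",
    "session API":   "https://api.example.com/v1/session",
    "static assets": "https://cdn.example.com/app.js",
}


def check(url: str, timeout: float = 10.0) -> str:
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return f"HTTP {resp.status}"
    except urllib.error.HTTPError as err:
        return f"HTTP {err.code}"              # component reachable but erroring (e.g., 403)
    except (urllib.error.URLError, OSError) as err:
        return f"unreachable ({err})"


for name, url in COMPONENTS.items():
    print(f"{name:<13} {check(url)}")
```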

Source: Network World