Cloudflare Dashboard and API Outage on April 15, 2020

By GIXnews

Starting at 1531 UTC and lasting until 1952 UTC, the Cloudflare Dashboard and API were unavailable because of the disconnection of multiple, redundant fibre connections from one of our two core data centers.

This outage was not caused by a DDoS attack, or related to traffic increases caused by the COVID-19 crisis. Nor was it caused by any malfunction of software or hardware, or any misconfiguration.

What happened

As part of planned maintenance at one of our core data centers, we instructed technicians to remove all the equipment in one of our cabinets. That cabinet contained old inactive equipment we were going to retire and had no active traffic or data on any of the servers in the cabinet. The cabinet also contained a patch panel (switchboard of cables) providing all external connectivity to other Cloudflare data centers. Over the space of three minutes, the technician decommissioning our unused hardware also disconnected the cables in this patch panel.

This data center houses Cloudflare’s main control plane and database, so when we lost connectivity, the Dashboard and API became unavailable immediately. The Cloudflare network itself continued to operate normally, and proxied customer websites and applications continued to operate, as did Magic Transit, Cloudflare Access, and Cloudflare Spectrum. All security services, such as our Web Application Firewall, continued to work normally.

But the following were not possible:

Logging into the Dashboard
Using the API
Making any configuration changes (such as changing a DNS record)
Purging cache
Running automated Load Balancing health checks
Creating Argo Tunnel connections
Creating or updating Cloudflare Workers
Transferring domains to Cloudflare Registrar
Accessing Cloudflare Logs and Analytics
Encoding videos on Cloudflare Stream
Logging information from edge services (customers will see a gap in log data)

No configuration data was lost as a result of the outage. Our customers’ configuration data is both backed up and replicated off-site, but neither backups nor replicas were needed. All configuration data remained in place.

How we responded

During the outage period, we worked simultaneously to cut over to our disaster recovery core data center and restore connectivity.

Dozens of engineers worked in two virtual war rooms, as Cloudflare is mostly working remotely because of the COVID-19 emergency. One room was dedicated to restoring connectivity, the other to disaster recovery failover.

We quickly failed over our internal monitoring systems so that we had visibility of the entire Cloudflare network. This gave us global control and the ability to see issues in any of our network locations in more than 200 cities worldwide. This cutover meant that Cloudflare’s edge service could continue running normally and the SRE team could deal with any problems that arose in the day-to-day operation of the service.

As we worked the incident, we decided every 20 minutes whether to fail over the Dashboard and API to disaster recovery or to continue trying to restore connectivity. If there had been physical damage to the data center (e.g. if this had been a natural disaster), the decision to cut over would have been easy. But because we had run tests on the failover, we knew that the failback from disaster recovery would be very complex, so we weighed the best course of action as the incident unfolded.

At 1944 UTC the first link from the data center to the Internet came back up. This was a backup link with 10Gbps of connectivity.
At 1951 UTC we restored the first of four large links to the Internet.
At 1952 UTC the Cloudflare Dashboard and API became available.
At 2016 UTC the second of four links was restored.
At 2019 UTC the third of four links was restored.
At 2031 UTC fully-redundant connectivity was restored.
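
For readers who want the elapsed times, the timeline can be turned into durations with a few lines of Python. This is only an illustrative sketch: the timestamps come from this post, while the function names, the event labels, and the choice of 1531 UTC as the reference point are assumptions made for the example.

```python
from datetime import datetime, timedelta

# Timestamps (UTC, April 15, 2020) as reported in this post.
# The labels are paraphrased for the example.
START = "1531"
EVENTS = [
    ("1944", "backup 10Gbps link restored"),
    ("1951", "first of four large links restored"),
    ("1952", "Dashboard and API available"),
    ("2016", "second of four links restored"),
    ("2019", "third of four links restored"),
    ("2031", "fully-redundant connectivity restored"),
]


def parse_utc(hhmm: str) -> datetime:
    """Parse an HHMM string as a time on the day of the incident."""
    return datetime.strptime("2020-04-15 " + hhmm, "%Y-%m-%d %H%M")


def elapsed(start: str, end: str) -> timedelta:
    """Time between two HHMM timestamps on the same day."""
    return parse_utc(end) - parse_utc(start)


def format_td(td: timedelta) -> str:
    """Render a timedelta as hours and minutes."""
    minutes = int(td.total_seconds()) // 60
    return f"{minutes // 60}h {minutes % 60:02d}m"


if __name__ == "__main__":
    for hhmm, label in EVENTS:
        print(f"{hhmm} UTC  +{format_td(elapsed(START, hhmm))}  {label}")
```

Measured this way, the Dashboard and API were unavailable for 4 hours 21 minutes, and fully-redundant connectivity returned 5 hours after the start of the outage.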

Moving forward

We take this incident very seriously, and recognize the magnitude of the impact it had. We have identified several steps we can take to reduce the risk of these sorts of problems recurring, and we plan to start working on them immediately:

Design: While the external connectivity used diverse providers and led to diverse data centers, we had all the connections going through only one patch panel, creating a single physical point of failure. This should be spread out across multiple parts of our facility.

Documentation: After the cables were removed from the patch panel, we lost valuable time identifying for data center technicians the critical cables providing external connectivity to be restored. We should take steps to ensure the various cables and panels are labeled for quick identification by anyone working to remediate the problem. This should expedite our ability to access the needed documentation.

Process: While sending our technicians instructions to retire hardware, we should call out clearly the cabling that should not be touched.
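
The design point can be illustrated with a small check: given an inventory of supposedly redundant links, flag any attribute whose value is the same for every link. This is a hedged sketch and not a description of Cloudflare's tooling. The `Link` record, its fields, the function name, and all provider, site, and panel names below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """Hypothetical inventory record for one external connection."""
    name: str
    provider: str
    remote_site: str
    patch_panel: str


def single_points_of_failure(links, attrs=("provider", "remote_site", "patch_panel")):
    """Return {attribute: value} for each attribute that all links share."""
    shared = {}
    for attr in attrs:
        values = {getattr(link, attr) for link in links}
        # Redundancy is defeated when every link depends on the same value.
        if len(links) > 1 and len(values) == 1:
            shared[attr] = values.pop()
    return shared


# Invented example data: diverse providers and remote sites, one shared panel.
SHARED_PANEL = [
    Link("link-1", "provider-a", "site-1", "panel-A"),
    Link("link-2", "provider-b", "site-2", "panel-A"),
    Link("link-3", "provider-c", "site-3", "panel-A"),
]

# The same links after spreading them across two panels.
SPREAD_OUT = [
    Link("link-1", "provider-a", "site-1", "panel-A"),
    Link("link-2", "provider-b", "site-2", "panel-B"),
    Link("link-3", "provider-c", "site-3", "panel-A"),
]

if __name__ == "__main__":
    print(single_points_of_failure(SHARED_PANEL))
    print(single_points_of_failure(SPREAD_OUT))
```

With the invented data, the first inventory reports the shared patch panel and the second reports nothing. A real check would need to cover more physical dependencies (cabinets, conduits, power feeds) than the three fields assumed here.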

We will be running a full internal post-mortem to ensure that the root causes of this incident are found and addressed.

We are very sorry for the disruption.

Source: Cloudflare