Route leak incident on October 2, 2014
Today, CloudFlare suffered downtime which caused customers’ sites to be inaccessible in certain parts of the world. We take the availability of our customers’ web properties very seriously. Incidents like this get the absolute highest priority, attention, and follow-up. The pain felt by our customers is also felt deeply by the CloudFlare team in London and San Francisco.
This downtime was the result of a BGP route leak by Internexa, an ISP in Latin America. Internexa accidentally directed large amounts of traffic destined for CloudFlare data centers around the world to a single data center in Medellín, Colombia. The leak occurred because Internexa announced via BGP that its network, rather than ours, handled traffic for CloudFlare. This misdirection caused a flood of traffic that quickly overwhelmed the data center in Medellín. The incident lasted 49 minutes, from 15:08 UTC to 15:57 UTC.
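To see why a leaked announcement can pull traffic away from its legitimate destination, consider a toy model of BGP route selection. The sketch below is purely illustrative: the prefixes and AS numbers are reserved documentation values, not CloudFlare’s or Internexa’s actual ones, and real BGP best-path selection involves many more tie-breakers than shown here.

```python
# Toy illustration of why a BGP route leak redirects traffic.
# Prefix 198.51.100.0/24 (TEST-NET-2) and ASNs 64496-64511 are
# reserved for documentation; they are hypothetical, not the real
# networks involved in this incident.
from ipaddress import ip_address, ip_network

def best_route(routes, dst):
    """Pick a route by longest-prefix match, breaking ties by
    shortest AS path (a simplified slice of BGP best-path selection)."""
    candidates = [r for r in routes if ip_address(dst) in ip_network(r["prefix"])]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda r: (-ip_network(r["prefix"]).prefixlen, len(r["as_path"])),
    )

routes = [
    # Legitimate announcement: reaches the anycast network via transit.
    {"prefix": "198.51.100.0/24", "as_path": [64496, 64497],
     "next_hop": "global anycast"},
    # Leaked announcement: a peer re-advertises a route it learned,
    # presenting a shorter AS path, so traffic funnels through it.
    {"prefix": "198.51.100.0/24", "as_path": [64500],
     "next_hop": "single data center"},
]

print(best_route(routes, "198.51.100.7")["next_hop"])  # prints "single data center"
```

Because both announcements cover the same prefix, the tie falls to AS-path length, and the leaked route wins everywhere it propagates. This is roughly why traffic from around the world converged on one location instead of the nearest data center.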
The exact impact of the route leak on our customers’ visitors depended on the geography of the Internet. Traffic to CloudFlare customers’ sites dropped by 50% in North America and 12% in Europe. The impact on our network in Asia was isolated to China. Traffic from South America was also affected, as data centers there had to cope with an influx of traffic normally handled elsewhere.
In the past, we’ve written about the inherent fragility of the Internet’s core routing system and the problem of “route leakage”. Throughout 2014, we’ve seen numerous high-profile leaks. In March, Google’s DNS was inaccessible in parts of the world because of a route leak. In April, an Indonesian ISP leaked routes for large swathes of the Internet for a two-hour period. Then in September, portions of the Internet went offline when a hosting company leaked 400,000 routes. Route leakage is a hard problem that impacts every Internet service provider, with no obvious or quick solution.
Today’s incident was unrelated to our recent announcement of Universal SSL, and was not the result of an attack. We worked directly with Internexa to resolve the route leak, and all CloudFlare systems are now operating normally. For the time being, we have quarantined the Medellín data center and disabled connectivity with Internexa. CloudFlare still has plenty of capacity to continue operating our network without that data center while we work with Internexa to understand the exact cause of their route leak. We are also beginning a post-mortem to ensure that our internal protocols were followed and to identify areas for improvement.
Finally, we plan to proactively issue service credits to accounts covered by SLAs. Any amount of downtime is completely unacceptable to us. The entire CloudFlare team remains focused on delivering the best service to customers worldwide.