Mexican Standoff

Hi, I’m Filippo and today I managed to surprise myself! (And not in a good way.)

CC-BY-NC-ND image by Martin SoulStealer

I’m developing a new module (“filter” as we call them) for RRDNS, CloudFlare’s Go DNS server. It’s a rewrite of the authoritative module, the one that adds the IP addresses to DNS answers.

It has a table of CloudFlare IPs that looks like this:

type IPMap struct {  
    M map[string][]net.IP

It’s a global filter attribute:

type V2Filter struct {  
    name       string
    IPTable    *IPMap
    // [...]

The table changes often, so a background goroutine periodically reloads it from our distributed key-value store, acquires the lock (f.IPTable.Lock()), updates it and releases the lock (f.IPTable.Unlock()). This happens every 5 minutes.

Everything worked in tests, including multiple and concurrent requests.

Today we deployed to an off-production test machine and everything worked. For a few minutes. Then RRDNS stopped answering queries for the beta domains served by the new code.

What. That worked on my laptop™.

Here’s the IPTable consumer function. You can probably spot the bug.

func (f *V2Filter) getCFAddr(...) (result []dns.RR) {  
    // [... append IPs from f.IPTable.M to result ...]

f.IPTable.RUnlock() is never called. Whoops. But it’s an RLock, so multiple getCFAddr calls should work, and only table reloading should break, no? Instead getCFAddr started blocking after a few minutes. To the docs!

To ensure that the lock eventually becomes available, a blocked Lock call excludes new readers from acquiring the lock.

So everything worked and RLocks piled up until the table reload function ran, then the pending Lock call caused all following RLock calls to block, breaking RRDNS answer generation.

In tests the table reload function never ran while answering queries, so getCFAddr kept piling up RLock calls but never blocked.

No customers were affected because A) the release was still being tested on off-production machines and B) no real customers run on the new code yet. Anyway it was a interesting way to cause a deferred deadlock.

In closing, there’s probably space for a better tooling here. A static analysis tool might output a listing of all Lock/Unlock calls, and a dynamic analysis tool might report still [r]locked Mutex at the end of tests. (Or maybe these tools already exist, in which case let me know!)

Do you want to help (introduce :) and) fix bugs in the DNS server answering more than 50 billion queries every day? We are hiring in London, San Francisco and Singapore!

Read more here:: CloudFlare

Creative foot-shooting with Go RWMutex