In 2017, we launched Geo Key Manager, a service that allows Cloudflare customers to choose where they store their TLS certificate private keys. For example, if a US customer only wants its private keys stored in US data centers, we can make that happen. When a user from Tokyo makes a request to this website or API, it first hits the Tokyo data center. As the Tokyo data center lacks access to the private key, it contacts a datacenter in the US to terminate the TLS request. Once the TLS session is established, the Tokyo datacenter can serve future requests. For a detailed description of how this works, refer to this post on Geo Key Manager.
This is a story about the evolution of systems in response to increase in scale and scope. Geo Key Manager started off as a small research project and, as it got used more and more, wasn’t scaling as well as we wanted it to. This post describes the challenges Geo Key Manager is facing today, particularly from a networking standpoint, and some of the steps along its way to a truly scalable service.
Geo Key Manager started out as a research project that leveraged two key innovations: Keyless SSL, an early Cloudflare innovation; and identity-based encryption and broadcast encryption, a relatively new cryptographic technique which can be used to construct identity-based access management schemes — in this case, identities based on geography. Keyless SSL was originally designed as a keyserver that customers would host on their own infrastructure, allowing them to retain complete ownership of their own private keys while still reaping the benefits of Cloudflare.
Eventually we started using Keyless SSL for Geo Key Manager requests, and later, all TLS terminations at Cloudflare were switched to an internal version of Keyless SSL. We made several tweaks to make the transition go more smoothly, but this meant that we were using Keyless SSL in ways we hadn’t originally intended.
With the increasing risks of balkanization of the Internet in response to geography-specific regulations like GDPR, demand for products like Geo Key Manager which enable users to retain geographical control of their information has surged. A lot of the work we do on the Research team is exciting because we get to apply cutting-edge advancements in the field to Cloudflare scale systems. It’s also fascinating to see projects be used in new and unexpected ways. But inevitably, many of our systems use technology that has never before been used at this scale which can trigger failures as we shall see.
A trans-pacific voyage
In late March, we started seeing an increase in TLS handshake failures in Melbourne, Australia. When we start observing failures in a specific data center, depending on the affected service, we have several options to mitigate impact. For critical systems like TLS termination, one approach is to reroute all traffic using anycast to neighboring data centers. But when we rerouted traffic away from the Melbourne data center, the same failures moved to the nearby data centers of Adelaide and Perth. As we were investigating the issue, we noticed a large spike in timeouts.
The service that is first in line in the system that makes up TLS termination at Cloudflare performs all the parts of TLS termination that do not require access to the customers’ private keys. Since it is an Internet-facing process, we try to keep sensitive information out of this process as much as possible, in case of memory disclosure bugs. So it forwards the key signing request to keynotto, a service written in Rust that performs RSA and ECDSA key signatures.
We continued our investigation. This service was timing out on the requests it sent to keynotto. Next, we looked into the requests itself. We observed that there was an increase in total requests, but not by a very large factor. You can see a large drop in traffic in the Melbourne data center below. That indicates the time where we dropped traffic from Melbourne and rerouted it to other nearby data centers.
We decided to track one of these timed-out requests using our distributed tracing infrastructure. The Jaeger UI allows you to chart traces that take the longest. Thanks to this, we quickly figured out that most of these failures were being caused by a single, new zone that was getting a large amount of API traffic. And interestingly, this zone also had Geo Key Manager enabled, with a policy set to US-only data centers.
Geo Key Manager routes requests to the closest data center that satisfies the policy. This happened to be a data center in San Francisco. That meant adding ~175ms to the actual time spent performing the key signing (median is 3 ms), to account for the trans-pacific voyage
Below are graphs depicting the increase by quantile in RSA key signatures latencies in Melbourne. Plotting them all on the same graph didn’t work right with the scale, so the p50 is shown separately from p70 and p90.
To answer why unrelated zones had timeouts and performance degradation, we have to understand the architecture of the three services involved in terminating TLS.
Life of a TLS request
TLS requests arrive at a data center routed by anycast. Each server runs the same stack of services, so each has its own instance of the initial service, keynotto, and gokeyless (discussed shortly). The first service has a worker pool of half the number of CPU cores. There’s 96 cores on each server, so 48 workers. Each of these workers creates their own connection to keynotto, which they use to send and receive the responses of key-signing requests. keynotto, at that point in time, could multiplex between all 48 of these connections because it spawned a new thread to handle each connection. However, it processed all requests on the same connection sequentially. Given that there could be dozens of requests per second on the same connection, if even a single one was slow, it would cause head of line blocking of all other requests enqueued after it. So long as most requests were short lived, this bug went unnoticed. But when a lot of traffic needed to be processed via Geo Key Manager, the head of line blocking created problems. This type of flaw is usually only exposed under heavy load or when load testing, and will make more sense after I introduce gokeyless and explain the history of keynotto.
gokeyless-internal, very imaginatively named, is the internal instance of our Keyless SSL keyserver written in Go. I’ll abbreviate it to gokeyless for the sake of simplicity. Before we introduced keynotto as the designated key signing process, the first service sent all key signing requests directly to gokeyless. gokeyless created worker pools based on the type of operation consisting of goroutines, a very lightweight thread managed by the Go runtime. The ones of interest are the RSA, ECDSA, and the remote worker pool. RSA and ECDSA are fairly obvious, these goroutines performed RSA key signatures and ECDSA key signatures. Any requests involving Geo Key Manager were placed in the remote pool. This prevented the network-dependent remote requests from affecting local key signatures. Worker pools were an artifact of a previous generation of Keyless which didn’t need to account for remote requests. Using benchmarks we had noticed that spinning up worker pools provided some marginal latency benefits. For local operations only, performance was optimal as the computation was very fast and CPU bound. However, when we started adding remote operations that could block the workers needed for performing local operations, we decided to create a new worker pool only for remote ops.
When gokeyless was designed in 2014, Rust was not a mature language. But that changed recently, and we decided to experiment with a minimal Rust proxy placed between the first service and gokeyless. This would handle RSA and ECDSA signatures, which were about 99% of all key signing operations, while handing off the more esoteric operations like Geo Key Manager over to gokeyless running locally on the same server. The hope was that we could eke out some performance gains from the tight runtime control afforded by Rust and the use of Rust’s cryptographic libraries. Performance is incredibly important to Cloudflare since CPU is one of the main limiting factors for edge servers. Go’s RSA is notorious for being slow and using a lot of CPU. Given that one in three handshakes use RSA, it is important to optimize it. CGo seemed to create unnecessary overhead, and there was no assembly-only RSA that we could use. We tried to speed up RSA without CGo and only using assembly and made some strides, but it was still a little non-optimal. So keynotto was built to take advantage of the fast RSA implementation in BoringSSL.
The next section is a quick diversion into keynotto that isn’t strictly related to the story, but who doesn’t enjoy a hot take on Go vs Rust?
Go vs Rust: The battle continues
While keynotto was initially deployed because it objectively lowered CPU, we weren’t sure what fraction of its benefits over gokeyless were related to different cryptography libraries used vs the choice of language. keynotto used BoringCrypto, the TLS library that is part of BoringSSL, whereas gokeyless used Go’s standard crypto library crypto/tls. And keynotto was implemented in Rust, while gokeyless in Go. However, recently during the process of FedRAMP certification, we had to switch the TLS library in gokeyless to use the same cryptography library as keynotto. This was because while the Go standard library doesn’t use a FIPS validated cryptography module, this Google-maintained branch of Go that uses BoringCrypto does. We then turned off keynotto in a few very large data centers for a couple days and compared this to the control group of having keynotto turned on. We found that moving gokeyless to BoringCrypto provided a very marginal benefit, forcing us to revisit our stance on CGo. This result meant we could attribute the difference to using Rust over Go and implementation differences between keynotto and gokeyless.
Turning off keynotto resulted in an average 26% increase in maximum memory consumed and 71% increase in maximum CPU consumed. In the case of all quantiles, we observed a significant increase in latency for both RSA and ECDSA. ECDSA as measured across four different large data centers (each is a different color) is shown:
Remote ops don’t play fair
The first service, keynotto and gokeyless communicate over TCP via the custom Keyless protocol. This protocol was developed in 2014, and originally intended for communicating with external key servers aka gokeyless. After this protocol started being used internally, by increasing numbers of clients in several different languages, challenges started appearing. In particular, each client had to implement by hand the serialization code and make sense of all the features the protocol provided. The custom protocol also makes tracing more challenging to do right. So while older clients like the first service have tuned their implementations, for newer clients like keynotto, it can be difficult to consider every property such as the right way to propagate traces and such.
One such property that the Keyless protocol offers is multiplexing of requests within a connection. It does so by provisioning unique IDs for each request, which allows for out-of-order delivery of the responses similar to protocols like HTTP/2. While the first service and gokeyless leveraged this property to handle situations where the order of responses is different from the order of requests, keynotto, a much newer service, didn’t. This is why it had to process requests on the same connection sequentially. This led to local key signing requests — which took roughly 3ms — being blocked on the remote requests that took 60x that duration!
We now know why most TLS requests were degraded/dropped. But it’s also worth examining the remote requests themselves. The gokeyless remote worker pool had 200 workers in each server. One of the mitigations we applied was bumping that number to 2,000. Increasing the concurrency in a system when faced with problems that resemble resource constraints is an understandable thing to do. The reason not to do so is high resource utilization. Let’s get some context on why bumping up remote workers 10x wasn’t necessarily a problem. When gokeyless was created, it had separate pools with a configurable number of workers for RSA, ECDSA, and remote ops. RSA and ECDSA had a large fraction of workers, and remote was relatively smaller. Post keynotto, the need for the RSA and ECDSA pool was obviated since keynotto handled all local key signing operations.
Let’s do some napkin math now to prove that bumping it by 10x could not have possibly helped. We were receiving five thousand requests per second in addition to the usual traffic in Melbourne. 200 were remote requests. Melbourne has 60 servers. Looking at a breakdown of remote requests by server, the maximum remote requests one server had was ten requests per second (rps). The graph below shows the percentage of workers in gokeyless actually performing operations, “other” is the remote workers. We can see that while Melbourne worker utilization peaked at 47%, that is still not 100%. And we see that the utilization in San Francisco was much lower, so it couldn’t be SF that was slowing things down.
Given that we had 200 workers per server and at most ten requests per second, even if they took 200 ms each, gokeyless was not the bottleneck. We had no reason to increase the number of remote workers. The culprit here was keynotto’s serial queue.
The solution here was to prevent remote signing ops from blocking local ops in keynotto. We did this by changing the internal task handling model of keynotto to not just handle each connection concurrently, but also to process each request on a connection concurrently. This was done by using a multi-producer, single-consumer queue. When a request from a connection was read, it was handed off to a new thread for processing while the connection thread went back to reading requests. When the request was done processing, the response was written to a channel. A write thread for each connection was also created, which polled the read side of the channel and wrote the response back to the connection.
The next thing we added were carefully chosen timeout values to keynotto. Timeouts allow for faster failure and faster cleanup of resources associated with the timed out request. Choosing appropriate timeouts can be tricky, especially when composed across multiple services. Since keynotto is downstream of the first service but upstream of gokeyless, its timeout should be smaller than the former but larger than the latter to allow upstream services to diagnose which process triggered the timeout. Timeouts can further be complicated when network latency is involved, because trans-pacific p99 latency is very different from p99 between two US states. Using the worst case network latency can be a fair compromise. We chose keynotto timeouts to be an order of magnitude larger than the p99 request latency, since the purpose was to prevent continued stall and resource consumption.
A more general solution to avoid the two issues outlined above would be to use gRPC. gRPC was released in 2015, which was one year too late to be used in Keyless at that time. It has several advantages over custom protocol implementations such as multi-language client libraries, improved tracing, easy timeouts, load balancing, protobuf for serialization and so on, but directly relevant here is multiplexing. gRPC supports multiplexing out of the box which makes it unnecessary to handle request/response multiplexing manually.
Tracking requests across the Atlantic
Then in early June, we started seeing TLS termination timeouts in Chicago. This time, there were timeouts across the first service, keynotto, and gokeyless. We had on our hands a zone with Geo Key Manager policy set to the EU that had suddenly started receiving around 80 thousand remote rps, which was significantly more than the 200 remote rps in Australia. This time, keynotto was not responsible since its head of line blocking issues had been fixed. It was timing out waiting for gokeyless to perform the remote key signatures. A quick glance at the maximum worker utilization of gokeyless for the entire datacenter revealed that it was maxed out. To counteract the issue, we increased the remote workers from 200 to 20,000, and then to 200,000. It wasn’t enough. Requests were still timing out and worker utilization was squarely at 100% until we rate limited the traffic to that zone.
Why didn’t increasing the number of workers by a factor of 1,000 times help? We didn’t even have 250,000 rps for the entire datacenter, let alone every server.
File descriptors don’t scale
gokeyless maintains a map of ping latencies to each data center as measured from the data center that the instance is running in. It updates this map frequently and uses it to make decisions about which data center to route remote requests to. A long time ago, every server in every data center maintained its own connections to other data centers. This quickly grew out of hand and we started running out of file descriptors as the number of data centers grew. So, we switched to a new system of creating pods of 32 servers and shared the connections that each server needed to maintain with other data centers by the size of the pod. This was great for reducing the number of file descriptors being used by each server.
An example will illustrate this better. In a data center with 100 servers, there would be four pods: three with 32 servers each and one with four servers. If there were a total of 65 data centers in the world, then each server in a pod of 32 will be responsible for maintaining connections to two data centers, and each server in the pod of four will handle 16 servers. Remote requests are routed to the data center with the shortest latency. So in a pod of 32 servers, the server that maintains the connection with the closest data center will be responsible for sending remote requests from every member of the pod to the target data center. In case of failure or timeout, the data center which is second closest (and then the third closest) will be responsible. If all three closest data centers return failures, we give up and return a failure. So this means if a data center with 130 servers is receiving 80k remote rps, which is 615 rps/server, then there will be approximately four servers that are actually responsible for routing all remote traffic at any given time, and will each be handling ~20k rps. This was the case in Chicago. These requests were being routed to a data center in Hamburg, Germany and there could be a maximum of four connections at the same time between Chicago and Hamburg.
At that time, gokeyless used a worker pool architecture with a queue for buffering additional tasks. This queue blocked after 1,024 requests were queued, to create backpressure on the client. This is a common technique to prevent overloading — not accepting more tasks than the server knows it can handle. p50 RSA latency was 4 ms. The latency from Chicago to Hamburg was ~100 ms. After we bumped up the remote workers for each server to 200,000 in Chicago, we were surely not constrained on the sending end since we only had 80k rps for the whole data center and each request shouldn’t take more than 100 ms in the vast majority of cases. We know this because of Little’s Law. Little’s Law states that the capacity of a system is the arrival rate multiplied by the service time for each request. Let’s see how it applies here, and how it allowed us to prove why increasing concurrency or queue size did not help.
Consider the queue in the German data center, Hamburg in this case. We hadn’t bumped the number of remote workers there, so only 200 remote workers were available. Assuming 20k rps arrived to one server in Hamburg and each took ~5 ms to process, the necessary concurrency in the system should be 100 workers. We had 200 workers. Even without a queue, Hamburg should’ve been easily able to handle the 20k rps thrown at it. We investigated how many remote rps were actually processed per server in Hamburg:
These numbers didn’t make any sense! The maximum number at any time was a little over 700. We should’ve been able to process orders of magnitude more requests. By the time we investigated this, the incident had ended. We had no visibility into the size of the service level queue — or the size of the TCP queue — to understand what could have caused the timeouts. We speculated that while p50 for the population of all RSA key signatures might be 4 ms, perhaps for the specific population of Geo Key Manager RSA signatures, things took a whole lot longer. With 200 workers, we processed ~500 rps, which meant each operation would have to take 400 ms. p99.9 for signatures can be around this much, so it is possible that was how long things took.
We recently ran a load test to establish an upper bound on remote rps, and discovered that one server in the US could process 7,000 remote rps to the EU — much lower than Little’s Law suggests. We identified several slow remote requests through tracing and noticed these usually corresponded with multiple tries to different remote hosts. The explanation for why these retries are necessary to get through to an otherwise very functional data center is that gokeyless uses a hardcoded list of data centers and corresponding server hostnames to establish connections to. Hostnames can change when new servers are added or removed. If we are attempting to connect to an invalid host and waiting on a one second timeout, we will certainly need to retry to another host, which can cause large increases in average processing time.
After the Hamburg outage, we decided to remove gokeyless worker pools and switch to a model of one goroutine per request. Goroutines are extremely lightweight, with benchmarks that suggest that a 4 GB machine can handle around one million goroutines. This was something that made sense to do after we had extended gokeyless to support remote requests in 2017, but because of the complexity involved in changing the entire architecture of processing requests, we held off on the change. Furthermore, when we launched Geo Key Manager, we didn’t have enough traffic that might have prompted us to rethink this architecture urgently. This removes complexity from the code, because tuning worker pool sizes and queue sizes can be complicated. We have also observed, on average, a 7x drop in the memory consumed on switching to this one goroutine per request model, because we no longer have idle goroutines active as was in the case of worker pools. It also makes it easier to trace the life of an individual request because you get to follow along all the steps in its execution, which can help when trying to chase down per request slowdowns.
The complexity of distributed systems can quickly scale up to unmanageable levels, making it important to have deep visibility into it. An extremely important tool to have is distributed tracing, which tracks a request through multiple services and provides information about time spent in each part of the journey. We didn’t propagate gokeyless trace spans through parts of the remote operation, which prevented us from identifying why Hamburg only processed 700 rps. Having examples on hand for various error cases can also make diagnosis easier, especially in the midst of an outage. While we load tested our TLS termination system in the common case, where key signatures happened locally, we hadn’t load tested the less common and lower volume use case of remote operations. Using sources of truth that update dynamically to reflect the current state of the world instead of static sources is also important. In our case, the list of data centers that gokeyless connected to for performing remote operation was hardcoded a long time ago and never updated. So while we added more data centers, gokeyless was unable to make use of them, and in some cases, may have been connecting to servers with invalid.
We’re now working on overhauling several pieces of Geo Key Manager to make it significantly more flexible and scalable, so think of this as setting up the stage for a future blog post where we finally solve some of the issues outlined here, and stay tuned for updates!