Theoretically, there is no impediment to adding library or the compiled version of Golang with BoringCrypto. We added our implementation of the Kyber-512 algorithm (this algorithm can be eventually swapped by another one; we’re not picking any here. We are using it for our testing phase) to those libraries and implemented the “hybrid” mechanism as part of the TLS handshake. For the “classical” algorithm we used curve P-256. We then compiled certain services with these new TLS libraries.
Name of algorithm
Number of times loop executed
Average runtime per operation
Number of bytes required per operation
Number of allocations
Table 1: Benchmarks of the “key share” operation of Curve P-256 and Kyber-512: Scalar Multiplication and Encapsulation respectively. Benchmarks ran on Darwin, amd64, Intel(R) Core(TM) i7-9750H CPU @ 2.60 GHz.
Note that as TLS supports the described “negotiation” mechanism for the key exchange, the client and server have a way of mutually deciding what algorithms they want to use. This means that it is not required that both a client or a server support or even prefer the exact same algorithms: they just need to share support for a single algorithm for a handshake to succeed. In turn, herewith, even if we advertise post-quantum cryptography and a server/client does not support it, they will not fail but rather agree on some other algorithm they share.
A note on a matter we left on hold above: why are we not migrating the authentication phase of TLS to post-quantum? Certificate-based authentication in TLS, which is the one we commonly use at Cloudflare, also depends on systems on the wider Internet. Thus, changes to authentication require a coordinated and much wider effort to change. Certificates are attested as proofs of identity by outside parties: migrating authentication means coordinating a ceremony of migration with these outside parties. Note though that at Cloudflare we use a PKI with internally-hosted Certificate Authorities (CAs), which means that we can more easily change our algorithms. This will still need careful planning. We will not do this today, but we will in the near future.
The first step of our post-quantum migration is done. We have TLS libraries with post-quantum cryptography using a hybrid mechanism. The second step is to test this new mechanism in specific Cloudflare connections and services. We will look at three systems from Cloudflare that we have started migrating to post-quantum cryptography. The services in question are: Logfwrdr, Cloudflare Tunnel, and GoKeyless.
A post-quantum Logfwdr
Logfwdr is an internal service, written in Golang, that handles structured logs, and sends them from our servers for processing (to a subservice called ‘Logreceiver’) where they write them to Kafka. The connection between Logfwdr and Logreceiver is protected by TLS. The same goes for the connection between Logreceiver and Kafka in core. Logfwdr pushes its logs through “streams” for processing.
This service seemed an ideal candidate for migrating to post-quantum cryptography as its architecture is simple, it has long-lived connections, and it handles a lot of traffic. In order to first test the viability of using post-quantum cryptography, we created our own instance of Logreceiver and deployed it. We also created our own stream (the “pq-stream”), which is basically a copy of a HTTP stream (which was remarkably easy to add). We then compiled these services with the modified TLS library and we got a post-quantum protected Logfwdr.
What we found was that using post-quantum cryptography is faster than using “classical” cryptography! This was expected, though, as we are using a lattice-based post-quantum algorithm (Kyber512). The TLS latency of both post-quantum handshakes and “classical” ones can be noted in Figure 1. The figure shows more handshakes are executed than usual behavior as these servers are frequently restarted.
Note though that we are not using “only” post-quantum cryptography but rather the “hybrid” mechanism described above. This could increase performance times: in this case, the increase was minimal and still kept the post-quantum handshakes faster than the classical ones. Perhaps what makes the TLS handshakes faster in the post-quantum case is the usage of TLS 1.3, as the “classical” Logfwdr is using TLS 1.2. Logfwdr, though, is a service that executes long-lived handshakes, so in aggregate TLS 1.2 is not “slower” but it does have a slower start time.
As shown in Figure 2, the average batch duration of the post-quantum stream is lower than when not using post-quantum cryptography. This may be in part due to the fact that we are not sending the quantum-protected data all the way to Kafka (as the non-post-quantum stream is doing). We didn’t yet change the connection to Kafka post-quantum.
We didn’t encounter any failures during this testing that ran for about some weeks. This gave us good insight that putting post-quantum cryptography into our internal network with actual data is possible. It also gave us confidence to begin migrating codebases to modified TLS libraries, which we will maintain.
What are the next steps for Logfwdr? Now that we confirmed it is possible, we will first start migrating stream by stream to this hybrid mechanism until we reach full post-quantum migration.
A post-quantum gokeyless
gokeyless is our own way to separate servers and TLS long-term private keys. With it, private keys are kept on a specialized key server operated by customers on their own architecture or, if using Geo Key Manager, in selected Cloudflare locations. We also use it for Cloudflare-held private keys with a service creatively known as gokeyless-internal. The final piece of this architecture is another service called Keynotto. Keynotto is a service written in Rust that only mints RSA and ECDSA key signatures (that are executed with the stored private key).
How does the overall architecture of gokeyless work? Let’s start with a request. The request arrives at the Cloudflare network and we perform TLS termination. Any signing request is forwarded to Keynotto. A small portion of requests (specifically from GeoKDL or external gokeyless) cannot be handled by Keynotto directly, and are instead forwarded to gokeyless-internal. gokeyless-internal also acts as a key server proxy, as it redirects connections to the customer’s keyservers (external gokeyless). External gokeyless is both the server that a customer runs and the client that will be used to contact it. The architecture can be seen in Figure 3.
Migrating the transport that this architecture uses to post-quantum cryptography is a bigger challenge, as it involves migrating a service that lives on the customer side. So, for our testing phase, we decided to go for the simpler path that we are able to change ourselves: the TLS handshake between Keynotto and gokeyless-internal. This small test-bed means two things: first, that we needed to change another TLS library (as Keynotto is written in Rust) and, second, that we needed to change gokeyless-internal in such a way that it used post-quantum cryptography only for the handshakes with Keynotto and for nothing else. Note that we did not migrate the signing operations that gokeyless or Keynotto executes with the stored private key; we just migrated the transport connections.
Adding post-quantum cryptography to the rustls codebase was a straightforward exercise and we exposed an easy-to-use API call to signal the usage of post-quantum cryptography (as seen in Figure 4 and Figure 5). One thing that we noted when reviewing the TLS usage in several Cloudflare services is that giving the option to choose the algorithms for a ciphersuite, key share, and authentication in the TLS handshake confuses users. It seemed more straightforward to define the algorithm at the library level, and have a boolean or API call signal the need for this post-quantum algorithm.
We ran a small test between Keynotto and gokeyless-internal with much success. Our next steps are to integrate this test into the real connection between Keynotto and gokeyless-internal, and to devise a plan for a customer post-quantum protected gokeyless external. This is the first instance in which our migration to post-quantum will not be ending at our edge but rather at the customer’s connection point.
A post-quantum Cloudflare Tunnel
Cloudflare Tunnel is a reverse proxy that allows customers to quickly connect their private services and networks to the Cloudflare network without having to expose their public IPs or ports through their firewall. It is mainly managed at the customer level through the usage of cloudflared, a lightweight server-side daemon, in their infrastructure. cloudflared opens several long-lived TCP connections (although, cloudflared is increasingly using the QUIC protocol) to servers on Cloudflare’s global network. When a request to a hostname comes, it is proxied through these connections to the origin service behind cloudflared.
The easiest part of the service to make post-quantum secure appears to be the connection between our network (with a service part of Tunnel called origintunneld located there) and cloudflared, which we have started migrating. While exploring this path and looking at the whole life of a Tunnel connection, we found something more interesting, though. When the Tunnel connections eventually reach core, they end up going to a service called Tunnelstore. Tunnelstore runs as a stateless application in a Kubernetes deployment, and to provide TLS termination (alongside load balancing and more) it uses a Kubernetes ingress.
The Kubernetes ingress we use at Cloudflare is made of Envoy and Contour. The latter configures the former depending on Kubernetes resources. Envoy uses the BoringSSL library for TLS. Switching TLS libraries in Envoy seemed difficult: there are thoughts on how to integrate OpenSSL to it (and even some thoughts on adding post-quantum cryptography) and ways to switch TLS libraries. Adding post-quantum cryptography to a modified version of BoringSSL, and then specifying that dependency in the Bazel file of Envoy seems to be the path to go for, as our internal test has confirmed (as seen in Figure 6). As for Contour, for many years, Cloudflare has been running their own patched version of it: we will have to again patch this version with our Golang library to provide post-quantum cryptography. We will make these libraries (and the TLS ones) available for usage.
Changing the Kubernetes ingress at Cloudflare not only makes Tunnel completely quantum-safe (beyond the connection between our global network and cloudflared), but it also makes any other services using ingress safe. Our first tests on migrating Envoy and Contour to TLS libraries that contain post-quantum protections have been successful, and now we have to test how it behaves in the whole ingress ecosystem.
What is next?
The main tests are now done. We now have TLS libraries (in Go, Rust, and C) that give us post-quantum cryptography. We have two systems ready to deploy post-quantum cryptography, and a shared service (Kubernetes ingress) that we can change. At the beginning of the blog post, we said that “the life of most requests at Cloudflare begins and ends at the edge of our global network”: our aim is that post-quantum cryptography does not end there, but rather reaches all the way to where customers connect as well. Let’s explore the future challenges and this customer post-quantum path in this other blog post!