Network Performance Update: Full Stack Week

Network Performance Update: Full Stack Week

This blog was published on 20 November, 2021. As we continue to optimize our network we’re publishing regular updates, which are available here.

A little over two months ago, we shared extensive benchmarking results of last mile networks all around the world. The results showed that on a range of tests (TCP connection time, time to first byte, time to last byte), and on a range of measurements (p95, mean), that Cloudflare was the fastest provider in 49% of networks around the world. Since then, we’ve worked to continuously improve performance until we’re the fastest everywhere. We set a goal to grow the number of networks where we’re the fastest by 10% every Innovation Week. We met that goal during Birthday Week (September 2021).

Today, we’re proud to report we blew the goal away for Full Stack Week (November 2021). Cloudflare measured our performance against the top 1,000 networks in the world (by number of IPv4 addresses advertised). Out of those, Cloudflare has become the fastest provider in 79 new networks, an increase of 14% of these 1000 networks. Of course, we’re not done yet, but we wanted to share the latest results and explain how we did it.

However, before we go into more detail on our network performance, we wanted to share new performance metrics on our Workers platform (given it’s Full Stack Week!). We’ve crunched the numbers of Cloudflare Workers vs Fastly’s Compute@Edge, and the results are in: Workers is 196% faster.

Faster Network Means Faster Stack

A few months ago, we also discussed the performance of Cloudflare Workers, as compared to other similar offerings out there. We announced our performance was comparable to that of Lambda and Lambda@Edge, where Cloudflare Workers outperformed at 210% and 298% respectively.

At the time, we wanted to see how we measured up against all comparable offerings, but not all offerings were generally available. As a result, we weren’t able to report on how Workers compared to another edge-based solution, such as Fastly’s Compute@Edge.

Today, we’re excited to report that Cloudflare Workers is 196% faster than Fastly’s Compute@Edge based on the time to first byte from the tests we ran on 50 nodes using Catchpoint’s data from across the world.

As we have done in the past, we executed a function that simply returns the current time and measured wait time to first byte (the length of time between a client making an HTTP request to when the client receives the first byte of the request’s response, after DNS, connection, and TLS handshake). The tests were performed on November 8, 2021, using a free tier account for both Cloudflare Workers and Fastly’s Compute@Edge.

The code we ran on both providers was exactly identical — a small function that returns all request headers:

addEventListener('fetch', event => event.respondWith(handleRequest(event)));


async function handleRequest(event) {
  let requestHeaders = Object.fromEntries(event.request.headers)

  return new Response(JSON.stringify(requestHeaders), {status: 200})
};

Blue: Cloudflare Workers
Green: Compute@Edge

If you want to explore the results on your own, here is a link to the data.

By building on our global network that we’re constantly accelerating, leveraging isolates, and driving cold starts to zero, we’re able to offer our customers ludicrously fast speeds across the board.

Now, let’s move on to an update on how Cloudflare’s broader network performance has continued to improve!

Measuring What Matters

To quantify network performance, we have to get enough data from around the world across all manner of different networks comparing ourselves with other providers. We used Real User Measurements (RUM) to fetch a 100kb file from several different providers. Users around the world report the performance of different providers.  The more users who report the data, the higher fidelity the signal is.  The goal is to provide an accurate picture of where different providers are faster, and more importantly, where Cloudflare can improve. You can read more about the methodology in the original Speed Week blog post here.

In the process of quantifying network performance, it became clear where we were not the fastest everywhere. After Birthday Week, we found 601 country/network pairs where we were more than 100ms behind the leading provider (where a country/network pair is defined as the performance of a network within a particular country).

We are constantly going through the process of figuring out why we were slow — and then improve. The challenges we faced were unique to each network and highlighted a variety of different issues that are prevalent on the Internet. We’re going to deep dive into a couple of networks, and show how we diagnosed and then improved performance.

But before we do, here are the results of our efforts in the past two weeks.

Cloudflare has become number one in TCP Connection Time in 79 new networks. This graph shows the number of networks where we ranked number 1 for TCP Connection Time during Full Stack Week compared to Birthday Week:

*Performance is defined by TCP connection time across top 1000 networks in the world by number of IPv4 addresses advertised

We’ve become faster in 79 more networks thanks to our efforts, which have represented a growth of 14% in networks where we were the fastest.  Here’s a chart showing our ranking distribution comparing Birthday Week and Full Stack Week:

Now that we’ve talked about how we’ve improved, let’s share our stories about chasing peak performance across the world — each with a different set of challenges.

Placing Traffic Properly in Peru

Our first stop for improving network performance was in Peru. We observed that a lot of users in Lima were actually getting sent to Chile to be served. Cloudflare has multiple locations in Peru, so this shouldn’t have happened. Sending traffic to Chile caused us to be ranked fourth on that particular network in the country. Our engineers knew that the best way to get to number one was to ensure that all the Lima traffic stayed within the country, so we decided to look at why so much of our traffic was getting routed outside the country.

The reason that so much traffic was being routed outside the country was due to the network provider distributing traffic to Cloudflare unevenly, and too many users were sent to one specific location. Our network has a series of checks and fail-safes that allow us to ensure that even if this happens, our users will continue to see a good experience. The checks were being engaged here because of the uneven distribution of traffic to our locations in Lima; however, the traffic was being sent out of the country.

To fix the situation in the short term, we decided to do a bit of manual load balancing across our locations in Lima while building automation to remove the need for manual actions in the future. We took one of the locations that was seeing the most traffic and stopped advertising some prefixes from that location. The hypothesis we had was that the traffic would simply flow to the other Lima locations instead of Chile, and everything would balance out, improving the TCP connect time for everyone while keeping the traffic in the country. We started to make the change on a small portion of our free traffic, and our hypothesis  proved correct.  At that point, we deployed the change in a larger scope, and the P90 Client TCP RTT dropped from 240ms to 60ms.

As a result, Cloudflare is now number one in network performance in Peru.

Slimming Down Latencies in Sri Lanka

Our next example takes us halfway around the world to Sri Lanka, where we found a network provider who was routing requests from their users to Newark.

1 * * *
2 100.85.0.1 3.061ms 2.522ms 2.728ms
3 198.51.100.146 AS29766 3.651ms 1.855ms 2.715ms
4 198.51.100.145 AS29766 3.438ms 3.225ms 2.805ms
5 222.165.177.150 AS9329 2.233ms 2.272ms 2.843ms
6 222.165.177.145 AS9329 2.703ms 2.862ms 2.291ms
7 103.87.125.253 AS45489 3.658ms 3.708ms 3.613ms
8 103.87.124.245 AS45489 120.027ms 120.665ms 120.471ms
9 103.87.124.146 AS45489 115.597ms 115.863ms 115.178ms
10 50.208.235.157 be-107-2008-pe01.60hudson.ny.ibone.comcast.net AS7922 249.884ms 249.475ms 250.063ms -> going from Sri Lanka to New York
11 96.110.41.145 be-4101-cs01.newyork.ny.ibone.comcast.net AS7922 267.839ms 267.979ms 268.719ms
12 96.110.34.34 be-3112-pe12.111eighthave.ny.ibone.comcast.net AS7922 262.647ms 261.272ms 262.272ms
13 66.208.233.106 AS7922 262.378ms 258.948ms 258.057ms
14 172.70.108.4 AS13335 268.974ms 280.475ms 268.158ms
15 172.67.182.209 AS13335 267.329ms 266.466ms 266.593ms

This understandably caused significant latency problems, and Cloudflare was ranked fourth in Sri Lanka on this network as a result. Even though Colombo is a relatively small location, we moved as much traffic as possible and advertised through the location to improve the user experience and reduce the potential amount of traffic sent to Newark.

Once this was done, we noticed that the P90 Client TCP RTT dropped from 150ms to 50ms.

However, even though we were advertising all of our ranges through Colombo and our performance in aggregate improved, this provider was still sending traffic for some Cloudflare prefixes to Newark. We reached out to the provider and let them know about this user-impacting change they made.

After doing all of these things, Cloudflare moved from fourth in Sri Lanka to number one.

Update on Birthday Week

All of these network changes and more have allowed Cloudflare to become number one in network performance in more networks than before. During Birthday Week, we announced that we were faster in more networks than our competitors. Out of the top 1,000 networks in the world (by number of IPv4 addresses advertised), here’s how Cloudflare performed during Birthday Week (September 2021):

As of Full Stack Week (November 2021), we further improved our position to be faster in 79 new networks:

But we haven’t just increased our performance on the last mile, we’ve even gotten better on Time to Last Byte as well. Here’s how the landscape looked leading up to Birthday Week (September 2021):

And here’s the network landscape now (November 2021):

Cloudflare is also committed to being the fastest provider in every country. Network performance by country is a moving target, and is largely driven by users who are accessing at any given day. Also, looking at network performance in aggregate across countries for long time frames can leave a lot of data out. That being said, this is a world map using the data that was to show the countries with the fastest network provider during Birthday Week (Sept 2021):

Here’s how it looks two months later during Full Stack Week (November 2021):

Long Tail Latency

The running theme of these performance updates has always been the long tail of issues to solve. Ironing out these kinks on our network is critical to ensure that we provide premiere performance as we grow.

Our team has put in a lot of effort and yielded some great results, but we’re constantly trying to be faster. We’ve automated the discovery of performance issues like these, and we’re looking to build automation that will detect and remediate different classes of these issues to stay on top of network performance in the future.

Tracking performance like this doesn’t just make one number faster; it helps improve the performance of your entire stack, making everything lightning-fast.

We have one more innovation week coming in 2021, and we’ll be back to report on further progress on optimizing our performance globally.

Source:: CloudFlare