A cost-effective and extensible testbed for transport protocol development
This was originally published on Perf Planet’s 2019 Web Performance Calendar.
At Cloudflare, we develop protocols at multiple layers of the network stack. In the past, we focused on HTTP/1.1, HTTP/2, and TLS 1.3. Now, we are working on QUIC and HTTP/3, which are still in IETF draft, but gaining a lot of interest.
QUIC is a secure and multiplexed transport protocol that aims to perform better than TCP under some network conditions. It is specified in a family of documents: a transport layer which specifies packet format and basic state machine, recovery and congestion control, security based on TLS 1.3, and an HTTP application layer mapping, which is now called HTTP/3.
Let’s focus on the transport and recovery layer first. This layer provides a basis for what is sent on the wire (the packet binary format) and how we send it reliably. It includes how to open the connection, how to handshake a new secure session with the help of TLS, how to send data reliably and how to react when there is packet loss or reordering of packets. Also it includes flow control and congestion control to interact well with other transport protocols in the same network. With confidence in the basic transport and recovery layer, we can take a look at higher application layers such as HTTP/3.
To develop such a transport protocol, we need multiple stages of the development environment. Since this is a network protocol, it’s best to test in an actual physical network to see how works on the wire. We may start the development using localhost, but after some time we may want to send and receive packets with other hosts. We can build a lab with a couple of virtual machines, using Virtualbox, VMWare or even with Docker. We also have a local testing environment with a Linux VM. But sometimes these have a limited network (localhost only) or are noisy due to other processes in the same host or virtual machines.
Next step is to have a test lab, typically an isolated network focused on protocol analysis only consisting of dedicated x86 hosts. Lab configuration is particularly important for testing various cases – there is no one-size-fits-all scenario for protocol testing. For example, EDGE is still running in production mobile networks but LTE is dominant and 5G deployment is in early stages. WiFi is very common these days. We want to test our protocol in all those environments. Of course, we can’t buy every type of machine or have a very expensive network simulator for every type of environment, so using cheap hardware and an open source OS where we can configure similar environments is ideal.
The QUIC Protocol Testing lab
The goal of the QUIC testing lab is to aid transport layer protocol development. To develop a transport protocol we need to have a way to control our network environment and a way to get as many different types of debugging data as possible. Also we need to get metrics for comparison with other protocols in production.
The QUIC Testing Lab has the following goals:
- Help with multiple transport protocol development: Developing a new transport layer requires many iterations, from building and validating packets as per protocol spec, to making sure everything works fine under moderate load, to very harsh conditions such as low bandwidth and high packet loss. We need a way to run tests with various network conditions reproducibly in order to catch unexpected issues.
- Debugging multiple transport protocol development: Recording as much debugging info as we can is important for fixing bugs. Looking into packet captures definitely helps but we also need a detailed debugging log of the server and client to understand the what and why for each packet. For example, when a packet is sent, we want to know why. Is this because there is an application which wants to send some data? Or is this a retransmit of data previously known as lost? Or is this a loss probe which is not an actual packet loss but sent to see if the network is lossy?
- Performance comparison between each protocol: We want to understand the performance of a new protocol by comparison with existing protocols such as TCP, or with a previous version of the protocol under development. Also we want to test with varying parameters such as changing the congestion control mechanism, changing various timeouts, or changing the buffer sizes at various levels of the stack.
- Finding a bottleneck or errors easily: Running tests we may see an unexpected error – a transfer that timed out, or ended with an error, or a transfer was corrupted at the client side – each test needs to make sure every test is run correctly, by using a checksum of the original file to compare with what is actually downloaded, or by checking various error codes at the protocol of API level.
When we have a test lab with separate hardware, we have benefits, as follows:
- Can configure the testing lab without public Internet access – safe and quiet.
- Handy access to hardware and its console for maintenance purpose, or for adding or updating hardware.
- Try other CPU architectures. For clients we use the Raspberry Pi for regular testing because this is ARM architecture (32bit or 64bit), similar to modern smartphones. So testing with ARM architecture helps for compatibility testing before going into a smartphone OS.
- We can add a real smartphone for testing, such as Android or iPhone. We can test with WiFi but these devices also support Ethernet, so we can test them with a wired network for better consistency.
Here is a diagram of our QUIC Protocol Testing Lab:
This is a conceptual diagram and we need to configure a switch for connecting each machine. Currently, we have Raspberry Pis (2 and 3) as an Origin and a Client. And small Intel x86 boxes for the Traffic Shaper and Edge server plus Ethernet switches for interconnectivity.
- Origin is simply serving HTTP and HTTPS test objects using a web server. Client may download a file from Origin directly to simulate a download direct from a customer’s origin server.
- Client will download a test object from Origin or Edge, using a different protocol. In typical a configuration Client connects to Edge instead of Origin, so this is to simulate an edge server in the real world. For TCP/HTTP we are using the curl command line client and for QUIC, quiche’s http3_client with some modification.
- Edge is running Cloudflare’s web server to serve HTTP/HTTPS via TCP and also the QUIC protocol using quiche. Edge server is installed with the same Linux kernel used on Cloudflare’s production machines in order to have the same low level network stack.
- Traffic Shaper is sitting between Client and Edge (and Origin), controlling network conditions. Currently we are using FreeBSD and ipfw + dummynet. Traffic shaping can also be done using Linux’ netem which provides additional network simulation features.
The goal is to run tests with various network conditions, such as bandwidth, latency and packet loss upstream and downstream. The lab is able to run a plaintext HTTP test but currently our focus of testing is HTTPS over TCP and HTTP/3 over QUIC. Since QUIC is running over UDP, both TCP and UDP traffic need to be controlled.
Test Automation and Visualization
In the lab, we have a script installed in Client, which can run a batch of testing with various configuration parameters – for each test combination, we can define a test configuration, including:
- Network Condition – Bandwidth, Latency, Packet Loss (upstream and downstream)
For example using netem traffic shaper we can simulate LTE network as below,(RTT=50ms, BW=22Mbps upstream and downstream, with BDP queue size)
$ tc qdisc add dev eth0 root handle 1:0 netem delay 25ms $ tc qdisc add dev eth0 parent 1:1 handle 10: tbf rate 22mbit buffer 68750 limit 70000
- Test Object sizes – 1KB, 8KB, … 32MB
- Test Protocols: HTTPS (TCP) and QUIC (UDP)
- Number of runs and number of requests in a single connection
The test script outputs a CSV file of results for importing into other tools for data processing and visualization – such as Google Sheets, Excel or even a jupyter notebook. Also it’s able to post the result to a database (Clickhouse in our case), so we can query and visualize the results.
Sometimes a whole test combination takes a long time – the current standard test set with simulated 2G, 3G, LTE, WiFi and various object sizes repeated 10 times for each request may take several hours to run. Large object testing on a slow network takes most of the time, so sometimes we also need to run a limited test (e.g. testing LTE-like conditions only for a sanity check) for quick debugging.
Chart using Google Sheets:
The following comparison chart shows the total transfer time in msec for TCP vs QUIC for different network conditions. The QUIC protocol used here is a development version one.
Debugging and performance analysis using of a smartphone
Mobile devices have become a crucial part of our day to day life, so testing the new transport protocol on mobile devices is critically important for mobile app performance. To facilitate that, we need to have a mobile test app which will proxy data over the new transport protocol under development. With this we have the ability to analyze protocol functionality and performance in mobile devices with different network conditions.
Adding a smartphone to the testbed mentioned above gives an advantage in terms of understanding real performance issues. The major smartphone operating systems, iOS and Android, have quite different networking stack. Adding a smartphone to testbed gives the ability to understand these operating system network stacks in depth which aides new protocol designs.
The above figure shows the network block diagram of another similar lab testbed used for protocol testing where a smartphone is connected both wired and wirelessly. A Linux netem based traffic shaper sits in-between the client and server shaping the traffic. Various networking profiles are fed to the traffic shaper to mimic real world scenarios. The client can be either an Android or iOS based smartphone, the server is a vanilla web server serving static files. Client, server and traffic shaper are all connected to the Internet along with the private lab network for management purposes.
The above lab has mobile devices for both Android or iOS installed with a test app built with a proprietary client proxy software for proxying data over the new transport protocol under development. The test app also has the ability to make HTTP requests over TCP for comparison purposes.
The Android or iOS test app can be used to issue multiple HTTPS requests of different object sizes sequentially and concurrently using TCP and QUIC as underlying transport protocol. Later, TTOTAL (total transfer time) of each HTTPS request is used to compare TCP and QUIC performance over different network conditions. One such comparison is shown below,
The table above shows the total transfer time taken for TCP and QUIC requests over an LTE network profile fetching different objects with different concurrency levels using the test app. Here TCP goes over native OS network stack and QUIC goes over Cloudflare QUIC stack.
Debugging network performance issues is hard when it comes to mobile devices. By adding an actual smartphone into the testbed itself we have the ability to take packet captures at different layers. These are very critical in analyzing and understanding protocol performance.
It’s easy and straightforward to capture packets and analyze them using the tcpdump tool on x86 boxes, but it’s a challenge to capture packets on iOS and Android devices. On iOS device ‘rvictl’ lets us capture packets on an external interface. But ‘rvictl’ has some drawbacks such as timestamps being inaccurate. Since we are dealing with millisecond level events, timestamps need to be accurate to analyze the root cause of a problem.
We can capture packets on internal loopback interfaces on jailbroken iPhones and rooted Android devices. Jailbreaking a recent iOS device is nontrivial. We also need to make sure that autoupdate of any sort is disabled on such a phone otherwise it would disable the jailbreak and you have to start the whole process again. With a jailbroken phone we have root access to the device which lets us take packet captures as needed using tcpdump.
Packet captures taken using jailbroken iOS devices or rooted Android devices connected to the lab testbed help us analyze performance bottlenecks and improve protocol performance.
iOS and Android devices different network stacks in their core operating systems. These packet captures also help us understand the network stack of these mobile devices, for example in iOS devices packets punted through loopback interface had a mysterious delay of 5 to 7ms.
Cloudflare is actively involved in helping to drive forward the QUIC and HTTP/3 standards by testing and optimizing these new protocols in simulated real world environments. By simulating a wide variety of networks we are working on our mission of Helping Build a Better Internet. For everyone, everywhere.
Would like to thank SangJo Lee, Hiren Panchasara, Lucas Pardue and Sreeni Tellakula for their contributions.