A Tour Inside Cloudflare’s G9 Servers
Cloudflare operates at a significant scale, handling nearly 10% of the Internet HTTP requests that is at peak more than 25 trillion requests through our network every month. To ensure this is as efficient as possible, we own and operate all the equipment in our 154 locations around the world in order to process the volume of traffic that flows through our network. We spend a significant amount of time specing and designing servers that makes up our network to meet our ever changing and growing demands. On regular intervals, we will take everything we’ve learned about our last generation of hardware and refresh each component with the next generation…
If the above paragraph sounds familiar, it’s a reflecting glance to where we were 5 years ago using today’s numbers. We’ve done so much progress engineering and developing our tools with the latest tech through the years by pushing ourselves at getting smarter in what we do.
Here though we’re going to blog about muscle.
Since the last time we blogged about our G4 servers, we’ve iterated one generation each of the past 5 years. Our latest generation is now the G9 server. From a G4 server comprising 12 Intel Sandybridge CPU cores, our G9 server has 192 Intel Skylake CPU cores ready to handle today’s load across Cloudflare’s network. This server is QCT’s T42S-2U multi-node server where we have 4 nodes per chassis, therefore each node has 48 cores. Maximizing compute density is the primary goal since rental colocation space and power are costly. This 2U4N chassis form factor has served us well for the past 3 generations, we’re revisiting this option once more.
Exploded picture of the G9 server’s main components. 4 sleds represent 4 nodes each with 2 24-core Intel CPUs
Each high-level hardware component has gone through their own upgrade as well for a balanced scale up keeping our stack CPU bound, making this generation the most radical revision since we moved on from using HP 5 years ago. Let’s glance through each of those components.
- Previously: 2x 12 core Intel Xeon Silver 4116 2.1Ghz 85W
- Now: 2x 24 core Intel custom off-roadmap 1.9Ghz 150W
The performance of our infrastructure is heavily directed by how much compute we can squeeze in a given physical space and power. In essence, requests per second (RPS) per Watt is a critical metric that Qualcomm’s ARM64 46 core Falkor chip had a big advantage over Intel’s Skylake 4116. Embracing the value of optionality and market competition, we made some noise.
Intel proposed to co-innovate with us an off-roadmap 24-core Xeon Gold CPU specifically made for our workload offering considerable value in Performance per Watt. For this generation, we continue using Intel as system solutions are widely available while we’re working on realizing ARM64’s benefits to production. We expect this CPU to perform with better RPS per Watt right off the bat; increasing the RPS by 200% from doubling the amount of cores, and increasing the power consumption by 174% from increasing the CPUs TDP from 85W to 150W each.
- Previously: 6x Micron 1100 512G SATA
- Now: 6x Intel S4500 480G SATA
With all the requests we foresee for G9 to process, we need to tame down the outlying and long-tail latencies we have seen in our previous SSDs. Lowering p99 and p999 latency has been a serious endeavor. To help save milliseconds in disk response time for 0.01% or even 0.001% of all the traffic we see isn’t a joke!
Datacenter grade SSDs in Intel S4500 will proliferate our fleet. These disks come with better endurance to last over the expected service life of our servers and better performance consistency with lower p95+ latency.
- Previously: dual-port 10G Solarflare Flareon Ultra SFN8522
- Now: dual-port 25G Mellanox ConnectX-4 Lx OCP
Our DDoS mitigation program is all done in userspace, so network adapter model can be anything on market as long as it supports XDP. We went with Mellanox for their solid reliability and their readily available 2x25G CX4 model. Upgrading to 25G intra-rack ethernet network is easy future-proofing since the 10G SFP+ ethernet port shares the same physical form factor as the 25G’s SFP28. Switch and NIC vendors offer models that can be configured as either 10G or 25G.
Another change is the card’s form factor itself being OCP instead of the regular HHHL. That leaves the one x16 PCI slot free for the option to integrate something else; maybe for a high capacity NVMe? Maybe a GPU? We like that our server has the room for upgrades if needed.
Rear side of G9 chassis showing all 4 sled nodes, each leaving room to add on a PCI card
- Previously: 192G (12x16G) DDR4 2400Mhz RDIMM
- Now: 256G (8x32G) DDR4 2666Mhz RDIMM
Going from 192G (12x16G) to 256G (8x32G) made practical sense. The motherboard has 12 DIMM channels, which were all populated in the G8. We want to have the ability to upgrade just in case, as well as keeping memory configuration balanced and optimal bandwidth capacity. 8x32G works well leaving 4 channels open for future upgrades.
Physical stress test
Our software stack scales nicely enough that we can confidently assume we’ll double the amount of requests having twice the amount of CPU cores compared to G8. What we need to ensure before we ship any G9 servers out to our current 154 and future PoPs is that there won’t be any design issues pertaining to thermal nor power failures. At the extreme case that all of our cores run up 100% load, would that cause our server to run above operating temperature? How much power would a whole server with 192 cores totaling 1200W TDP consume? We set out to record both by applying a stress test to the whole system.
Temperature readings were recorded off of ipmitool sdr list, then graphed showing socket and motherboard temperature. For 2U4N being such a compact form factor, it’s worth monitoring that a server running hot isn’t literally running hot. The red lines represent the 4 nodes that compose the whole G9 server under test; blue lines represent G8 nodes (we didn’t stress the G8’s so their temperature readings are constant).
Both graphs are looking fine and not out of control mostly thanks to the T42S-2U’s 4 80mm x 80mm fans capable of blowing over 90CFM; which we managed to reach their max spec RPM.
Recording the new system’s max power consumption is critical information we need to properly design our rack stack choosing the right Power Distribution Unit and ensuring we’re below the budgeted power while keeping adequate phase balancing. For example, a typical 3-phase US-rated 24-Amp PDU gives you a max power rating of 8.6 kilowatts. We wouldn’t be able to fit 9 servers powered by that same PDU if each were running at 1kW without any way to cap them.
Above graph shows our max power as the red line to reach 1.9kW, or crudely 475W per node which is excellent in a modern server. Notice the blue and yellow lines representing the G9’s 2 power supplies summing the total power. The yellow line PSU appearing off is intentional as part of our testing procedure to show the PSU’s resilience in abrupt power changes.
Stressing out all available CPU, IO, and memory along with maxing out fan RPMs combined is a good indicator for the highest possible heat and power draw this server can do. Hopefully we won’t ever see such an extreme case like this in live production environments, and we expect much milder actual results (read: we don’t think catastrophic failures to be possible).
First impression in live production
We increased capacity to one of our most loaded PoPs by adding G9 servers. The following time graphs represent a 24 hour range with how G9 performance compares with G8 in a live PoP.
Looking great! They’re doing about 2x the requests compared to G8 with about 20% less CPU usage. Note that all results here are based from non-optimized systems, so we could add more load on the G9 and have their CPU usage comparable to the G8. Additionally, they’re doing that amount with better CPU processing time shown as NGINX execution time above. You can see the latency gap between generations widening as we go towards the 99.9th percentile:
Long-tail latencies for NGINX CPU processing time (lower is better)
Talking about latency, let’s check how our new SSDs are doing on that front:
Cache disk IOPS and latency (lower is better)
The trend still holds that G9 is doing better. It’s a good thing that the G9’s SSDs aren’t seeing as many IOPS since it means we’re not hitting our cache disks as often and are able to store and process more on CPU and memory. We’ve cut the read cache hits and latency by nearly half. Less writes results in better performance consistency and longevity.
Another metric that G9 does more is power consumption, doing about 55% more then the G8. While it’s not a piece of information to brag about, it is expected when older CPUs were once rated at 85W TDP to now using ones with 150W TDP; and when considering how much work the G9 servers do:
G9 is actually 1.5x more efficient than G8. Temperature readings were checked as well. Inlet and outlet chassis temps, as well as CPU temps, are well within operating temperatures.
G9 is actually more 1.5x efficient than G8. Temperature readings were checked as well. Inlet and outlet chassis temps, as well as CPU temps, are well within operating temperatures.
Now that’s muscle! In other words for every 3 G8 servers, just 2 of those G9’s would take on the same workload. If one of our racks would normally have 9 G8 servers, than only 4 G9’s are needed. Inversely, planning to turn up a cage of 10 G9 racks would be the same as if we did 15 G8 racks!
And we got big plans to cover our entire network with G9 servers, with most of them planned for the existing cities your site most likely uses. By 2019, you’ll benefit with increased bandwidth and lower wait times. And we’ll benefit in expanding and turning up datacenters quicker and more efficiently.
Server X? Right now is exciting times at Cloudflare. Many teams and engineers are testing, porting, and implementing new stuff that can help us lower operating costs, explore new products and possibilities, and improve Quality of Service. We’re tackling problems and taking on projects that are unique in the industry.
Serverless computing like Cloudflare Workers and beyond will ask for new challenges to our infrastructure as all of our customers can program their own features on Cloudflare’s edge network.
The network architecture that was conventionally made up of routers, switches, and servers has been merged into 3-in-1 box solutions allowing Cloudflare services to be set up into locations that weren’t possible before.
The advent of NVMe and persistent memory, as well as the possibility of turning SSDs into DRAM, is redefining how we design cache servers and handle tiered caching. SSDs and memory aren’t treated as separate entities like they used to.
Hardware brings the company together like a rug in a living room. See how many links I mentioned above to show you how we’re one team dedicated to build a better Internet. Everything that we do here roots down to how we manage the tons of aluminum and silicon we’ve invested. There’s a lot of developing on our hardware to help Cloudflare grow to where we envision ourselves to be. If you’d like to contribute, we’d love to hear from you.