Analysis of the EPYC 145% performance gain in Cloudflare Gen 12 servers

Cloudflare’s network spans more than 330 cities in over 120 countries, serving over 60 million HTTP requests per second and 39 million DNS queries per second on average. These numbers will continue to grow, and at an accelerating pace, as will Cloudflare’s infrastructure to support them. While we can continue to scale out by deploying more servers, it is also paramount for us to develop and deploy more performant and more efficient servers.

At the heart of each server is the processor (central processing unit, or CPU). Even though many aspects of a server rack can be redesigned to improve the cost to serve a request, CPU remains the biggest lever, as it is typically the primary compute resource in a server, and the primary enabler of new technologies.

Cloudflare’s 12th Generation server with the AMD EPYC 9684X (codenamed Genoa-X) is 145% more performant and 63% more efficient than its Gen 11 predecessor. These are big numbers, but where do the performance gains come from? Cloudflare’s hardware system engineering team performed a sensitivity analysis on three variants of the 4th generation AMD EPYC processor to understand the contributing factors.

AMD offers three architectural variants of the 4th generation AMD EPYC processor:

  • mainstream classic Zen 4 cores, codenamed Genoa

  • efficiency optimized dense Zen 4c cores, codenamed Bergamo

  • cache optimized Zen 4 cores with 3D V-cache, codenamed Genoa-X

Figure 1 (from left to right): AMD EPYC 9654 (Genoa), AMD EPYC 9754 (Bergamo), AMD EPYC 9684X (Genoa-X)

    Key features common across the 4th Generation AMD EPYC processors:

    • Up to 12x Core Complex Dies (CCDs)

    • Each core has a private 1MB L2 cache

    • The CCDs connect to memory, I/O, and each other through an I/O die

    • Configurable Thermal Design Power (cTDP) up to 400W

    • Supports up to 12 channels of DDR5-4800 at 1 DPC (one DIMM per channel)

    • Supports up to 128 lanes of PCIe Gen 5

    Classic Zen 4 Cores (Genoa):

    • Each Core Complex (CCX) has 8x Zen 4 Cores (16x Threads)

    • Each CCX has a shared 32 MB L3 cache (4 MB/core)

    • Each CCD has 1x CCX

    Dense Zen 4c Cores (Bergamo):

    • Each CCX has 8x Zen 4c Cores (16x Threads)

    • Each CCX has a shared 16 MB L3 cache (2 MB/core)

    • Each CCD has 2x CCX

    Classic Zen 4 Cores with 3D V-cache (Genoa-X):

    • Each CCX has 8x Zen 4 Cores (16x Threads)

    • Each CCX has a shared 96 MB L3 cache (12 MB/core)

    • Each CCD has 1x CCX

    For more information on 4th generation AMD EPYC Processors architecture, see: https://www.amd.com/system/files/documents/4th-gen-epyc-processor-architecture-white-paper.pdf 

The following table summarizes the specifications of the AMD EPYC 7713 CPU in our Gen 11 server against the three CPU candidates, one from each variant of the 4th generation AMD EPYC architecture:

| CPU Model | AMD EPYC 7713 | AMD EPYC 9654 | AMD EPYC 9754 | AMD EPYC 9684X |
| --- | --- | --- | --- | --- |
| Series | Milan | Genoa | Bergamo | Genoa-X |
| # of CPU Cores | 64 | 96 | 128 | 96 |
| # of Threads | 128 | 192 | 256 | 192 |
| Base Clock | 2.0 GHz | 2.4 GHz | 2.25 GHz | 2.4 GHz |
| All Core Boost Clock | ~2.7 GHz* | 3.55 GHz | 3.1 GHz | 3.42 GHz |
| Total L3 Cache | 256 MB | 384 MB | 256 MB | 1152 MB |
| L3 Cache per Core | 4 MB | 4 MB | 2 MB | 12 MB |
| Maximum Configurable TDP | 240 W | 400 W | 400 W | 400 W |

* AMD EPYC 7713 all core boost clock is based on Cloudflare production data, not the official specification from AMD

    cf_benchmark

    Readers may remember that Cloudflare introduced cf_benchmark when we evaluated Qualcomm’s ARM chips, using it as our first pass benchmark to shortlist AMD’s Rome CPU for our Gen 10 servers and to evaluate our chosen ARM CPU Ampere Altra Max against AWS Graviton 2. Likewise, we ran cf_benchmark against the three candidate CPUs for our 12th Gen servers: AMD EPYC 9654 (Genoa), AMD EPYC 9754 (Bergamo), and AMD EPYC 9684X (Genoa-X). The majority of cf_benchmark workloads are compute bound, and given more cores or higher CPU frequency, they score better. The graph and the table below show the benchmark performance comparison of the three CPU candidates with Genoa 9654 as the baseline, where > 1.00x indicates better performance.

     

| Benchmark | Genoa 9654 (baseline) | Bergamo 9754 | Genoa-X 9684X |
| --- | --- | --- | --- |
| openssl_pki | 1.00x | 1.16x | 1.01x |
| openssl_aead | 1.00x | 1.20x | 1.01x |
| luajit | 1.00x | 0.86x | 1.00x |
| brotli | 1.00x | 1.11x | 0.98x |
| gzip | 1.00x | 0.87x | 1.01x |
| go | 1.00x | 1.09x | 1.00x |

    Bergamo 9754 with 128 cores scores better in openssl_pki, openssl_aead, brotli, and go benchmark suites, and performs less favorably in luajit and gzip benchmark suites. Genoa-X 9684X (with significantly more L3 cache) doesn’t offer a significant boost in performance for these compute-bound benchmarks.
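As a rough way to summarize the table above in a single number (this is an illustrative sketch, not part of Cloudflare's methodology), one can take the geometric mean of the per-suite ratios, which is the conventional way to average performance ratios:

```python
import math

# Per-suite performance ratios vs. the Genoa 9654 baseline, from the table above:
# openssl_pki, openssl_aead, luajit, brotli, gzip, go.
ratios = {
    "bergamo_9754": [1.16, 1.20, 0.86, 1.11, 0.87, 1.09],
    "genoa_x_9684x": [1.01, 1.01, 1.00, 0.98, 1.01, 1.00],
}

def geomean(xs):
    """Geometric mean: the n-th root of the product of n ratios."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

for cpu, rs in ratios.items():
    print(f"{cpu}: {geomean(rs):.3f}x vs. Genoa 9654")
```

The geometric mean lands near 1.04x for Bergamo and essentially 1.00x for Genoa-X, consistent with the observation that the extra L3 cache does little for these compute-bound suites.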

    These benchmarks are representative of some of the common workloads Cloudflare runs, and are useful in identifying software scaling issues, system configuration bottlenecks, and the impact of CPU design choices on workload-specific performance. However, the benchmark suite is not an exhaustive list of all workloads Cloudflare runs in production, and in reality, the workloads included in the benchmark suites are almost certainly not the exclusive workload running on the CPU. In short, though benchmark results can be informative, they do not represent a good indication of production performance when a mix of these workloads run on the same processor.

    Performance simulation

    To get an early indication of production performance, Cloudflare has an internal performance simulation tool that exercises our software stack to fetch a fixed asset repeatedly. The simulation tool can be configured to fetch a specified fixed-size asset and configured to include or exclude services like WAF or Workers in the request path. Below, we show the simulated performance between the three CPUs for an asset size of 10 KB, where >1.00x indicates better performance.

     

| | Milan 7713 | Genoa 9654 | Bergamo 9754 | Genoa-X 9684X |
| --- | --- | --- | --- | --- |
| Lab simulation performance multiplier | 1.00x | 2.20x | 1.95x | 2.75x |

Based on these results, Bergamo 9754, which has the highest core count but the smallest L3 cache per core, is the least performant of the three candidates, followed by Genoa 9654. The Genoa-X 9684X, with the largest L3 cache per core, is the most performant. This data suggests that our software stack is highly sensitive to L3 cache size, in addition to core count and CPU frequency, and warrants a deeper sensitivity analysis of our workload against a few high-level CPU design points: core scaling, frequency scaling, and L2/L3 cache size scaling.

    Sensitivity analysis

    Core sensitivity

    Number of cores is the headline specification that practically everyone talks about, and one of the easiest improvements CPU vendors can make to increase performance per socket. The AMD Genoa 9654 has 96 cores, 50% more than the 64 cores available on the AMD Milan 7713 CPUs that we used in our Gen 11 servers. Is more always better? Does Cloudflare’s primary workload scale with core count and effectively utilize all available cores?

The table below shows the result of a core scaling experiment performed on an AMD Genoa 9654 configured with 96, 80, 64, and 48 cores, achieved by disabling two CCDs (8 cores each) at each step. The result is encouraging: Cloudflare's simulated primary workload scales linearly with core count on AMD Genoa CPUs.

| Core count | Core increase | Performance increase |
| --- | --- | --- |
| 48 | 1.00x | 1.00x |
| 64 | 1.33x | 1.39x |
| 80 | 1.67x | 1.71x |
| 96 | 2.00x | 2.05x |
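One way to read the table above is to compute scaling efficiency, the measured speedup divided by the ideal linear speedup (a quick illustrative calculation, not a tool Cloudflare describes using):

```python
# Core-scaling data from the experiment above (48 cores = 1.00x baseline).
cores = [48, 64, 80, 96]
perf = [1.00, 1.39, 1.71, 2.05]

# Scaling efficiency: measured speedup divided by the ideal linear speedup.
efficiency = [p / (c / cores[0]) for c, p in zip(cores, perf)]
for c, e in zip(cores, efficiency):
    print(f"{c} cores: {e:.2f} scaling efficiency")
```

Efficiency stays within a few percent of 1.0 at every step; the slight superlinearity is plausibly measurement noise or boost-frequency headroom freed up by the disabled CCDs, so the data is consistent with linear scaling.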

    TDP sensitivity

Thermal Design Power (TDP) is the maximum amount of heat generated by a CPU that the cooling system is designed to dissipate, but it more commonly refers to the power consumption of the processor under a maximal theoretical load. The AMD Genoa 9654's default TDP is 360W, but it can be configured up to 400W. Is more always better? Does Cloudflare continue to see meaningful performance improvement up to 400W, or does performance stagnate at some point?

The table below shows the result of sweeping the cTDP of the AMD Genoa 9654 (in power determinism mode) from 240W to 400W. Note that the step size is not uniform.

    Cloudflare’s simulated primary workload continues to see incremental performance improvements up to the maximum configurable 400W, albeit at a less favorable perf/watt ratio.

Looking at TDP sensitivity data is a quick and easy way to identify whether performance stagnates at some power point, but what does power sensitivity actually measure? Several factors contribute to CPU power consumption, but let's focus on one of the primary ones: dynamic power consumption. Dynamic power consumption is approximately C·V²·f, where C is the switched load capacitance, V is the supply voltage, and f is the frequency. In modern processors like the AMD Genoa 9654, the CPU dynamically scales its voltage along with frequency, so theoretically, CPU dynamic power is loosely proportional to f³. In other words, measuring TDP sensitivity is effectively measuring the frequency sensitivity of a workload. Does the data agree? Yes!

| cTDP (W) | All core boost frequency (GHz) | Perf (rps) / baseline |
| --- | --- | --- |
| 240 | 2.47 | 0.78x |
| 280 | 2.75 | 0.87x |
| 320 | 2.93 | 0.93x |
| 340 | 3.13 | 0.97x |
| 360 | 3.30 | 1.00x |
| 380 | 3.40 | 1.03x |
| 390 | 3.465 | 1.04x |
| 400 | 3.55 | 1.05x |
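The f³ relationship can be loosely checked against the sweep above. A minimal sketch (my own illustrative model, not Cloudflare's): treat total power as a fixed static component plus a dynamic component proportional to f³, fit the two free parameters from the sweep's endpoints, and compare the model against the intermediate points. Note that cTDP is a configured cap rather than measured draw, so the fit is only indicative:

```python
# cTDP sweep from the table above: (all-core boost frequency GHz, configured TDP W).
points = [(2.47, 240), (2.75, 280), (2.93, 320), (3.13, 340),
          (3.30, 360), (3.40, 380), (3.465, 390), (3.55, 400)]

# Model: P = P_static + k * f^3, with P_static and k fit from the endpoints.
(f0, p0), (f1, p1) = points[0], points[-1]
k = (p1 - p0) / (f1 ** 3 - f0 ** 3)
p_static = p0 - k * f0 ** 3

for f, p in points:
    predicted = p_static + k * f ** 3
    print(f"{f:5.3f} GHz: model {predicted:5.1f} W vs configured {p} W")
```

Under this simple model the implied static floor comes out around 160W, and the f³ curve tracks every configured power point to within roughly 8%, which is consistent with TDP sensitivity being largely a proxy for frequency sensitivity.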

    Frequency sensitivity

    Instead of relying on an indirect measure through the TDP, let’s measure frequency sensitivity directly by sweeping the maximum boost frequency.

Above 3 GHz, the data shows that Cloudflare's primary workload sees roughly a 2% incremental improvement for every 0.1 GHz increase in all-core average frequency. We hit the 400W power cap at 3.545 GHz. This is notably higher than the ~2.7 GHz all-core boost frequency our Gen 11 servers with the AMD Milan 7713 see in production, or the ~2.4 GHz we see in our performance simulation.

    L3 cache size sensitivity

What about L3 cache size sensitivity? L3 cache size is one of the primary design choices and major differences between the trio of Genoa, Bergamo, and Genoa-X: Genoa 9654 has 4 MB of L3 per core, Bergamo 9754 has 2 MB per core, and Genoa-X 9684X has 12 MB per core. The L3 cache is the last and largest on-chip "memory" bank before the CPU has to access memory on DIMMs outside the chip, which takes significantly more CPU cycles.

We ran an experiment on the Genoa 9654 to check how performance scales with L3 cache size. The L3 cache per core was reduced through MSR writes (this could also be done with resource-control mechanisms such as Intel RDT), and increased by disabling physical cores in a CCD, which reduces the number of cores sharing the fixed 32 MB L3 cache per CCD and effectively grows the L3 cache available per core. Below is the result of the experiment, where >1.00x indicates better performance:

| L3 cache size (vs. 4 MB/core baseline) | 0.25x | 0.5x | 0.75x | 1x | 1.14x | 1.33x | 1.60x | 2.00x |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| rps/core (vs. baseline) | 0.67x | 0.78x | 0.89x | 1.00x | 1.08x | 1.15x | 1.25x | 1.31x |
| L3 cache miss rate per CCD | 56.04% | 39.15% | 30.37% | 23.55% | 22.39% | 19.73% | 16.94% | 14.28% |

Even though the expectation was that the impact of L3 cache size would be diminished by the faster DDR5 and larger memory bandwidth, Cloudflare's simulated primary workload is quite sensitive to L3 cache size. The L3 cache miss rate dropped from 56.04% with only 1 MB of L3 per core to 14.28% with 8 MB per core. Changing the L3 cache size by 25% affects performance by approximately 11%, and performance continues to increase up to 2x the baseline L3 cache size, though the gains start to diminish at that point.
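A compact way to characterize the miss-rate data above is a power-law fit, miss rate ∝ size^(-α), done in log-log space (an illustrative analysis of the published numbers, not something the original experiment describes):

```python
import math

# L3 size per core (MB) and measured L3 miss rate (%) from the table above.
# Sizes are the multipliers 0.25x..2.00x applied to the 4 MB/core baseline.
sizes = [1.0, 2.0, 3.0, 4.0, 4.57, 5.33, 6.4, 8.0]
miss = [56.04, 39.15, 30.37, 23.55, 22.39, 19.73, 16.94, 14.28]

# Least-squares fit of miss_rate ~ size^(-alpha) in log-log space.
xs = [math.log(s) for s in sizes]
ys = [math.log(m) for m in miss]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
    sum((x - mx) ** 2 for x in xs)
alpha = -slope
print(f"miss rate ~ size^-{alpha:.2f}")
```

The fitted exponent comes out around 0.66, i.e. the miss rate falls somewhat faster than the classic square-root-of-cache-size rule of thumb (α = 0.5) would predict, which is consistent with the workload's hot working set fitting progressively better as L3 grows.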

Do we see the same behavior when comparing Genoa 9654, Bergamo 9754, and Genoa-X 9684X? We ran an experiment comparing the impact of L3 cache size while controlling for core count and all-core boost frequency, and we again saw significant deltas. Halving the L3 cache from 4 MB/core to 2 MB/core reduces performance by 24%, roughly matching the experiment above. However, tripling the cache from 4 MB/core to 12 MB/core only increases performance by 25%, less than the previous experiment would suggest. This is likely because part of the gain in that experiment came from reduced cache contention due to the disabled cores, a side effect of how the test was set up. Nevertheless, these are significant deltas!

| L3/core | 2 MB/core | 4 MB/core | 12 MB/core |
| --- | --- | --- | --- |
| Perf (rps) / baseline | 0.76x | 1x | 1.25x |

    Putting it all together

The table below summarizes how each factor from the sensitivity analysis above contributes to the overall performance gain. An additional 6% to 14% of performance improvement is unaccounted for, and is attributed to other factors such as the larger L2 cache, higher memory bandwidth, and miscellaneous CPU architecture changes that improve IPC.

     

| | Milan 7713 | Genoa 9654 | Bergamo 9754 | Genoa-X 9684X |
| --- | --- | --- | --- | --- |
| Lab simulation performance multiplier | 1x | 2.2x | 1.95x | 2.75x |
| Performance multiplier due to core scaling | 1x | 1.5x | 2x | 1.5x |
| Performance multiplier due to frequency scaling* | 1x | 1.32x | 1.21x | 1.29x |
| Performance multiplier due to L3 cache size scaling | 1x | 1x | 0.76x | 1.25x |
| Performance multiplier due to other factors (larger L2 cache, higher memory bandwidth, miscellaneous IPC improvements) | 1x | 1.11x | 1.06x | 1.14x |

* Note: Milan 7713 all-core frequency is ~2.4 GHz when running the simulated workload at 100% CPU utilization.
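The factors in the summary table compose multiplicatively: the "other factors" row is the residual that closes the gap between the named factors and the measured lab multiplier. A quick check of that composition:

```python
import math

# Per-CPU multipliers from the summary table above, vs. Milan 7713:
# (core scaling, frequency scaling, L3 cache scaling, other factors).
factors = {
    "genoa_9654": (1.5, 1.32, 1.00, 1.11),
    "bergamo_9754": (2.0, 1.21, 0.76, 1.06),
    "genoa_x_9684x": (1.5, 1.29, 1.25, 1.14),
}
measured = {"genoa_9654": 2.20, "bergamo_9754": 1.95, "genoa_x_9684x": 2.75}

# Multiply the factors together and compare against the measured multiplier.
composed = {cpu: math.prod(fs) for cpu, fs in factors.items()}
for cpu in measured:
    print(f"{cpu}: factors compose to {composed[cpu]:.2f}x, "
          f"measured {measured[cpu]:.2f}x")
```

For all three CPUs the product of the four factors reproduces the lab simulation multiplier to within rounding, e.g. 1.5 × 1.32 × 1.00 × 1.11 ≈ 2.2x for the Genoa 9654.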

    Performance evaluation in production

How do these CPU candidates perform with real-world traffic and an actual production workload mix? The table below summarizes lab-simulation and production performance relative to the Milan 7713 baseline. Genoa-X 9684X continues to outperform in production.

In addition, the Gen 12 server equipped with the Genoa-X delivered this outstanding performance while consuming only 1.5x more power per system than our Gen 11 server with the Milan 7713. In other words, we see a 63% increase in performance per watt. The Genoa-X 9684X provides the best TCO improvement among the three options, and was ultimately chosen as the CPU for our Gen 12 server.

     

| | Milan 7713 | Genoa 9654 | Bergamo 9754 | Genoa-X 9684X |
| --- | --- | --- | --- | --- |
| Lab simulation performance multiplier | 1x | 2.2x | 1.95x | 2.75x |
| Production performance multiplier | 1x | 2x | 2.15x | 2.45x |
| Production performance per watt multiplier | 1x | 1.33x | 1.38x | 1.63x |

    The Gen 12 server with AMD Genoa-X 9684X is the most powerful and the most power efficient server Cloudflare has built to date. It serves as the underlying platform for all the incredible services that Cloudflare offers to our customers globally, and will help power the growth of Cloudflare infrastructure for the next several years with improved cost structure. 

    Hardware engineers at Cloudflare work closely with our infrastructure engineering partners and externally with our vendors to design and develop world-class servers to best serve our customers. 

    Come join us at Cloudflare to help build a better Internet!
