Gen X Performance Tuning
We are using AMD 2nd Gen EPYC 7642 for our tenth generation “Gen X” servers. We found many aspects of this processor compelling such as its increase in performance due to its frequency bump and cache-to-core ratio. We have partnered with AMD to get the best performance out of this processor and today, we are highlighting our tuning efforts that led to an additional 6% performance.
Thermal Design Power & Dynamic Power
Thermal design power (TDP) and dynamic power, amongst others, play a critical role when tuning a system. Many share a common belief that thermal design power is the maximum or average power drawn by the processor. The 48-core AMD EPYC 7642 has a TDP rating of 225W which is just as high as the 64-core AMD EPYC 7742. It comes to mind that fewer cores should translate into lower power consumption, so why is the AMD EPYC 7642 expected to draw just as much power as the AMD EPYC 7742?
TDP Comparison between the EPYC 7642, EPYC 7742 and top-end EPYC 7H12
Let’s take a step back and understand that TDP does not always mean the maximum or average power that the processor will draw. At a glance, TDP may provide a good estimate about the processor’s power draw; TDP is really about how much heat the processor is expected to generate. TDP should be used as a guideline for designing cooling solutions with appropriate thermal capacitance. The cooling solution is expected to indefinitely dissipate heat up to the TDP, in turn, this can help the chip designers determine a power budget and create new processors around that constraint. In the case of the AMD EPYC 7642, the extra power budget was spent on retaining all of its 256 MiB L3 cache and the cores to operate at a higher sustained frequency as needed during our peak hours.
The overall power drawn by the processor depends on many different factors; dynamic or active power establishes the relationship between power and frequency. Dynamic power is a function of capacitance – the chip itself, supply voltage, frequency of the processor, and activity factor. The activity factor is dependent on the characteristics of the workloads running on the processor. Different workloads will have different characteristics. Some examples include Cloudflare Workers or Cloudflare for Teams, the hotspots in these or any other particular programs will utilize different parts of the processor, affecting the activity factor.
Determinism Modes & Configurable TDP
The latest AMD processors continue to implement AMD Precision Boost which opportunistically allows the processor to operate at a higher frequency than its base frequency. How much higher can the frequency go depends on many different factors such as electrical, thermal, and power limitations.
AMD offers a knob known as determinism modes. This knob affects power and frequency, placing emphasis one over the other depending on the determinism selected. There is a white paper posted on AMD’s website that goes into the nuanced details of determinism, and remembering how frequency and power are related – this was the simplest definition I took away from the paper:
Performance Determinism – Power is a function of frequency.
Power Determinism – Frequency is a function of power.
Another knob that is available to us is Configurable Thermal Design Power (cTDP), this knob allows the end-user to reconfigure the factory default thermal design power. The AMD EPYC 7642 is rated at 225W; however, we have been given guidance from AMD that this particular part can be reconfigured up to 240W. As mentioned previously, the cooling solution must support up to the TDP to avoid throttling. We have also done our due diligence and tested that our cooling solution can reliably dissipate up to 240W, even at higher ambient temperatures.
We gave these two knobs a try and got the results shown in the figures below using 10 KiB web assets over HTTPS provided by our performance team over a sustained period of time to properly heat up the processor. The figures are broken down into the average operating frequencies across all 48 cores, total package power drawn from the socket, and the highest reported die temperature out of the 8 dies that are laid out on the AMD EPYC 7642. Each figure will compare the results we obtained from power and performance determinism, and finally, cTDP at 240W using power determinism.
Performance determinism clearly emphasized stabilizing operating frequency by matching its frequency to its lowest performing core over time. This mode appears to be ideal for tuners that emphasizes predictability over maximizing the processor’s operating frequency. In other words, the number of cycles that the processor has at its disposal every second should be predictable. This is useful if two or more cores share data dependencies. By allowing the cores to work in unison; this can prevent one core stalling another. Power determinism on the other hand, maximized its power and frequency as much as it can.
Heat generated by power can be compounded even further by ambient temperature. All components should stay within safe operating temperature at all times as specified by the vendor’s datasheet. If unsure, please reach out to your vendor as soon as possible. We put the AMD EPYC 7642 through several thermal tests and determined that it will operate within safe operating temperatures. Before diving into the figure below, it is important to mention that our fan speed ramped up over time, preventing the processor from reaching critical temperature; nevertheless, this meant that our cooling solution worked as intended – a combination of fans and a heatsink rated for 240W. We have yet to see the processors throttle in production. The figure below shows the highest temperature of the 8 dies laid out on the AMD EPYC 7642.
An unexpected byproduct of performance determinism was frequency jitter. It took a longer time for performance determinism to achieve steady-state frequency, which contradicted predictable performance that performance determinism mode was meant to achieve.
Finally, here are the deltas out from the real world with performance determinism and TDP set to factory default of 225W as the baseline. Deltas under 10% can be influenced by a wide variety of factors especially in production, however, none of our data showed a negative trend. Here are the averages.
Nodes Per Socket (NPS)
The AMD EPYC 7642 physically lays out its 8 dies across 4 different quadrants on a single package. By having such a layout, AMD supports dividing its dies into NUMA domains and this feature is called Nodes Per Socket or NPS. Available NPS options differ model-to-model, the AMD EPYC 7642 supports 4, 2, and 1 node(s) per socket. We thought it might be worth our time exploring this option despite the fact that none of these choices will end up yielding a shared last level cache. We did not observe any significant deltas from the NPS options in terms of performance.
Tuning the processor to yield an additional 6% throughput or requests per second out of the box has been a great start. Some parts will have more rooms for improvement than others. In production we were able to achieve an additional 2% requests per seconds with power determinism, and we achieved another 4% by reconfiguring the TDP to 240W. We will continue to work with AMD as well as internal teams within Cloudflare to identify additional areas of improvement. If you like the idea of supercharging the Internet on behalf of our customers, then come join us.