Securing Memory at EPYC Scale
Security is a serious business, one that we do not take lightly at Cloudflare. We have invested a lot of effort into ensuring that our services, both external and internal, are protected by meeting or exceeding industry best practices. Encryption is a huge part of our strategy, as it is embedded in nearly every process we have. At Cloudflare, we encrypt data both in transit (on the network) and at rest (on the disk). Both practices address some of the most common vectors used to exfiltrate information, and they protect sensitive data from attackers. But what about data currently in use?
Can encryption or any technology eliminate all threats? No, but as Infrastructure Security, it’s our job to consider worst-case scenarios. For example, what if someone were to steal a server from one of our data centers? How can we leverage the most reliable, cutting-edge, innovative technology to secure all data on that host if it were in the wrong hands? Would it be protected? And, in particular, what about the server’s RAM?
Data in random access memory (RAM) is usually stored in the clear. This can leave data vulnerable to software or hardware probing by an attacker on the system. Extracting data from memory isn’t an easy task, but with the rise of persistent memory technologies, additional attack vectors are possible:
- Dynamic random-access memory (DRAM) interface snooping
- Installation of hardware devices that access host memory
- Freezing and stealing dual in-line memory modules (DIMMs)
- Stealing non-volatile dual in-line memory modules (NVDIMMs)
So, what about enclaves? Hardware manufacturers have introduced Trusted Execution Environments (also known as enclaves) to help create security boundaries by isolating software execution at runtime so that sensitive data can be processed in a trusted environment, such as a secure area inside an existing processor or a Trusted Platform Module.
While this allows developers to shield applications in untrusted environments, it doesn’t effectively address all of the physical system attacks mentioned previously. Enclaves were also meant to run small pieces of code. You could run an entire OS in an enclave, but there are limitations and performance issues in doing so.
This isn’t meant to bash enclave usage; we just wanted a better solution for encrypting all memory at scale. We expected performance to be compromised, and conducted tests to see just how much.
Time to get EPYC
Because our tenth-generation “Gen X” servers use AMD processors, we looked at an interesting security feature within the System on a Chip architecture of the AMD EPYC line. Secure Memory Encryption (SME) is an x86 instruction set extension introduced by AMD and available in the EPYC processor line. SME provides the ability to mark individual pages of memory as encrypted using standard x86 page tables. A page that is marked encrypted will be automatically decrypted when read from DRAM and encrypted when written to DRAM. SME can therefore be used to protect the contents of DRAM from physical attacks on the system.
Sounds complicated, right? Here’s the secret: It isn’t 😀
SME consists of two components:
- AES-128 encryption engine: Embedded in the memory controller. It is responsible for encrypting and decrypting data in main memory when an appropriate key is provided via the Secure Processor.
- AMD Secure Processor (AMD-SP): An on-die 32-bit ARM Cortex A5 CPU that provides cryptographic functionality for secure key generation and key management. Think of this like a mini hardware security module that uses a hardware random number generator to generate the 128-bit key(s) used by the encryption engine.
How It Works
We had two options available to us when it came to enabling SME. The first option, regular SME, requires setting a bit in a model-specific register, MSR 0xC001_0010[SMEE]. This bit controls whether a page table entry encryption bit can be set:
- 0 = memory encryption features are disabled
- 1 = memory encryption features are enabled
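As a rough illustration of the enable bit described above, the sketch below decodes SMEE from a raw value of MSR 0xC001_0010. It assumes SMEE is bit 23 of that register, which is how the Linux kernel headers define it. Reading the real register requires root privileges and platform-specific tooling, so the value is passed in as a plain integer here; the function name and the sample values are hypothetical.

```python
# Assumption: SMEE is bit 23 of MSR 0xC001_0010 (SYSCFG), matching the
# definition used in the Linux kernel headers.
SYSCFG_MSR = 0xC001_0010
SMEE_BIT = 23


def smee_enabled(syscfg_value: int) -> bool:
    """Return True if the SMEE bit is set in a raw SYSCFG MSR value."""
    return bool((syscfg_value >> SMEE_BIT) & 1)


# Hypothetical raw values, for illustration only.
print(smee_enabled(0x0000_0000_0080_0000))  # bit 23 set
print(smee_enabled(0x0000_0000_0000_0000))  # bit 23 clear
```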
After memory encryption is enabled, a physical address bit (C-Bit) is used to mark whether a memory page is protected. The operating system sets this bit of the physical address to 1 in the page table entry (PTE) to indicate that the page should be encrypted. This causes any data assigned to that memory space to be automatically encrypted and decrypted by the AES engine in the memory controller.
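To make the C-Bit mechanism concrete, here is a minimal sketch of setting and testing that bit in a 64-bit page table entry. The bit position is platform-specific (the processor reports it through CPUID function 0x8000001F), so the value 47 used here is only an assumed example, and the PTE value is made up for illustration.

```python
# Assumption: the C-bit position varies by platform and is reported by
# CPUID Fn8000_001F EBX[5:0]; 47 is used here purely as an example.
C_BIT_POSITION = 47


def mark_encrypted(pte: int, c_bit: int = C_BIT_POSITION) -> int:
    """Return the PTE with the C-bit set, requesting encryption for the page."""
    return pte | (1 << c_bit)


def is_encrypted(pte: int, c_bit: int = C_BIT_POSITION) -> bool:
    """Return True if the C-bit is set in the PTE."""
    return bool((pte >> c_bit) & 1)


# Hypothetical PTE: a page frame address plus the present and writable flags.
pte = 0x0000_0001_2345_6003
encrypted_pte = mark_encrypted(pte)
print(hex(encrypted_pte), is_encrypted(encrypted_pte), is_encrypted(pte))
```

Only the one bit changes; the page frame address and the other flag bits in the entry are left as they were.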
Becoming More Transparent
While arbitrarily flagging which page table entries we want encrypted is nice, our objective is to ensure that we are incorporating the full physical protection of SME. This is where the second mode of SME came in, Transparent SME (TSME). In TSME, all memory is encrypted regardless of the value of the encrypt bit on any particular page. This includes both instruction and data pages, as well as the pages corresponding to the page tables themselves.
Enabling TSME is as simple as:
2. Enabling kernel support with the following flag:
After a reboot you should see the following in the dmesg output:
$ sudo dmesg | grep SME
[ 2.537160] AMD Secure Memory Encryption (SME) active
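If you want to automate this check across a fleet, a small script can look for the same message. The sketch below is one possible approach that works on kernel log text passed in as a string; the function name is hypothetical, and the exact wording of the message may differ between kernel versions, so treat the pattern as an assumption based on the output shown above.

```python
import re

# Assumption: the kernel prints the same message shown in the dmesg output
# above; other kernel versions may word it differently.
SME_ACTIVE = re.compile(r"AMD Secure Memory Encryption \(SME\) active")


def sme_active(dmesg_text: str) -> bool:
    """Return True if any kernel log line reports SME as active."""
    return any(SME_ACTIVE.search(line) for line in dmesg_text.splitlines())


sample = "[    2.537160] AMD Secure Memory Encryption (SME) active"
print(sme_active(sample))
```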
To weigh the pros and cons of implementation against the potential risk of a stolen server, we had to test the performance of enabling TSME. We took a test server that mirrored a production edge metal with the following specs:
- Memory: 8 x 32GB 2933MHz
- CPU: AMD 2nd Gen EPYC 7642 with SMT enabled and running NPS4 mode
- OS: Debian 9
- Kernel: 5.4.12
The performance tools we used were:
- STREAM
- cryptsetup benchmark
- Benchmarky
We used a custom STREAM binary with 24 threads, using all available cores, to measure the sustainable memory bandwidth (in MB/s). Four synthetic computational kernels are run, with the output of each kernel being used as an input to the next. The best rates observed are reported for each choice of thread count.
Our STREAM results showed a performance variation of 2.6% to 4.2%, with a mean of 3.7%. These were the highest numbers measured, and they fell below the expected performance impact of more than 5%.
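For readers unfamiliar with STREAM, the four kernels it chains together (Copy, Scale, Add, Triad) can be sketched in a few lines. This is a toy pure-Python illustration of the kernel sequence and of how a MB/s figure is derived, not the custom binary used in our tests; the array size, the nominal 8 bytes per element, and the function name are assumptions, and the rates it prints say nothing about real memory bandwidth.

```python
import time


def run_stream(n: int = 200_000, scalar: float = 3.0):
    """Run the four STREAM-style kernels once and return nominal MB/s rates."""
    a, b, c = [1.0] * n, [2.0] * n, [0.0] * n
    word = 8  # nominal bytes per element (a C double); an assumption here
    rates = {}

    def timed(name, arrays_touched, kernel):
        start = time.perf_counter()
        kernel()
        elapsed = max(time.perf_counter() - start, 1e-9)
        rates[name] = arrays_touched * n * word / 1e6 / elapsed

    def copy():   # c = a
        c[:] = a

    def scale():  # b = scalar * c
        b[:] = [scalar * x for x in c]

    def add():    # c = a + b
        c[:] = [x + y for x, y in zip(a, b)]

    def triad():  # a = b + scalar * c
        a[:] = [y + scalar * z for y, z in zip(b, c)]

    # Each kernel consumes the output of the previous one.
    timed("Copy", 2, copy)
    timed("Scale", 2, scale)
    timed("Add", 3, add)
    timed("Triad", 3, triad)
    return rates, (a[0], b[0], c[0])


rates, finals = run_stream()
for name, mb_per_s in rates.items():
    print(f"{name}: {mb_per_s:.1f} MB/s (nominal)")
```

The real benchmark repeats each kernel several times and reports the best rate, which is why the results above are described as the best rates observed.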
While cryptsetup is normally used for encrypting disk partitions, it has a benchmarking utility that will report on a host’s cryptographic performance by iterating key derivation functions using memory only:
$ sudo cryptsetup benchmark
# Tests are approximate using memory only (no storage IO).
PBKDF2-sha1       1162501 iterations per second for 256-bit key
PBKDF2-sha256     1403716 iterations per second for 256-bit key
PBKDF2-sha512     1161213 iterations per second for 256-bit key
PBKDF2-ripemd160   856679 iterations per second for 256-bit key
PBKDF2-whirlpool   661979 iterations per second for 256-bit key
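To compare runs with and without TSME, it helps to turn this output into numbers. The following is a minimal sketch that parses the PBKDF2 lines shown above into a dictionary; the function name is hypothetical, and the pattern assumes the line layout in this sample, which may vary between cryptsetup versions.

```python
import re

# Matches lines such as:
#   PBKDF2-sha256 1403716 iterations per second for 256-bit key
PBKDF_LINE = re.compile(r"^PBKDF2-(\S+)\s+(\d+)\s+iterations per second")


def parse_pbkdf(output: str) -> dict:
    """Map each PBKDF2 hash name to its iterations-per-second figure."""
    results = {}
    for line in output.splitlines():
        match = PBKDF_LINE.match(line.strip())
        if match:
            results[match.group(1)] = int(match.group(2))
    return results


SAMPLE = """\
# Tests are approximate using memory only (no storage IO).
PBKDF2-sha1       1162501 iterations per second for 256-bit key
PBKDF2-sha256     1403716 iterations per second for 256-bit key
PBKDF2-sha512     1161213 iterations per second for 256-bit key
PBKDF2-ripemd160   856679 iterations per second for 256-bit key
PBKDF2-whirlpool   661979 iterations per second for 256-bit key
"""

print(parse_pbkdf(SAMPLE))
```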
Benchmarky is a homegrown tool provided by our Performance team to run synthetic workloads against a specific target to evaluate performance of different components. It uses Cloudflare Workers to send requests and read stats on responses. In addition to that, it also reports versions of all important stack components and their CPU usage. Each test runs 256 concurrent clients, grabbing a cached 10kB PNG image from a performance testing endpoint, and calculating the requests per second (RPS).
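Benchmarky itself is internal, but the core measurement it reports, requests per second under a fixed number of concurrent clients, can be approximated with a short sketch. Everything below is an assumption for illustration: the function names, the thread-based clients, and the stand-in `fake_fetch` callable, which would be replaced by a real HTTP request in practice.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def measure_rps(fetch, clients: int = 256, requests_per_client: int = 10):
    """Run `fetch` from concurrent clients; return (completed, requests/sec)."""

    def client(_client_id):
        # Each client issues its requests back to back and counts successes.
        return sum(1 for _ in range(requests_per_client) if fetch())

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=clients) as pool:
        completed = sum(pool.map(client, range(clients)))
    elapsed = max(time.perf_counter() - start, 1e-9)
    return completed, completed / elapsed


def fake_fetch() -> bool:
    """Hypothetical stand-in for fetching the cached test image."""
    time.sleep(0.001)
    return True


completed, rps = measure_rps(fake_fetch, clients=16, requests_per_client=20)
print(f"{completed} requests completed, {rps:.0f} RPS")
```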
In the majority of test results, performance decreased by a nominal amount, less than we expected. AMD’s official white paper on SME states that encryption and decryption of memory through the AES engine does incur a small amount of additional latency for DRAM memory accesses, depending on the workload. Across all 11 data points, performance dropped by an average of only 0.699%. Even at scale, enabling this feature has reduced the worry that any data could be exfiltrated from a stolen server.
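An average drop like the one above is simply the mean of the per-test percentage differences between baseline and TSME runs. Here is a minimal sketch of that arithmetic; the function names are hypothetical and the sample pairs are made-up placeholder values, not our measurements.

```python
from statistics import mean


def percent_drop(baseline: float, with_tsme: float) -> float:
    """Percentage by which `with_tsme` falls below `baseline`."""
    return (baseline - with_tsme) / baseline * 100.0


def mean_percent_drop(pairs) -> float:
    """Average percentage drop over (baseline, with_tsme) result pairs."""
    return mean(percent_drop(base, tsme) for base, tsme in pairs)


# Hypothetical (baseline, TSME) requests-per-second pairs, illustration only.
sample_pairs = [(1000.0, 990.0), (2000.0, 1990.0)]
print(f"{mean_percent_drop(sample_pairs):.3f}%")
```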
While we wait for other hardware manufacturers to add support for total memory encryption, we are happy that AMD has set the bar high and is protecting our next generation edge hardware.