Updated March 1, 2019: The post below outlines the performance benefits of using Retpoline against the Spectre, Variant 2 (CVE-2017-5715) attack—as observed with 64-bit Windows Insider Preview Builds 18272 and later. While Retpoline is currently disabled by default on production Windows 10 client devices, we have backported the OS modifications needed to support Retpoline so that it can be used with Windows 10, version 1809 and have those modifications in the March 1, 2019 update (KB4482887).
Over the coming months, we will enable Retpoline as part of phased rollout via cloud configuration. Due to the complexity of the implementation and changes involved, we are only enabling Retpoline performance benefits for Windows 10, version 1809 and later releases.
In January 2018, Microsoft released an advisory and security updates related to a newly discovered class of hardware vulnerabilities involving speculative execution side channels (known as Spectre and Meltdown) that affect AMD, ARM, and Intel CPUs to varying degrees. If you haven’t had a chance to learn about these issues, we recommend watching The Case of Spectre and Meltdown by the team at TU Graz from BlueHat Israel, reading the blog post by Jann Horn (@tehjh) of Google Project Zero.
We have also had multiple posts detailing the internals of our implementation to handle these side-channel attacks.
- Mitigating speculative execution side channel hardware vulnerabilities
- KVA Shadow: Mitigating Meltdown on Windows
- Analysis and mitigation of L1 Terminal Fault (L1TF)
For today’s post, we have kernel developers Andrea Allievi and Chris Kleynhans describing our design and implementation of retpoline for Windowswhich improves performance of Spectre variant 2 mitigations (CVE-2017-5715) to noise-level for most scenarios. These improvements are available today in Windows Insider Builds (builds 18272 or newer, x64-only).
At a high level, the Spectre variant 2 attack exploits indirect branches to steal secrets located in higher privilege contexts (e.g. kernel-mode vs user-mode). Indirect branches are instructions where the target of the branch is not contained in the instruction itself, such as when the destination address is stored in a CPU register.
Describing the full Spectre attack is outside the scope of this article. Details are in the links above or in this whitepaper from Intel.
Our original mitigations for Spectre variant 2 made use of new capabilities exposed by CPU microcode updates to restrict indirect branch speculation when executing within kernel mode (IBRS and IBPB). While this was an effective mitigation from a security standpoint, it resulted in a larger performance degradation than we’d like on certain processors and workloads.
For this reason, starting in early 2018, we investigated alternatives and found promise in an approach developed by Google called retpoline. A full description of retpoline can be found here, but in short, retpoline works by replacing all indirect call or jumps in kernel-mode binaries with an indirect branch sequence that has safe speculation behavior.
This sequence, shown below in Figure 1, effects a safe control transfer to the target address by performing a function call, modifying the return address and then returning.
RP0: call RP2 ; push address of RP1 onto the stack and jump to RP2
RP1: int 3 ; breakpoint to capture speculation
RP2: mov [rsp], <Jump Target> ; overwrite return address on the stack to desired target
RP3: ret ; return
While this construct is not as fast as a regular indirect call or jump, it has the side effect of preventing the processor from unsafe speculative execution. This proves to be much faster than running all of kernel mode code with branch speculation restricted (IBRS set to 1). However, this construct is only safe to use on processors where the RET instruction does not speculate based on the contents of the indirect branch predictor. Those processors are all AMD processors as well as Intel processors codenamed Broadwell and earlier according to Intel’s whitepaper. Retpoline is not applicable to Skylake and later processors from Intel.
Windows requirements for Retpoline
Traditionally the transformation of indirect calls and jumps into retpolines is performed when a binary is built by the compiler. However, there are several functional requirements in Windows that make a purely compile-time implementation insufficient.
These key requirements are:
- Single binary: Windows releases are long-lived and must support a wide variety of hardware with a single set of binaries. On some hardware retpoline is not a complete mitigation because of alternate behavior of the ret instruction and retpoline must not be used. Further, future hardware may eliminate the need for retpoline entirely. Therefore, a Windows implementation of retpoline must allow the feature to be enabled and disabled at boot time using a single set of binaries, based on whether the underlying hardware is vulnerable, compatible and whether Spectre variant 2 mitigations are enabled on the system. Further, the runtime overhead of retpoline support should be minimal when the feature is disabled.
- 3rd party device drivers: A lot of the code that runs in kernel mode is not part of Windows and consists of 3rd party device driver code. Traditional retpoline would only be secure if all these drivers were recompiled with a new version of the compiler. Given the breadth of Windows 3rd party driver ecosystem, it is not realistic to expect all non-inbox 3rd party drivers to be recompiled and released to customers at the same time. Therefore, a Windows implementation of retpoline must be able to support a mixed environment, providing high performance when running drivers that have been updated, but allowing for graceful fallback to hardware-based mitigations upon entering a non-retpoline driver to preserve security.
- Driver portability: Windows drivers are not bound to a specific release of Windows, many drivers that are built today for Windows 10 will also support older versions of the operating system. Therefore, a Windows implementation of retpoline must ensure that drivers compiled with retpoline support can run on a version of Windows that does not support retpoline.
To satisfy requirement 1 and 3, we decided that binaries would ship in a non-retpolined state and then be transformed into a retpolined state by rewriting the code sequences for all indirect calls. This ensures that systems that do not use retpoline can use the binaries as compiled without needing any support for retpoline and with minimal runtime cost.
However, performing the transformation at runtime does lead to one problem. How do we know what transformations need to be applied? Disassembling and analyzing driver machine code to locate all indirect calls is not practical.
Dynamic Value Relocation Table (DVRT)
To solve this problem, we collaborated with the compiler team in Visual Studio to develop a system whereby the compiler can emit a new type of metadata into driver binaries describing each indirect call or jump in the system. This metadata takes the form of new relocation entries in the Dynamic Value Relocation Table (DVRT).
The DVRT was originally introduced back in the Windows 10 Creators Update to improve kernel address space layout randomization (KASLR). It allowed the memory manager’s page frame number (PFN) database and page table self-map to be assigned dynamic addresses at runtime. The DVRT is stored directly in the binary and contains a series of relocation entries for each symbol (i.e. address) that is to be relocated. The relocation entries are themselves arranged in a hierarchical fashion grouped first by symbol and then by containing page to allow for a compact description of all locations in the binary that reference a relocatable symbol.
At build time, the compiler keeps track of all references to these special symbols and fills out the DVRT. Then at runtime the kernel will parse the DVRT and update each symbol reference with the correct dynamically assigned address. Importantly, the kernel will skip over any DVRT entries it does not recognize (i.e. those with an unknown symbol) so adding new symbols to the DVRT does not break older versions of Windows.
These properties meant the DVRT was a perfect place to store our retpoline metadata, however the existing DVRT format needed to be extended to support retpoline.
Based on Windows requirements, we classified indirect calls/jumps into three distinct forms and each of these forms has its own type of retpoline relocation and corresponding runtime fixup.
- Import calls/jumps
- Switchtable jumps
- Generic indirect calls/jumps
Let’s talk a little about each of these types of calls.
Import calls/jumps are, as the name implies, used for calls/jumps made by a binary to functions that have been imported from another binary. When compiling with retpoline, the compiler ensures that all such calls conform to the following form:
48 FF 15 XX XX XX XX call qword ptr [_imp_<function>]
0F 1F 44 00 00 nop
The call or jmp instruction always directly references the import address table (IAT) and has 5 bytes of additional padding (to be used by the retpoline fixup).
Switchtable jumps are used for jumps made to other locations within the same function and are so-named because of their usage in implementing C/C++ switch statements. When compiling with retpoline support the compiler ensures that such calls are always made through a register and take the following form:
FF D0 jmp rax
CC CC CC int 3
Generic Indirect Calls/Jumps
All other indirect calls/jumps fall into the generic type. To simplify the retpoline relocation format and the corresponding fixup logic, the compiler ensures that all such indirect calls/jumps provide their target address in the RAX register. The exact format of the call/jump instruction however differs depending on whether it is protected by control flow guard (CFG).
Loading binaries at runtime
Now that we have a way to identify all the indirect calls/jumps in the binary, we need to apply the fixups.
The NT memory manager has long had infrastructure to apply fixups to binaries at runtime. This infrastructure was extended to understand retpoline relocations and their corresponding fixups.
But what exactly do these fixups look like? As mentioned earlier, the Windows implementation needs to support mixed environments in which some drivers are not compiled with retpoline support. This means that we cannot simply replace every indirect call with a retpoline sequence like the example shown in the introduction. We need to ensure that the kernel gets the opportunity to inspect the target of the call or jump so that it can apply appropriate mitigations if the target does not support retpoline.
For this reason, we transform every indirect call or jump into a direct call or jump to a kernel provided “retpoline stub function”. For example, an indirect call to an imported function that looks like this:
call qword ptr [_imp_ExAllocatePoolWithTag] ; Target address located at a REL32 offset
nop ; Padding
Will be replaced at runtime with a direct call to the retpoline import stub:
mov r10, qword ptr [_imp_ExAllocatePoolWithTag] ; R10 = target address
call _guard_retpoline_import_r10 ; Direct REL32 call to the stub function
There are several retpoline stub functions each of which is specialized to the type of call/jump it handles. However, each function generally performs the following steps:
- Check if the target binary supports retpoline
- Prior to transferring control to the target address, the function must determine whether the target address belongs to a driver that supports retpoline. To determine this, the kernel maintains a sparse bitmap of the entire kernel-mode address space with each bit describing a 64 KB region of the address space. Bits in this bitmap are set to 1 if and only if their corresponding region of address space belongs to a kernel-mode binary that fully supports retpoline.
- If the bitmap check determines that the target address does not belong to a retpolined binary, the stub function has to fall back to the hardware-based Spectre variant 2 mitigation (by setting IBRS to restrict branch speculation) and then perform a regular indirect call/jmp. Otherwise, the kernel does not need to set IBRS. On processors that do not support IBRS, retpoline will, instead, perform IBPB if user-to-kernel protection is enabled as described here.
- Since the target of a switch table jump is always in the same binary as the source (and therefore the target is guaranteed to support retpoline), this bitmap check is omitted from the switchtable jump stub functions.
- Check if the target address is a valid CFG target
- For CFG instrumented indirect calls/jumps the retpoline stub function is responsible for checking the kernel-mode CFG bitmap to verify that the target address given is a valid CFG call target. If this check fails, then the stub function will bugcheck the system to prevent any exploit that attempts an indirect control transfer to an invalid address.
- Transfer control to the target using a retpoline.
The usage of these stub functions ensures that we can satisfy the requirement to support mixed environments, however they do introduce one additional problem. The x64 direct call/jump instruction can only encode a target address within 2 GB of the call-site (since the target is specified by a signed 16- or 32-bit offset). Since the retpoline stub functions are implemented in the NT kernel binary this would generally mean that drivers would have to be loaded within 2 GB of the kernel binary.
To work around this requirement, all retpoline stub functions are contained within a single section of the NT kernel binary and have been carefully written to take no dependencies on their position relative to the rest of the binary. This allows us to map the physical memory pages backing the retpoline stub functions immediately after every driver in the system, giving each driver its own “copy” of the retpoline stub functions that is guaranteed to be within 2 GBof every indirect call/jump.
Indirect calls due to imported functions are by far the most common form of indirect control transfers in kernel-mode. The import call targets are determined at driver load time by processing the import address table (IAT) and remain constant throughout the driver’s lifetime. This means that most of the work provided by the retpoline import stub is unnecessary because we know at driver load time exactly where each of these calls will end up going and we know whether the target binary supports retpoline or not. Hence, we can use a much faster calling sequence.
With import optimization, we use the retpoline fixup infrastructure to replace eligible import calls with direct calls to the imported function. This eliminates the overhead of the retpoline import call stub as well as the guaranteed branch prediction miss due to retpoline itself. To be eligible for import optimization, a call must meet the following requirements:
- The call/jump must be from a retpolined binary to another retpolined binary.
- This is necessary to maintain the security guarantees of retpoline because once we’ve rewritten the indirect call into a direct call the kernel no longer gets a chance to observe the target address and enable IBRS.
- The target of the call must be within 2 GB of the call site.
- This is because as mentioned above direct call/jump instructions on x64 can only encode a 32-bit offset.
- In order to virtually guarantee that import optimization can be applied all retpolined modules, the OS loader and kernel make sure that all kernel-mode modules are packed tightly in the address space while maintaining address space layout randomizations (ASLR).
Here is an example of how the code generation for the call is modified.
Original code sequence
call [__imp_<Function>] ; Call to an imported function
nop ; 5-byte nop
Import Optimized code sequence
mov r10, [__imp_<Function>] ; R10 = target address (normal transformation)
call <Function> ; Direct REL32 call to target
Import optimization turned out to be a big performance win! Hence, even on processors where retpoline cannot be used due to alternate return instruction behavior, we still use import optimization.
Retpoline has significantly improved the performance of the Spectre variant 2 mitigations on Windows. When all relevant kernel-mode binaries are compiled with retpoline, we’ve measured ~25% speedup in Office app launch times and up to 1.5-2x improved throughput in the Diskspd (storage) and NTttcp(networking) benchmarks on Broadwell CPUs in our lab. It is enabled by default in the latest Windows Client Insider Fast builds (for builds 18272 and higher on machines exposing compatible speculation control capabilities) and is targeted to ship with 19H1.
To check if retpoline and import optimizations are enabled, you can use the PowerShell cmdlet Get-SpeculationControlSettings. You can also use NtQuerySystemInformation to programmatically query retpoline status.
For a more in-depth look, here is a talk by Andrea Allievi at BlueHat 2018 talking about retpoline on Windows.
Give the latest builds a try and let us know your experience!
Join 20,000+ people and subscribe to our newsletter to get the latest tech tutorials and news directly to your inbox.