Skylake bug: a detective story

Joris Giovannangeli
Published in Ahrefs · Jun 28, 2017 · 16 min read

It was a dark and stormy night; the Skylake CPU buzzed with excitement, and then, suddenly, the hyperthreads started to lock up…

Or something like that.

This week, a new erratum for the Intel Skylake and Kaby Lake processor families was brought to public attention on the Debian mailing list, and then on various social media and news outlets.

We had been investigating this issue since January together with the core OCaml team, as we were struggling with a mysterious bug affecting our developers' machines and, ultimately, our production systems, resulting in corruption of important data in our databases.

At Ahrefs, we operate a fleet of thousands of servers, running a wide variety of services (a huge web crawler, among others). At this scale, dealing with unexpected application behavior is common. While we try to reduce the probability of software not functioning as expected, bugs are sadly a part of our everyday life. Even though the underlying hardware can be thought of as more reliable and less prone to bugs than the software components running on it, issues can still arise in unexpected ways. As the number of servers grows, it is not unusual to observe hardware faults that prevent the system from functioning as specified.

Encountering such problems in CPUs is certainly not frequent, but a look at the errata lists published by any manufacturer shows that each CPU model contains a fair number of bugs. This story is about a bug in the Skylake processor microcode leading to incorrect code execution under certain conditions. This is certainly scary at first sight: how can we trust our system if we cannot trust its main component? Yet, like software bugs, processor defects can be identified and contained, and we can take action to prevent them from impacting the operation of the infrastructure.

We do not know the full implications of this particular bug, especially the security implications in the case of untrusted code execution. But we'd like to tell the story of this erratum from our point of view, to provide some context and show that dealing with it was not much different from dealing with any ordinary software flaw. While this post covers our own perspective on this adventure, we would like to thank Mark Shinwell, Xavier Leroy, Frédéric Bour, and everyone involved in the Mantis issue and on the OCaml IRC channel for their help and the time they spent investigating with us. Update: Xavier Leroy told his own side of the story in another blog post.

Setting the scene

Our story starts in late 2016 after some of our backend developers received new laptops to work on. After a few days Enguerrand Decorne noticed unusual crashes during compilation of our OCaml codebase.

This issue, considered mildly annoying at first, seemed to affect only Enguerrand’s machine. For a few days no other machine would exhibit the same behavior, so we figured this was a fault specific to his system configuration.

However, concerns grew after we witnessed the generation of invalid machine code and, later on, after the deployment of a service on one of our new clusters of Skylake Xeon processors led to the insertion of corrupted data into our storage system. The priority rose from annoying to potentially critical. Other developers started working together to obtain more information and assess the impact on our infrastructure, and soon after we were able to reproduce the issue on several machines.

The remainder of this post is a technical description of the steps taken to ensure that our systems were operating safely. It is intended to show that such a low-level CPU issue is not necessarily fatal: in less than two weeks, with the great help of the core OCaml developers, we identified the conditions of the crash and set up a workaround.

Tracking down crashes in OCaml

Most of our backend code is written in OCaml, a high-level and expressive language supporting a functional programming style (among others), which allows us to develop robust systems with ease thanks to its strong type system and maturity.

The compiler segfaults were definitely a surprise, since this shouldn't happen for any program written in OCaml: the type system and other features (such as automatic bounds checking) usually guard us against such errors. However, a stack overflow is one possible source of segfaults (when a non-tail-recursive function recurses too deeply), so our first intuition was to increase the stack size when running the compiler. This didn't change anything, and the reported fault address wasn't anywhere near the stack address bound.

Before witnessing the crash on other machines, we suspected a failure in the virtualization software used by the two developers who were able to reproduce the crash, both of whom use VMware as part of their development workflow. We tried early on to switch to VirtualBox, but the migration proved fruitless as the crashes kept appearing. After a short while we began encountering the same issue on physical machines, so we ruled out a virtualization software bug.

The usual debugging process for crashing OCaml code didn’t prove effective — we needed to narrow down our approach.

OCaml ships with two backends: a bytecode interpreter and a native compiler. We were able to reproduce the issue both with the natively compiled compiler and with the compiler running on the bytecode interpreter. Consequently, this ruled out a miscompilation in the code emitted by the OCaml compiler: the OCaml runtime itself was misbehaving.

The runtime code is written in C and implements low-level functionality, including the garbage collector used by both backends. After rebuilding the runtime with debug symbols, we were able to retrieve a proper stack trace and core dump. The stack trace pointed to the garbage collector's mark phase. OCaml's GC is a classic generational mark-and-sweep collector: the mark phase walks the heap starting from pointers on the stack and other registered root values, and marks every reachable block of memory.

Further inspection of the crashing frame and fault address with gdb revealed that the marking code had encountered a corrupted block header with invalid size information, causing what looked like a buffer overrun. Each memory block allocated on the OCaml heap begins with a header word storing metadata used by the GC, including a tag describing the kind of value present in the block and the size of the block. The crash happened when the mark code attempted to scan an array whose header claimed it was more than 1 TB large.
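As a rough illustration, here is a simplified sketch of the 64-bit block header layout. The macro names are modelled on the runtime's headers, but the code below is illustrative rather than a verbatim copy of the OCaml sources.

    /* Simplified view of an OCaml heap block header on 64-bit platforms. */
    #include <stdint.h>

    typedef uintptr_t header_t;

    #define Tag_hd(hd)     ((hd) & 0xFF)      /* low 8 bits: kind of value      */
    #define Color_hd(hd)   (((hd) >> 8) & 3)  /* 2 bits: GC mark colour         */
    #define Wosize_hd(hd)  ((hd) >> 10)       /* remaining bits: size in words  */

    /* With 8-byte words, a garbled header whose size field reads around
       2^37 words describes a "block" of roughly 1 TB, far larger than the
       real heap, and scanning it runs straight off mapped memory. */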

This was obviously not the cause of the problem but rather the consequence: something corrupted the header word after this block had been properly allocated, postponing the crash until the next GC cycle. It was the right time to escalate the issue to the OCaml bugtracker, after isolating a proper test case to reproduce the issue.

A set of strange leads

Escalating the issue to Mantis made us take a step back and gather our findings, and we quickly got great feedback from the OCaml core team.

At this point, what does the problem look like?

We only had sparse information, but dmesg gave us an interesting data point. When a page fault occurs and the kernel detects an incorrect memory access, it logs a line in the kernel log buffer containing the fault address, the instruction pointer and the stack pointer.

[22985.879907] ocamlopt.opt[48221]: segfault at af8 ip 00005564455169bd sp 00007ffc9f36b130 error 4 in ocamlopt.opt[556445006000+613000]

Next to the three addresses, which were already available in the core dumps, an error code is reported. This decimal number is actually a bitset, and the flags are documented in the Linux kernel sources in arch/x86/mm/fault.c. Error 4 can thus be read as a read-access page fault from user mode, trying to read memory which had not been previously mmap'ed.
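For reference, the flag bits, as they were defined in the kernel sources of that era, decode as follows; error 4 has only PF_USER set, i.e. a read, from user mode, of a non-present page:

    /* Page fault error code bits, mirroring arch/x86/mm/fault.c. */
    enum x86_pf_error_code {
        PF_PROT  = 1 << 0,   /* 0: page not present, 1: protection fault  */
        PF_WRITE = 1 << 1,   /* 0: read access,      1: write access      */
        PF_USER  = 1 << 2,   /* 0: kernel mode,      1: user mode         */
        PF_RSVD  = 1 << 3,   /* reserved bit set in a page table entry    */
        PF_INSTR = 1 << 4,   /* fault on an instruction fetch             */
    };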

The error codes reported following our crashes involved protection faults or accesses to unmapped addresses, which corroborated our earlier buffer overrun hypothesis. More interestingly, we witnessed a crash with the PF_RSVD flag set. This left us puzzled; none of us had ever seen such a fault before. Apparently it indicates that the page table was somehow corrupted, with some entries having non-zero bits that are reserved by the x86 architecture specification.

It was scary that the corruption would escape the process address space; to our limited knowledge, it could only have been caused by a kernel issue or potentially by hardware issues such as memory errors. Yet we were able to reproduce this on several machines with different kernel versions and different hardware. We had blamed virtual machines earlier, but that theory had already been debunked. We had no explanation at this point, and pursuing this lead would have required intimate knowledge of virtual memory implementation that we didn't have.

One developer wasn't able to reproduce the problem at all on his machine after hours of testing, but something was fishy: it didn't sound right that an OCaml runtime bug would be able to modify the page table. Maybe it was some corner case involving reserved addresses, but this was beyond our reach. Out of ideas, it was time to get some assistance from tools intended to track memory corruption, like ASan and UBSan.

Running ASan didn't yield any meaningful results. Valgrind was tried later, following advice from the OCaml team, but every tool prevented the crash from occurring. Quickly reproducing the bug for testing required running code in a loop, keeping the CPU and memory fully busy.

This was harder to do on developers' machines, due to limited resources and other processes running, and Address Sanitizer would only increase the resource usage. Dedicating a powerful server would make further investigation more comfortable and increase the likelihood of reproducing the crash with instrumented code.

To our great surprise, it was not possible to reproduce the problem on a server machine, with or without instrumented code. This is when we realised that all the machines exhibiting the crashes had processors from the Intel Skylake family, while the server and the other developer machines had CPUs from the Broadwell family.

The hardware, an unusual suspect

In the meantime, several core OCaml developers had been closely investigating the issue, auditing recent changes in the runtime, and had identified a few suspicious changes and known bugs.

Certainly they were more qualified for this task, but it acted as an incentive to examine the history of this bug from our angle. At first, we had assumed that the bug was specific to the new laptop and its virtual machines. But that could not explain why the crash had never manifested on older workstations also equipped with Skylake processors: several other developers had been using them for a few months, and only noticed the crash after Enguerrand raised awareness of the issue.

What had changed, besides Skylake? Only a few weeks before, an internal migration from OCaml 4.02.3 to 4.03.0 had been rolled out across our codebase. Intrigued, we went ahead and tested OCaml 4.02.3 again, which showed no memory corruption after several tests. It was time to browse the OCaml changelog for runtime-related entries. The search stopped quickly on a promising item: the optimisation level used to build the OCaml C runtime had been raised from -O1 to -O2.

Could the optimizations be digging out undefined behavior in the C code, leading to bad assumptions in the GC code and corrupting the heap? Rebuilding the runtime with -O1 did not corrupt memory, so the source of the corruption was in the runtime and was triggered by some gcc-specific optimization pass. This sounded like undefined behavior, although some of the information we had pointed towards a hardware bug.

The next day, Xavier Leroy commented on the bug report, noting that the crash had been observed in the past. Another industrial OCaml user was affected, and they had discovered that Hyper-Threading was part of the necessary conditions. After running the test case for several hours on several machines with HT disabled in the UEFI setup, without observing a single crash, it was clear we were facing a similar situation. This led to the hypothesis of a hardware bug:

Is it crazy to imagine that gcc -O2 on the OCaml 4.03 runtime produces a specific instruction sequence that causes hardware issues in (some steppings of) Skylake processors with hyperthreading? Perhaps it is crazy.

This possibility had struck us too, motivated by the Hyper-Threading involvement, the page table corruption and the Skylake-specific set of conditions.

This issue certainly had a strange profile, but nobody was ready to fully embrace the CPU bug hypothesis yet. We convinced ourselves that disabling HT could affect cache pressure and surface some undefined behaviour.

HT could also explain the non-determinism, since cache pressure depends on timing and scheduling. None of us had sufficient experience in this area to assess the strength of such a hypothesis, and we did not quite buy it for a single-threaded OCaml program. Our debugging motto says that “assumptions are not facts”.

It was time to browse the Intel errata list and attempt to update the CPU microcode. The errata descriptions are formulated in vague terms, but none of the issues disclosed at the time looked similar to the situation under investigation. Unfortunately, the available microcode updates had no fix waiting for us either. The OCaml developers investigated the errata list from their side, but the lack of detailed information turned this into a fruitless and complex task.

In the absence of a better alternative, we focused on pinpointing the exact source of the crash as if it were a software bug, in the hope of either finding a code issue or ruling out that hypothesis while gathering more detailed data. We needed a way to identify the problematic code and find a workaround. From our side, it was not only a matter of finding out whether or not there was a bug in the OCaml code; more crucially, we needed a guarantee on the quality of the generated code running critical services in production.

Identifying the offending code

The other OCaml user affected by this issue reported that they had solved the problem by switching to another C compiler: building the runtime with clang instead of GCC would prevent the GC from crashing. They also suggested obtaining a diff of the generated assembly. Indeed, once built with clang, the runtime would not crash. But clang generates widely different assembly from GCC, and we did not have the resources to analyse several hundred thousand lines of changes.

If we could isolate the problematic C code, comparing the generated code would be easier. The problem had the form of a well-known nail:

  • around 50 C files composing the OCaml runtime,
  • a good state (when built with gcc -O1),
  • and a bad state (when built with gcc -O2).

This nail comes with a precious hammer: bisection.

The bisection approach had a downside on this occasion. Any state can be labeled bad with certainty as soon as the test crashes, but we would need to wait several hours to be confident enough to trust a non-crashing test as a good data point. Reproducibility was not always consistent, and a non-crashing state could be a false negative still waiting to trigger the conditions leading to the crash. A reduction of the search space was necessary.

All the core dumps we had showed that the fault was caused by a corrupted heap block header, and our test case involved the compiler. The OCaml compiler is not 100% deterministic, and the I/O primitives and Unix environment in the runtime can affect timings and allocation patterns. But it sounded sensible to assume that the code corrupting a heap block header was also the code reading and writing those headers: the major GC.

This hypothesis made bisecting fast: the first file we tried, major_gc.c, turned out to be the one. To make sure it was not a subtle linker issue reordering symbols or code blocks, we tried a few other files and confirmed that changing the optimization level of other files alone made no difference.

But the generated code difference was still way too large. Bringing this topic up on the OCaml IRC channel led to some useful input. We were taught that gcc supports an attribute to enable specific optimizations at the function level, using __attribute__((optimize("options,..."))). Following the same strategy, it was easy to trace the source of the malfunctioning code to the sweep_slice function, which implements the sweeping phase of the classic mark-and-sweep garbage collector for the old generation.
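As a hedged illustration (the function bodies below are made up, not the runtime's), this is what per-function optimization settings look like with GCC: everything else in the file keeps the command-line optimization level, while the annotated functions get their own settings.

    /* Compile this one function at -O2 even if the file is built with -O1. */
    __attribute__((optimize("O2")))
    static long sum_sizes(const long *sizes, long n)
    {
      long total = 0;
      for (long i = 0; i < n; i++)
        total += sizes[i];
      return total;
    }

    /* Individual passes can be toggled too; GCC treats a bare string as an
       -f option, so this enables only value range propagation for clamp(). */
    __attribute__((optimize("tree-vrp")))
    static long clamp(long x)
    {
      return x < 0 ? 0 : x;
    }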

Ignoring the subtle details of incremental GC, the sweep_slice function is the last pass of a normal major collection cycle. It is responsible for scanning all blocks in the major heap and returning unreachable blocks to the free list of unallocated space.

The bulk of this function is a switch taking action for each block depending on its status:
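The snippet below is a simplified sketch of that switch, reconstructed from the shape described here rather than copied from major_gc.c; the helper merge_into_free_list is a stand-in for the real free-list merging code.

    /* hp points at the first block of the chunk being swept. */
    while (work > 0) {
      header_t hd  = Hd_hp (hp);             /* read the block header       */
      char *next   = hp + Bhsize_hd (hd);    /* address of the next block   */
      work -= Whsize_hd (hd);

      switch (Color_hd (hd)) {
      case Caml_white:
        /* Unreachable block: hand it back to the free list. */
        next = merge_into_free_list (hp);
        break;
      case Caml_blue:
        /* Already on the free list: nothing to do. */
        break;
      default: /* Caml_black: a reachable, live block. */
        /* Reset the colour for the next GC cycle; this is the header
           write that ended up corrupted on the affected machines. */
        Hd_hp (hp) = Whitehd_hd (hd);
        break;
      }
      hp = next;
    }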

This finding felt consistent with the information at hand. When the block is reachable, its color (describing the reachability status of the block) is reset. If the block became unreachable (white color), it is reclaimed. In both cases, the block header is modified.

Getting back to the assembly diff.

Nobody in the team knows a great deal about assembly, and we only had a basic understanding of most of the instructions used in both versions. It quickly became obvious that the noise level in this diff, with thousands of changed lines, was still too high for us to spot anything related to the problem. The problem was getting far beyond the common knowledge of everyone in the team.

But this still sounded like your day-to-day bug tracking process. The less you know, the more careful you need to be, tackling the problem step by step. We stuck to the approach that had served us well until now: bisecting.

We went through the list of optimisation passes enabled by GCC at -O2. There is a fair number of them, and it would have been too time-consuming to try them one by one, given the time needed to trigger the crash. Yet we had a hint: memory corruption was happening semi-randomly in the garbage collector, and we were still keeping an undefined behaviour bug as a potential explanation. The culprit was likely a pass that changes the structure of the code, reordering blocks and changing conditions.

After reading the descriptions of all the switches in the detailed gcc manual, the -ftree-* pass family looked promising. This set of transformations works on the SSA-form internal representation, a widespread intermediate representation which has the benefit of being easy to read. These passes have a huge impact on the generated assembly code, moving code blocks around and using assumptions about code invariants to move, simplify or eliminate conditional checks altogether.

By looking at the output of those passes on the relevant source code, we narrowed down the list of transformations to a couple of interesting passes, one of them being -ftree-vrp, which stands for Value Range Propagation. This pass computes bounds for each name binding and propagates proofs that a value must lie in a given range.
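A made-up example (not from the runtime) of what this enables: once gcc has proved a range for a value, later checks implied by that range can be dropped.

    int table[64];

    int lookup(int i)
    {
      if (i >= 0 && i < 64) {
        /* VRP knows i lies in [0, 63] here, so this second check is
           provably always true and can be eliminated. */
        if (i < 64)
          return table[i];
      }
      return -1;
    }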

It turned out that most of the other passes depended on VRP for further optimisations. Even though the issue ended up not being a bad assumption in the computed ranges, checking this pass proved worthwhile: enabling -ftree-vrp on the sweep_slice function while everything else was built with -O1 was enough to trigger a crash.

GCC provides very good diagnostic output, and after reading the manual we found the -fdump-tree-* switches, which dump the SSA form before and after a specific pass. The output is designed to be read by a human and provides meaningful names and source code locations, alongside the ranges propagated by the VRP pass. We spent some time studying the output and matching the differences in the SSA tree between the crashing and non-crashing code.

Examining the bounds and invariants derived by gcc, it was clear that no incorrect assumption had been made.

The only meaningful observable change was the removal of a recheck of the loop condition in the else branch of the sweep_slice function, after Value Range Propagation proved that the condition was invariant in that branch.

Often, reading the code carefully is the fastest way to find a bug. But after spending hours staring at the major GC code, it was clear enough that this check removal should not cause any semantic changes.

In the process, we identified a suspicious bit of code where a signed long variable was promoted to unsigned according to the C standard conversion rules, which changed the bounds derived by gcc by assuming the value was always positive. But after some thinking we realised it made no difference at the assembly level, and although wrong, this assumption was not used anywhere.
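For illustration only (the variable names are made up, not the runtime's), this is the kind of conversion involved: comparing a signed long against an unsigned long converts the signed operand to unsigned, and gcc may then derive a range that excludes negative values.

    #include <stdio.h>

    static void check(long work, unsigned long limit)
    {
      /* `work' is implicitly converted to unsigned long here: a negative
         value wraps around to a huge positive one and the branch is not
         taken, so inside the branch gcc can assume work >= 0. */
      if (work < limit)
        printf("in range: %ld\n", work);
      else
        printf("out of range: %ld\n", work);
    }

    int main(void)
    {
      check(5, 10);    /* in range */
      check(-1, 10);   /* out of range: -1 wraps to ULONG_MAX */
      return 0;
    }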

We were now ready to rule out the possibility of a bug in the OCaml runtime. It was still possible that the GCC backend had a bug and was miscompiling this particular shape of code, so we were back at the assembly level again. After writing some awk formatting scripts to clean up the assembly and minimise noise in the diff (by renaming labels, detecting spurious code moves, etc.), and after preventing inlining, we found a minimal assembly patch causing the crash.

There were only cosmetic differences. The test removal was propagated down to assembly and caused gcc to reorganise the layout of each switch case block.

Among those minor differences and layout changes, we noticed one particular change that affected exactly the update of the reachable block header, the very write that could have caused the header corruption. In the unoptimised version, the updating code looked like this:

For some reason, the block pointer was spilled to the stack.

Perhaps naively, and because we had earlier floated the hypothesis of HT affecting cache pressure, we spent a few hours staring at this code, checking whether we were missing something subtle that could affect the control flow of the whole function, and whether the stack location from which the pointer was reloaded could be corrupted.

Despite our lack of assembly knowledge, after spending several hours reading this tiny change, we became convinced that it made strictly no semantic difference. Reading the x86 manual carefully didn't give any hint of a subtle behavior that could be triggered. Executing either of those two instruction sequences should give exactly the same result.

Mitigating the issue

We were now quite certain it was a CPU bug.

The OCaml developers had reached the same conclusion, and were working on escalating the issue to Intel. After internal discussions we decided to keep this bug as low profile as possible since we were unsure about potential security implications, especially for JIT implementations.

Even though we had neither confirmation at this point nor any explanation of the cause of this bug, which was beyond our reach, we could take action.

The first step was to decide against getting any new Skylake-based servers until further notice. We were left with several Skylake machines, but we refrained from deploying any OCaml code on them. OCaml comes with a great package manager, opam, which supports compiler switches. A switch sets up a clean, isolated environment with specific packages and a specific compiler configuration.

We patched our internal opam repository to distribute an unoptimised runtime to all developers and moved forward, waiting for further announcements.

This situation made us realise that microcode requires regular updates, just like any other software in the stack. We raised awareness of this topic in our devops team, and they took measures to ensure we could roll out microcode updates to production easily.

Happy end

In late May, the devops team noticed a Debian package update for intel-microcode containing the following changelog entry:

Likely fix nightmare-level Skylake erratum SKL150. Fortunately,
either this erratum is very-low-hitting, or gcc/clang/icc/msvc
won’t usually issue the affected opcode pattern and it ends up
being rare.
SKL150 — Short loops using both the AH/BH/CH/DH registers and
the corresponding wide register *may* result in unpredictable
system behavior. Requires both logical processors of the same
core (i.e. sibling hyperthreads) to be active to trigger, as
well as a “complex set of micro-architectural conditions”

The erratum description immediately rang a bell as it matched the diff in the assembly we had observed. We tested the microcode update and confirmed it fixed the corruption.

Finally, our Skylake CPUs were feeling safe and the OCaml compiler was happy.

Ahrefs runs an internet-scale bot that crawls the whole Web 24/7. Our backend system is powered by a custom petabyte-scale distributed key-value storage implemented in OCaml (and some C++ and Rust). We are a small team and strongly believe in better technology leading to better solutions for real-world problems. We worship functional languages and static typing, extensively employ code generation and meta-programming, value code clarity and predictability, and are constantly seeking to automate repetitive tasks and eliminate boilerplate. And we are hiring!
