Navigating API Compatibility: A Case Study on Restartable Sequences and Hyrum's Law
A tutorial on how kernel 6.19's rseq optimization broke TCMalloc, how the incident illustrates Hyrum's Law, and how to understand, diagnose, and resolve such API compatibility issues.
Overview
In software engineering, Hyrum's Law states that any observable behavior of a system, documented or not, will eventually be relied upon by some user. The Linux kernel community recently faced a stark demonstration of this principle. An optimization introduced in kernel 6.19 set out to reduce the overhead of restartable sequences (rseq), a mechanism that lets user-space code operate on per-CPU data without locks or atomic instructions. The update meticulously preserved the documented API. Yet it broke Google's TCMalloc memory allocator, which had been using rseq in a way that fell outside the documented contract. This case study walks through the technical details, the underlying conflict, and the lessons for developers who maintain interfaces that evolve under the pressure of real-world usage.
Prerequisites
To follow this guide, you should be comfortable with:
- Linux kernel concepts (scheduling, system calls, per-CPU data)
- User-space memory allocation (TCMalloc, glibc malloc)
- Basic C programming and system call interfaces
- API design principles and the Linux kernel's no-regressions policy
Step-by-Step Guide: Analyzing the rseq / TCMalloc Regression
Step 1: Understand Restartable Sequences
Restartable sequences allow user-space code to manipulate per-CPU data without atomic operations, as long as the sequence can be aborted and restarted if it is interrupted. The kernel provides the rseq system call, which each thread uses to register a per-thread structure (struct rseq). A critical region ends with a single commit instruction. If the thread is preempted, migrated, or interrupted by a signal before that instruction executes, the kernel redirects it to an abort handler, which typically restarts the sequence from the beginning. This makes rseq well suited to fast per-CPU allocation: TCMalloc uses it to maintain per-CPU caches, which requires knowing which CPU the thread is running on.
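The abort-and-restart idea can be sketched as a small simulation. This is a toy model, not real rseq code: `SimThread`, its `migrations` table, and `percpu_increment` are hypothetical names invented here, and the "interruption" is scripted rather than caused by a real scheduler.

```python
from dataclasses import dataclass, field


@dataclass
class SimThread:
    """Toy stand-in for a thread (hypothetical, not a kernel structure)."""
    cpu: int = 0
    # Maps an attempt number to the CPU the simulated scheduler moves the
    # thread to in the middle of that attempt.
    migrations: dict = field(default_factory=dict)
    attempts: int = 0


def percpu_increment(thread, counters):
    """Add 1 to the counter of the thread's CPU, restarting on interruption."""
    while True:
        thread.attempts += 1
        cpu_at_start = thread.cpu                 # critical region begins
        new_value = counters[cpu_at_start] + 1    # preparation, not yet visible
        if thread.attempts in thread.migrations:  # simulated interruption
            thread.cpu = thread.migrations[thread.attempts]
            continue                              # abort path: discard and retry
        counters[cpu_at_start] = new_value        # commit: one final store
        return cpu_at_start


counters = [0, 0]
thread = SimThread(cpu=0, migrations={1: 1})  # moved to CPU 1 during attempt 1
committed_on = percpu_increment(thread, counters)
print(committed_on, counters, thread.attempts)
```

The first attempt is discarded because the thread moved before the commit, so only CPU 1's counter changes. In real rseq the kernel performs the redirection to the abort handler; the `continue` here merely stands in for that.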
Step 2: How TCMalloc Leverages rseq (Documented vs Actual Usage)
The documented rseq API expects each thread to register its struct rseq once and then use it as specified. The structure contains a cpu_id_start field that, before 6.19, the kernel updated on every context switch. TCMalloc, however, exploits an undocumented detail: it reads cpu_id_start right after the kernel has updated it, without first entering a restartable sequence. This gives it a near-zero-cost CPU identifier. Under the documented contract as described here, the field is only guaranteed to be valid inside a restartable sequence; TCMalloc reads it outside any critical region and relies on the kernel having updated it anyway. That is a clear departure from the documented contract, but it worked for years because the kernel never enforced the restriction.
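To make the field concrete, the sketch below mirrors the leading members of struct rseq with ctypes. The field order is my reading of the kernel's uapi header and should be checked against include/uapi/linux/rseq.h; `RseqHead` and `read_cpu_hint` are hypothetical names, and the structure here is filled in by hand rather than by a kernel.

```python
import ctypes


class RseqHead(ctypes.Structure):
    """Leading fields of struct rseq, mirrored for illustration only."""
    _fields_ = [
        ("cpu_id_start", ctypes.c_uint32),
        ("cpu_id", ctypes.c_uint32),
        ("rseq_cs", ctypes.c_uint64),
        ("flags", ctypes.c_uint32),
    ]


def read_cpu_hint(area):
    """Plain load of cpu_id_start with no critical region around it."""
    return area.cpu_id_start


# Values a kernel would normally write; set by hand in this sketch.
area = RseqHead(cpu_id_start=3, cpu_id=3)
offsets = {name: getattr(RseqHead, name).offset for name, _ in RseqHead._fields_}
print(offsets, read_cpu_hint(area))
```

`read_cpu_hint` is the kind of unguarded read the text describes: it is a single memory load, which is why it is so cheap, and nothing in it ties the value to a critical region.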
Step 3: Kernel 6.19 Change – Intended Performance Fix
In 6.19, kernel developers modified the rseq mechanism to reduce overhead. They changed when cpu_id_start is updated: instead of writing it on every context switch, even when no restartable sequence is active, the kernel now writes it only when a restartable sequence is registered and the thread is inside a critical region. This removed unnecessary writes, improving performance for threads that are not executing a restartable sequence. The patch authors verified that the documented API, which states that the field is only valid inside a restartable sequence, was unchanged. They did not account for a major user-space library depending on the undocumented behavior.
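The effect of the two update policies, as this section describes them, can be sketched by counting writes over a scripted sequence of context switches. `count_field_writes` and the `schedule` format are assumptions made for this illustration and do not correspond to kernel code.

```python
def count_field_writes(schedule, update_always):
    """Count cpu_id_start writes over a schedule of context switches.

    `schedule` is a list of (new_cpu, in_critical_region) pairs.
    """
    field_value, writes = 0, 0
    for new_cpu, in_critical_region in schedule:
        if update_always or in_critical_region:
            field_value = new_cpu
            writes += 1
    return field_value, writes


schedule = [(1, False), (2, False), (3, True), (0, False)]
old = count_field_writes(schedule, update_always=True)
new = count_field_writes(schedule, update_always=False)
print(old, new)
```

In this model the old policy performs four writes and the new one performs a single write. The final field value also differs: under the new policy it still names CPU 3 although the last switch moved the thread to CPU 0.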
Step 4: Identifying the Regression – TCMalloc Breaks
After the 6.19 kernel was released, Google engineers reported that TCMalloc was broken. Under the new kernel, TCMalloc would sometimes read stale cpu_id_start values, causing it to operate on the wrong CPU's cache and corrupt allocation data. The problem was intermittent and hard to reproduce because it depended on thread scheduling. TCMalloc's trick of reading the field outside a critical region now returned an outdated value because the kernel no longer updated it unconditionally.
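A minimal sketch of the failure mode, assuming the behavior described above: an allocator picks a per-CPU cache from an unguarded read of the field. `ToyKernel`, `allocate`, and `run` are hypothetical, and the caches are plain counters rather than anything resembling TCMalloc's data structures.

```python
class ToyKernel:
    """Hypothetical model: tracks the real CPU and the user-visible field."""

    def __init__(self, update_always):
        self.update_always = update_always
        self.real_cpu = 0
        self.cpu_id_start = 0

    def context_switch(self, new_cpu, in_critical_region=False):
        self.real_cpu = new_cpu
        if self.update_always or in_critical_region:
            self.cpu_id_start = new_cpu


def allocate(kernel, caches):
    """Take an object from the cache selected by an unguarded field read."""
    chosen = kernel.cpu_id_start          # read outside any critical region
    caches[chosen] -= 1
    return chosen == kernel.real_cpu      # True if the right cache was used


def run(update_always):
    kernel, caches = ToyKernel(update_always), [4, 4]
    kernel.context_switch(1)              # thread migrates to CPU 1
    return allocate(kernel, caches), caches


print(run(update_always=True))
print(run(update_always=False))
```

With unconditional updates the allocation lands in CPU 1's cache. With conditional updates the stale value sends it to CPU 0's cache, which in a real allocator another thread may be modifying at the same moment without any locking.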
Step 5: Applying Hyrum's Law – Why the Change Broke Things
Hyrum's Law says that any observable behavior becomes a de facto contract. The kernel's previous behavior (updating cpu_id_start on every context switch) was observable even though it was not documented, and TCMalloc came to depend on it. The kernel change, while respecting the formal API, violated the implicit contract that TCMalloc relied upon. This is a textbook example: a well-intentioned optimization broke a major application because the application was using the system in a way the designers never anticipated or sanctioned.
Step 6: Resolution – No-Regressions Rule Forces Accommodation
The Linux kernel has a strict no-regressions policy: a change that breaks existing user-space programs is treated as a bug and is reverted or reworked, except in rare cases where the breakage is unavoidable. This holds even when the affected program relied on undocumented behavior. The rseq maintainers therefore had to find a middle ground. They introduced a new kernel configuration option that restores the old behavior (updating the field on every context switch) when TCMalloc is detected or when compatibility is needed. This preserves the performance benefit for most users while avoiding breakage. The fix also adds a way for TCMalloc and similar libraries to request the old behavior explicitly through a new flag, which brings that usage under the formal API going forward.
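The opt-in idea can be sketched as a per-registration flag. The flag name and value below are invented for this illustration and are not a real rseq flag; `Registration` is likewise a hypothetical stand-in for whatever state the kernel keeps per thread.

```python
# Hypothetical flag, invented for this sketch; not part of the real rseq ABI.
HYPOTHETICAL_FLAG_ALWAYS_UPDATE = 0x1


class Registration:
    """Toy per-thread registration record (not the kernel's data structure)."""

    def __init__(self, flags=0):
        self.flags = flags
        self.cpu_id_start = 0

    def on_context_switch(self, new_cpu, in_critical_region=False):
        legacy = bool(self.flags & HYPOTHETICAL_FLAG_ALWAYS_UPDATE)
        if legacy or in_critical_region:
            self.cpu_id_start = new_cpu


default_thread = Registration()
legacy_thread = Registration(flags=HYPOTHETICAL_FLAG_ALWAYS_UPDATE)
for reg in (default_thread, legacy_thread):
    reg.on_context_switch(5)
print(default_thread.cpu_id_start, legacy_thread.cpu_id_start)
```

The design point is that the cost of unconditional updates is paid only by threads that ask for it, and the request itself is now a visible, documented part of the interface instead of an accident of the implementation.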
Common Mistakes
- Assuming users follow the specification: Always assume that every observable side effect will be depended upon, even if undocumented. Test your changes against major consumers.
- Neglecting automated regression detection: The kernel now runs continuous integration tests with TCMalloc and other allocators. Without such tests, breakage like this surfaces only after a release, as it did here.
- Over-optimizing without consulting stakeholders: A quick review of TCMalloc's source code before making the change would have revealed the dependency.
- Ignoring Hyrum's Law during API design: When designing an interface, explicitly note which behaviors are guaranteed and which are implementation details that may change.
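Several of the points above come down to running real consumers against both the old and the new behavior before shipping. A minimal sketch of such a compatibility check follows; the two implementations and two consumers are toy functions invented for this illustration, modeled on the behavior described in this case study.

```python
def old_impl(new_cpu, in_critical_region):
    """Toy 'before' behavior: the field always follows the thread."""
    return new_cpu


def new_impl(new_cpu, in_critical_region):
    """Toy 'after' behavior: the field follows only inside a critical region."""
    return new_cpu if in_critical_region else 0


def spec_consumer(impl):
    """Reads the field only inside a critical region (documented use)."""
    return impl(2, True) == 2


def shortcut_consumer(impl):
    """Reads the field outside a critical region (undocumented use)."""
    return impl(2, False) == 2


def compatibility_matrix(impls, consumers):
    """Run every consumer against every implementation; True means it works."""
    return {(c.__name__, i.__name__): c(i) for c in consumers for i in impls}


matrix = compatibility_matrix(
    [old_impl, new_impl], [spec_consumer, shortcut_consumer]
)
regressions = sorted(
    c for (c, i), ok in matrix.items() if i == "new_impl" and not ok
)
print(regressions)
```

Only the consumer that relies on the undocumented behavior shows up as a regression, which is the signal a maintainer needs before release. A real check would exercise actual libraries on actual kernels; this sketch only shows the shape of the comparison.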
Summary
This case study shows how a kernel optimization that strictly preserved a documented API still broke a major user-space library because of an undocumented but observable behavior. The Linux kernel’s no-regressions rule forced a pragmatic compromise, but the incident underscores the importance of understanding Hyrum's Law. Developers should proactively identify potential implicit contracts, test against real-world consumers, and design APIs that clearly separate guaranteed features from implementation ephemera. By learning from this rseq / TCMalloc regression, we can build more robust interfaces that evolve without breaking the ecosystem.