There are four “elemental” algorithms for garbage collection: reference counting, mark/sweep, compacting, and copying. The latter three are referred to as tracing collectors.
Reference counting tracks how many copies of each pointer there are in the heap. When the heap contains zero copies of a pointer, the memory it points to can be reclaimed.
The three tracing techniques all start the garbage collection phase by iteratively following the pointers stored in registers, the stack, and globals (called roots) to determine the live subset of allocated data that the program can access (this is the “mark” part of mark/sweep). Allocations that couldn’t be reached from the program’s roots are dead and can be reclaimed, in one of several ways. The “sweep” part of mark/sweep conceptually calls free() for each dead allocation. Compaction takes live objects and moves them into the holes left by the dead. Copying collectors start with the heap split into two halves, with all live data in one of the halves. Instead of sweeping or compacting, each live object is moved into the spare half. When all live objects have been copied out, only dead objects remain in the original half, and it can be reused without further processing.
Much of the garbage collection research of the last few decades can be seen as schemes to extend, combine, or refine these basic techniques.
On a uniprocessor, we have a simple mental model for how GC works: the program runs for a while, until it’s in danger of running out of memory. Then the program is paused while the garbage collector runs to reclaim memory, after which the program resumes. The sum of these pauses is how much slower the program gets overall, and corresponds to GC throughput.
A rule of thumb I find useful: when a paper says they improved throughput, what they really mean is they cut down on their collector’s unnecessary work.
In most real-life situations, though, throughput (raw overhead) is not the only metric that matters. You also want to preserve your program’s responsiveness (to mouse clicks, network packets, the audio or graphics subsystems, or whatever else), since pausing your program to run the GC for a long time will result in a frozen GUI, audio glitches, dropped packets/frames, etc. As is often the case in engineering, there’s a tradeoff between these two goals. Techniques that improve throughput usually degrade pause time, and vice versa. Put another way: reducing (worst case) pause time comes at the price of making the program run slower overall.
Another fundamental property of collectors is space overhead. This takes many forms: interior and exterior fragmentation, metadata overhead, and reserved areas of the heap. As is often the case, space trades off with time: larger heaps can be collected less frequently, thus increasing throughput, but each collection of a larger heap takes longer, thus increasing pause time.
To recap: garbage collector designs encounter “classical” engineering tradeoffs in optimizing for average vs worst case performance. The GC community describes the tradeoff with networking-inspired terminology: throughput vs latency. MMU graphs capture this tradeoff and give a visual presentation of collector performance.
Tracing vs Refcounting
One really cool “observation” paper to come out of the memory management community is A Unified Theory of Garbage Collection
by Bacon, Cheng, and Rajan. This is one of my favorite types of research results: bringing together two separate schools of thought with a mix of theoretical and practical insights that’s almost blindingly obvious… once you’ve been told what to look for!
On the theory side, they observe that tracing and refcounting are very nearly precise duals of each other. The former starts with known-live values and assumes everything else is dead until proven otherwise; it does so by incrementing a one-bit count until it is greater than zero. Reference counting starts with known-dead values and propagates the consequences of their death by decrementing word-sized reference counts until they reach zero. Tracing cannot reclaim individual objects without RC-flavored techniques, and reference counting cannot reclaim object cycles without tracing-flavored techniques.
On the practice side, they note that while the simplest implementations of tracing and reference counting have complementary properties, optimizations to either algorithm tend to result in systems that face similar trade-offs. For example, generational and incremental collectors that reduce average pause time need to use write barriers that raise mutator overhead, whereas techniques like deferred reference counting
, which reduce the mutator overhead from barrier-like refcount operations, result in delayed collection and longer GC pauses! The paper also evaluates the storage reclamation in the Log Structured Filesystem
, and how GC tradeoffs in the OS/filesystem context differ from a language runtime’s choices. Like I said, one heck of a paper!
The Alchemy of Garbage Collection
Okay, now we’re ready to start looking at specific refinements of the elemental algorithms. As mentioned above, generational collection is one of the most common refinements to basic tracing. The scheme is motivated by two insights. The first is that programs tend to allocate a lot of temporary data. Generational collectors take the heap and split off a nursery for new objects. When the nursery fills up, it will usually contain mostly dead temporary objects, which are “free” to reclaim with a copying collector. But what if an object in the main heap refers to something in the nursery? This is the second insight: older objects can only point to newer objects via mutation, which is comparatively rare (for older objects). To make sure that a young-generation collection doesn’t miss any objects, a remembered set notes which parts of the nursery appear in the main heap, and allows the nursery collection to proceed without having to scan objects from the main heap. See the Handbook for implementation details.
There are a variety of refinements to reference counting. The most basic is deferred reference counting, which reduces the overhead of RC by ignoring intermediate updates. (This is an instance of the very handy algorithmic trick of epochs
, which pops up
in many subfields
of systems research). Whereas standard RC immediately frees objects (recursively) when their counts reach zero, deferred RC with a zero count table (ZCT) instead says, “Oh, I’ll get to this later…” and stores the object in the ZCT. When it comes time to actually reclaim objects, anything in the ZCT that is not reachable from the program’s roots is garbage. (Note that this means we’ve introduced tracing to our RC scheme!).
Ulterior reference counting
is a variant of generational collection in which the mature space is managed using (deferred) reference counting instead of tracing. When the nursery fills up, the survivors are evacuated to the reference-counted mature space. Reference counts in the mature space are only computed at the end of a nursery collection. The underlying motivation is that tracing and RC have complementary properties which match the demographics of the nursery and old space, respectively. The URC evaluation, done in 2004, shows that their throughput matches a generational mark/sweep collector, while reducing max pause times from roughly 200ms to 50ms in heaps from 20 to 100 MB. URC also generally needed space overhead of less than 80% to avoid thrashing, whereas generational mark/sweep often needed 300% space overhead.
Shahriyar, Blackburn, and Frampton compare an optimized lazy reference counting system
to non-generational mark-sweep, URC and the Immix collector (we’ll get to it soon). The “traditional” reference counting implementation degrades benchmark throughput by 30% on average versus mark-sweep, and their optimizations eliminate the disparity. They only compare throughput, not pause times, mutator utilization, floating garbage, or sensitivity to heap size. The paper provides some interesting statistics on object popularity: on their benchmarks, 99.35% of objects have reference counts that never exceed 3 bits, but the remaining 0.65% account for nearly 23% of reference counting operations.
Terei, Aiken, and Vitek had a different idea: why not let the programmer decide where to use tracing and refcounting within their own program? Their paper highlights memcached
as a (very) realistic example of a program that challenges most GCs, because memcached servers must use nearly all the available physical memory without swapping or pausing 
. Their multi-memory-management
prototype showed that giving the programmer a fine-grained choice between two simple GC algorithms resulted in overall program performance that matched that of (unsafe) manual memory management with custom allocators. The M^3 prototype combines mark-sweep and a reference-counted heap. Swept objects in the traced heap are inspected for pointers into the RC heap, which must be decremented. Mutations to objects in the RC heap must update (insert into or remove from) a remembered set for the traced heap to use. No bound exists on the potential size of the remembered set; it is up to the programmer to choose a type partitioning that doesn’t result in too many cross-heap pointers.
Quick aside: another paper that proposed an extended interface between the programmer and garbage collector is the Data Structure Aware
collector by Cohen and Petrank. The idea is to assume that the contents of container data structures are reachable until the programmer indicates otherwise. This allows them to avoid tracing memory that is unlikely to be reclaimable. In exchange for modifying about 8 lines of code, their technique reduced GC time between 11% and 76% for the various benchmarks they studied.
Hybrids and Heuristics
There are also hybrid schemes that don’t involve reference counting. Both of the following hybrid algorithms successfully combine the large-heap throughput of semispace collection with the small-heap robustness of mark-sweep and the reduced memory requirements of compaction.
Spoonhower, Blelloch, and Harper published a collector design that dynamically switches between copying and mark/sweep collection. Building on Bartlett’s mostly-copying collector, they divide the heap into fixed-size blocks, and use estimates of block residency to determine at collection time whether a given block should be evacuated (which combats fragmentation and is cheap for blocks containing few objects) or promoted in place, which avoids the cost of copying densely-populated blocks.
Their paper characterizes mark-sweep and semi-space collectors as implementing fixed policies for allocation and reclamation. Regardless of block residency, semi-space collectors evacuate all blocks, and mark-sweep promotes all blocks. Likewise, mark-sweep allocation via free lists will use space from all blocks (even heavily fragmented ones), and semi-space allocation with a bump pointer only allocates from completely unoccupied blocks. Their hybrid policy allows, for example, allocation from blocks that are at most 90% full, and evacuation for blocks that are at most 80% full.
Accurately measuring block residency requires a full-heap traversal, which would essentially double the GC’s overhead. Thus they produce an estimate of each block’s residency, by combining object age, previous residency measurements, and static configuration parameters.
is another mark-sweep collector that uses copying to reduce fragmentation. The heap is divided into 32KB blocks, and blocks themselves are composed from 256 lines of 128 bytes each. Different threads allocate from separate blocks, using a bump-pointer allocator within spans of free lines. When marking a block, Immix uses fragmentation (computed from residency statistics) to decide whether to mark/sweep or evacuate a block’s contents. (Compare/contrast: SBH evacuates sparse blocks; Immix evacuates fragmented blocks; the Metronome
collector also defragments blocks, but does so post-sweep, using exact residency statistics, rather than heuristically during or before collection). Subsequent work has produced several variants of Immix, including precise
reference-counting, that match the performance of the original mark/sweep implementation. There’s also an implementation in Rust
An alternate way of looking at generational collectors is through the lens of incrementality. When the heap fills up, we must reclaim enough free memory to satisfy an allocation request. But a full collection of the heap is likely to reclaim far more free space than we need at the moment, and incur an unpleasantly large pause in doing so. For an interactive application, it would be nice to break up this big chunk of work into several smaller chunks. Collection algorithms which can collect a subset of the heap at a time are called incremental. If we squint a bit, generational collectors make the common case (nursery-only reclamation) incremental, but full-heap collections must be done atomically. There has of course been a great deal of work tackling the problem of making full-heap collection incremental.
The Memory-Constrained Copying collector
(MC²) is a block-structured generational collector. It collects a single block at a time in order to avoid the thrashing behavior exhibited by simpler collectors under memory pressure
. Each block in the old space has a logical number associated with it. While marking the old space, the collector constructs unidirectional remembered sets to allow separate collection of blocks in increasing order. Marking also computes the live volume in each block. Marking and copying are both done incrementally. Incremental copying is done when nursery collection happens, and operates on groups
of blocks, chosen to restrict the size of their combined live data and remembered sets. Like many collectors that aim to limit pause times, MC² makes an effort to explicitly balance the rate of allocation and collection. Since the rate of old-generation collection is tied to the rate of nursery reclamation, which is in turn a function of young object survival rate and nursery size, MC² resizes the nursery after every collection. MC²’s evaluation showed that remembered sets stayed below 8% of the live data, and mark queue size didn’t exceed 0.7%. Throughput matched a generational mark-sweep collector, and mean pause times were reduced by an order of magnitude — from 264.68ms to 27.18ms. As an aside, the MC² evaluation includes data for pause time distributions and bounded mutator utilization, in addition to the usual throughput and max pause numbers. I wish every GC paper included such detailed performance data!
Incrementality is a restrained/restricted form of concurrency. Fundamentally, the work done by the garbage collector is being done in multiple steps, with the mutator running in between the steps. This means that the mutator can “see” — and modify — the objects that the collector is inspecting. In turn, we need more sophisticated invariants for the mutator and collector to correctly work together. In a non-incremental mark/sweep collector, there are two states of interest. White objects are dead-until-proven-live, and black objects are proven-live. Marking turns white objects black, and continues recursively until every pointed-to object is black. This establishes an invariant: black objects never point to white objects. At that point, sweeping frees white objects and un-marks black objects, turning them white.
But in an incremental (or concurrent) scheme, by definition the mutator will be running before marking establishes the aforementioned invariant. The danger is that the mutator’s action — for example by storing a reference to a white object into a field of a black object — can violate the invariant the collector needs to do its job. The solution is to introduce a new state: gray, which is an object that the collector has marked, but which might contain pointers to non-black objects. Gray essentially means “TODO.” Now, when the mutator stores a white reference into a black object, we can preserve the so-called strong tricolor invariant in one of two ways: either by marking the white object, or turning the black one back to gray.
As the name hints at, there is an alternative called the weak tricolor invariant. The basic intuition is that having a black object point to a white one is OK, as long as some other gray object exists that will eventually turn the white object black. The weak invariant is, in some sense, a dual of the strong one; preserving the strong invariant must block “creation” of white pointers in black objects, whereas the weak invariant focuses on deletions of white pointers from non-black objects.
algorithm is a clean, modern compacting design. It makes clever use of mark bitmaps to enable incremental and parallel compaction. The basic problem is: once marking completes, the mark bitmap shows which objects are live and which are dead, but there’s no indication of where in the heap each live object should be moved to. Here is Compressor’s trick: first, divide the heap into small segments (on the order of 512 bytes). The volume of live data in each segment can be calculated in parallel by inspecting the mark bitmap. A parallel scan
then quickly computes the starting point for the compacted data in each segment. Given this preprocessed offset table, the final position of each live object in each segment can be easily computed. In addition, the inverse mapping can be easily constructed, giving the from-space range that will map to each to-space segment. Unlike most other compaction algorithms, Compressor runs in one pass, without requiring extra per-object metadata such as forwarding pointers. One last trick: Compressor identifies dense blocks and doesn’t waste effort compacting them (just like MC²).
The concurrent variant of the Compressor is based on virtual memory protection. After concurrently computing the offset tables, the to-space pages are protected, and the mutators stopped to have their roots point to the protected to-space pages. When the mutators resume, the resulting VM traps trigger incremental concurrent compaction. In terms of the tricolor invariant, unprotected tospace pages are black as a result of compaction. Protected tospace pages are gray, and fromspace is white. The mutator is gray when it operates in fromspace, so it can deal unobstructed with white objects; when its roots are redirected to tospace, the mutator turns black.
Scalable Garbage Collection
I learned the other day that Google’s Compute Engine lets you provision memory for virtual machines independently of core count. So if your two core/four GB VM is running out of memory, you can turn it into a two core/16 GB VM, for about… four cents an hour. Pretty nifty! But this leads to an obvious concern: if we increase our heap sizes (and live data) fourfold, won’t our worst-case GC pause time also rise along with it? Stop-the-world pauses would take minutes to hours (!!!) on very large heaps
. That’s obviously pretty bad; what can we do to avoid it? Can we limit the GC’s maximum pause time and space overhead, independent of heap size and mutator behavior? Felix Klock
(who now works on Rust at Mozilla!) studied this
in his super awesome dissertation
The key thing is: we mustn’t ever need to traverse the full heap all at once. Reference counting can satisfy this property, but needs tracing to be complete, and also suffers from fragmentation, meaning space overhead is mutator-dependent and difficult to bound. (Incidentally, Firefox’s GC implementation has evolved from stop-the-world to incremental
; the latter step was entirely to reduce fragmentation).
So to avoid fragmentation and give reasonable space bounds, we’ll need to move objects.
Well, OK, but some objects are super popular — a significant fraction of the heap is just copies of a pointer to that object. Wouldn’t moving those objects mean we need to traverse most of the heap to update the pointers? Indeed, but there are two ways out of that predicament. One is to add a layer of indirection with a read barrier; this is what most real-time collectors do. Klock had a different insight: you can just elect not to move the popular objects! After all, if you’re worried about objects that might take up, say, 1/8th of your heap, well, there can’t be more than 8 of those! Defining popularity in terms of max heap size is no bueno, though: we mustn’t let the size of the heap dictate our worst-case workload. Instead, the trick is to divide the heap into fixed-size regions, and define the popularity threshold as a multiple of the region size, rather than (a fraction of) the heap size. But the insight remains the same: your program doesn’t execute in Lake Wobegon
, and not all of the regions can be more popular than average.
Having divided up the heap into regions, we must be able to collect a single region independently of the others; this is how Klock achieves scalability. Simple to state, but the details get pretty hairy; one false move and we lose scalability. To collect a region, we have to know what objects in it are referenced by other regions — a remembered set. BUT WAIT! There are two kinds of remembered set representations, points-into and points-out-of; the former has worst-case quadratic space usage, so remembered sets must be points-out-of. BUT WAIT! Even that’s not enough; the aggregate of all remembered sets can still scale with heap size, which would mean that collecting a single region would again involve scanning O(heap) data. The solution is to have a points-into summary set for each non-popular region; summary set sizes are bounded by the popularity threshold, which is independent of heap size. Ah, BUT WAIT! to construct a summary set, we must scan the full-heap remembered set! Now we get into issues of scheduling: first, amortization: each pass over the remembered set tries to summarize a fixed fraction of the heap’s regions (abandoning the popular ones); and second, we can be a bit lazy: summarization only needs to happen when we’re running out of regions with usable summary sets, and can end early when “enough” regions have been summarized. BUT WAIT! If we end early, the mutator can invalidate the summaries we so lovingly crafted. We can keep the summary sets up to date with write barriers (careful to keep the work O(1) amortized!). BUT WAIT! If summary sets grow too large they must be discarded, which means that the rate of collection must roughly keep pace with the rate of program mutation. BUT it also mustn’t go “too fast” or we’d jeopardize the minimum mutator utilization guarantee that scalable collection promises.
Well, I haven’t even covered all the interesting details, like the mark stack and snapshotting, but I hope you get the idea of why I think Klock’s work is an undeservedly-obscure tour-de-force. Experimentally, Klock shows that regional collection’s throughput is only a bit lower than a generational collector, even though worst case pause times (and thus MMU) is greatly improved. The comparisons to other collectors illustrate many of the high-level tradeoffs between latency & throughput, and between guaranteed vs expected-case performance. For example, on one benchmark in a 2GB heap, the regional collector’s max pause time is 0.12s, while the G1 collector paused for 2.13s. But the G1 collector finished in 154s total, and the regional collector took 757s. The worst-case observed pause for the scalable collector on any benchmark was 240ms.
Hans Boehm published a tech report connecting scheduling to lazy reference counting
. He observed that a consequence of deferred (“lazy”) RC is an increase in floating garbage, and presents an adversarial program that causes a lazy RC implementation to retain a quadratic amount of floating garbage. The solution he presents is to permit O(n) reclamation work to match an allocation of size n. This is strongly reminiscent of Klock’s observation that the collector must “keep pace” with the mutator.
OK, time for a quick detour on conservative collectors. I mentioned at the start of the post that garbage collection starts from roots: in particular, from registers and the stack. Often the compiler will emit data structures to help the runtime system figure out what registers and stack slots contain heap pointers, and which do not. Likewise, individual objects in the heap will be given a descriptor indicating which fields should or should not be traced. But sometimes people want to use GC with a compiler that doesn’t know anything about GC, like Clang or GCC. And it can be done, with a few caveats!
The core idea is to assume that anything that looks like a pointer is a pointer. This implies that conservative collectors can’t move objects, since they don’t know for sure whether they have an actual pointer or an integer that happens to look like a pointer. And even if it would be fun and easy, the collector shouldn’t go about tweaking the mutator’s hard-earned values.
One nifty trick is that conservative collectors can actually move objects in many circumstances. If the compiler provides just a tiny bit of information to the runtime about the layout of heap objects, you get a heap-precise
design. Basically, pointers in registers or the stack are treated conservatively, but pointers within the heap can be positively identified, and thus moved/compacted/etc. Really, most of the issues and design decisions for non-conservative (aka fully-precise) collectors also apply to their conservative brethren. For example, Boehm-Demers-Weiser
, the most well-known conservative collector, can run incrementally and generationally. And Ironclad C++
is an extension to Clang that supports heap-precise GC.
GC Case Study: Go
Note that Go’s garbage collector doesn’t deal with fragmentation. This may or may not be an acceptable engineering tradeoff for Go’s use cases.
On Bump Pointer Allocation
Many discussions about garbage collection eventually touch on the benefits of bump-pointer allocation. Intuitively, it seems like it should be much faster to just do an increment and conditional jump, rather than searching through a bunch of freelists. Case in point: the Hacker News discussion
regarding the aforementioned Twitch blogpost. Surprisingly, this turns out to be a vastly smaller difference than most people would think. Measurement from Blackburn et al
showed that allocation was < 10% of program runtime, and the advantage of bump-pointer was about 10% faster; therefore the overall throughput advantage is about 1% of program runtime. That’s about the same performance change to be had from passing extra unused environment variables to your program
There’s at least one study
comparing the throughput of malloc/free versus garbage collection. The result found was that if you have a lot of free memory to work with (about 3x your live data), tracing GC is just as fast as malloc/free, but at smaller heap sizes, you spend a lot of time re-inspecting your live data to determine what’s dead. In effect: when free memory is limited, GC appears to have a soft failure case (slowing down) that malloc/free avoids. But that’s not really the whole story. These benchmarks are fundamentally batch workloads that run for anywhere from seconds to hours, then exit. But on the web, both servers and browsers can run continuously for days, months, or years at a time 
. On such long-term workloads, there’s another factor that comes into play: fragmentation. Any non-moving collector (malloc/free, reference counting, and mark/sweep) will hit some amount of fragmentation. As a failure mode, fragmentation’s costs manifest in two ways. First, there’s a “global” program slowdown from increased effective working set & decreased locality; second, if fragmentation gets bad enough, allocation will eventually fail, and your program will crash (or, uhm, corrupt data and then crash).
So if you want to not crash, you need to size your heap as some multiple of your live data. How much fragmentation do you need to account for? There are a few research results on the theory side. First, Robson proved in the 1970s that the theoretical worst-case is logarithmic in the size of the maximum allocation. That’s pretty atrocious; even supporting piddly 1MB arrays means potential overhead of 20x! More recently, Cohen and Petrank
showed that compaction can drop the needed overhead in (one particular set of) realistic circumstances to… 3.5x. Amusingly similar to the numbers cited by Hertz and Berger!
As for practice, there is a smattering of research with empirical measurements of memory fragmentation. Johnstone and Wilson
showed that on a selection of 8 programs that allocated 27 MB on average, Doug Lea’s malloc implementation limited fragmentation to 4% in the worst case 
. This is a good result to know, but I’m not confident it can be usefully extrapolated to long-running web servers or browsers, which (A) allocate several orders magnitude more data, and (B) have nondeterministic allocation patterns, dependent on network and/or user input. Supporting this skepticism is a study from Bohra and Gabber
measuring several malloc implementations on a long-running caching web proxy. They found that fragmentation varied by allocator from 30% to heap exhaustion (dlmalloc clocked in at 80% overhead). They also noted that the impetus of their study was noticing that their heap was several times larger than their live data. Bacon, Cheng, and Rajan
echo this limitation in their summary of J&W. They also show that fragmentation in SPECjvm98 can reach nearly 40%. More recently, Jula and Rauchwerger
benchmarked larger, more intensive workloads on a few different allocators, finding that (A) dlmalloc had 10.8% average and 23.5% worst cast fragmentation, and (B) faster allocation came at the expense of more fragmentation.
Real World Fragmentation
Most of the time, people don’t provide much in the way of hard data on what sort of fragmentation they’re seeing — not even heap sizes, much less allocation size distributions. When people do provide numbers, they tend to be eye-opening. Aerospike reported that “big-memory (192 GB RAM) server nodes were running out of memory and crashing again within days of being restarted.” In 2012, Kirill Korotaev exhibited a ~20 line program
, distilled from a real app, that caused 20x (!) fragmentation overhead in glibc’s allocator, needing 1.8 GB for a 93 MB live set.
Pushing the Envelope
We’re getting close to the end, I promise! Let’s wrap up with brief descriptions of a few interesting collectors that have pushed the boundaries of real-time and concurrent collection.
A fundamental problem faced by all garbage collectors is controlling fragmentation. Most non-real-time collectors take one of two approaches:
- Ignore the issue by not moving objects, hoping that the mutator doesn’t get unlucky and induce fragmentation that free-block-coalescing cannot handle; or
- Move objects, but in doing so, trade space- for time-problems (in a sense) and accept that the copying/defragmentation process may need to block the mutator while it clears out a fragmented section of the heap.
The Schism collector
is one of a small set of real-time collectors that takes a different tack. It eliminates external fragmentation by using fixed size chunks for all allocation requests. Objects larger than the chunk size are represented with a linked list of chunks. Arrays are represented with two layers of indirection, as a fixed-size object header pointing to a variable-length spine which contains pointers to the fixed-size chunks holding the array elements. And in the common case where contiguous space is available, Schism can use it for arrays, thus avoiding the overhead of spine indirections. Object fragments and array data chunks are managed in a non-moving, concurrently-collected Immix-structured heap.
Array spines are the only variable-size allocations, and thus the only possible source of fragmentation. But here’s a cool trick: since the data blocks never move, the pointers in the spine never change, and the spine is thus immutable. This in turn means that when spines are copied, the mutator is free to use either copy. Furthermore, since the mutator is forced to indirect to the spine through a so-called sentinel object, updating the mutator takes only a single compare-and-swap.
The benefits provided by this scheme are:
- Bounded fragmentation: Schism will never need more than 31.04% extra space beyond that required for live data (this does not account for internal fragmentation).
- Constant time wait-free heap access for the mutator.
- Sub-millisecond pause times on a 40 MHz processor
- Excellent throughput for a real-time collector, only 1/3rd lost versus the desktop Hotspot 1.6 JVM.
Since Java doesn’t support interior pointers, neither does their collector, although I believe that interior pointers for object fields would be an easy extension. Support for interior pointers within arrays would be trickier but still doable, at least in certain circumstances.
Building on their earlier work on lock-free concurrent compaction
, Erik Österlund and Welf Löwe added a syscall to let the GC introspect on the OS scheduler’s thread state, which allows them to perform critical operations — stack scanning and copying
— without ever blocking mutator threads. Memory management for concurrent data structures
with progress guarantees is a hard open problem, and this line of work makes a significant step towards a solution. Their modifications to OpenJDK’s G1 collector reduced 99.99th percentile latency from 174ms to 0.67ms for one of their benchmarks. Overall, average GC pause time dropped 8-fold, at the cost of 15% degraded throughput.
One important point they make is that GC algorithms which use virtual memory (mprotect) can’t be non-blocking, because the page tables can only be modified when the kernel holds a non-preemptible lock. I don’t have much intuition for the costs of page table modifications. It would be very interesting to see what impact that source of blocking would have on an otherwise non-blocking GC design. Of note: Pauseless, C4, Collie, and Concurrent Compressor all use mprotect. Shenandoah eschews mprotect to avoid read storms.
Shenandoah is an effort to produce an open-source, industrial strength low-pause-time garbage collector, as part of the OpenJDK project. Broadly speaking, it aims to be a fully-concurrent and -parallel mark-copy collector, using both read & write barriers to enable concurrent compaction.
As far as I know, there haven’t been any academic publications on Shenandoah yet
Christine Flood recently published a paper on Shenandoah
; there’s also a smattering
of blog posts
and slide decks
from presentations. Shenandoah is available in the default JVM distributed with Fedora 24. It would be very interesting to see a head-to-head comparison of Shenandoah with Österlund & Löwe’s OS-dependent G1 variant.
Pauseless and C4
In 2005, Azul Systems published a paper
detailing the garbage collection algorithm, and associated support from their custom CPU and OS, that allowed them to obtain low pause times (21ms) on very large (50+ GB) heaps. The state of the art until the early 2000’s was 10-20% maximum runtime overhead from software read barriers, with average-case overhead of roughly half that. But Azul’s hardware-supported read barrier has an average cost of one cycle(!), and overall mutator utilization was reported at 98%.
In contrast to Compressor, which protects (all) tospace pages to trigger evacuation, Pauseless protects fromspace pages (incrementally). Pauseless continuously runs a concurrent and parallel mark phase (like G1), which generates mark bits and computes forwarding information (on-the-side). Using residency statistics, Pauseless protects sparsely occupied pages. Such pages are evacuated, and the underlying physical memory immediately freed. There will still be references in the remainder of the heap pointing into evacuated pages; the mutator and collector cooperate to find and correct such pointers.
When the mutator loads a reference to an unmarked object, a hardware trap notifies the marker threads. When the mutator loads a reference to an object on a protected (and thus, likely evacuated!) page, the forwarding tables are queried to find the fresh copy, and the memory containing the original stale reference is updated (the paper refers to this as “self healing”). The collector finishes a GC cycle by traversing the entire heap to find any remaining pointers into fromspace, and fixing up the stragglers. Now, having established the invariant that all pointers in the heap point into tospace, the virtual memory for fromspace pages can be reused.
C4, the Continuously Concurrent Compacting Collector, adds generations to Pauseless. Unlike most generational designs, the nursery is not managed with a stop-the-world copying collector. Rather, both the nursery and mature space are essentially separate Pauseless heaps, with a precise card marking remembered set. They added new Linux system calls to support batched and non-TLB-flushing remappings, yielding order-of-magnitude speedups versus the stock kernel. The upshot is that C4’s throughput matches ParallelGC
in Java while maintaining small-in-practice pauses.
Azul’s GC required a read-barrier (some instructions to be executed on every pointer read) – which cost something like 5% performance on an X86. In exchange max GC pause time is something in the low microsecond range
and noted that they didn’t make use of any novel x86 features:
No change to the X86, instead user-mode TLB handler from RedHat allows ptr-swizzeling, that plus some careful planning and the read barrier fell to 2 X86 ops – with acceptable runtime costs.
The core idea behind Collie
, which builds on Pauseless
, is to use hardware transactional memory to atomically and incrementally compact objects and update all incoming pointers. Obviously this cannot be done for popular objects, so popular objects are temporarily pinned, with the mutator and collector collaboratively updating popular references. There are of course many more details covered in the paper.
Like the other collectors from Azul, Collie is an engineering marvel. It pulls out pretty much every trick in the book: read barriers, write barriers, non-global checkpoints, partial conservativism, virtual memory and HTM support. Like all good engineering, Collie makes careful tradeoffs, in exchange for desirable results: Collie delivers an absolutely staggering 80-90% MMU (!) at 1ms resolution (!!) for heap sizes up to 30 GB (!!!) when continuously collecting. One issue I wish the paper had covered more was the effective space overhead of their per-object remembered sets; it’s unclear whether the space overhead for their MMU benchmarks was 50%, or 400%, or something completely different.
Seven thousand words is still a “whirlwind,” right?
There’s one more quirk of mutator utilization, which is that it can be “inflated” by making the mutator less efficient. This can occur for two reasons. First, different language implementations have different “baseline” levels of generated-code quality, so the same GC implementation will appear to have a better MMU for a slower language. Second, the overhead of high-resolution measurement with tools like Pin can introduce similar effects. But in practice that’s not an issue most people need to worry about.
To better match the human intuition that the mutator’s utilization doesn’t actually get worse at larger intervals, we can instead focus on bounded mutator utilization (BMU). Numerically, BMU at timescale x is MMU at any timescale x or larger. Graphically, BMU is the monotonically increasing graph you get from MMU with the “peaks” flattened out.
They also point out that there’s no single definition for how to measure fragmentation, and highlight a few potential definitions, each with various shortcomings.
Anyways, I digress. Around 2004, when I was in high school, I was a user of Firefox nightly builds. This was before the existence of extension versioning or addons.mozilla.org. The closest equivalent to AMO was mozdev, but that’s a bit like comparing the App Store to github. The mozillazine forums were the watering hole for users, extension developers, and browser core devs alike. At one point, a nightly build broke my favorite plugin, Adblock. After a shockingly long time (two whole days!) went by without a fix by the extension’s author, I decided to poke around and see if I could hack together a fix myself. Well, it worked; after a few more such tweaks, I wound up on an IRC channel with the project’s then-current maintainer/developer, who called himself “rue.” We spent most of the next year collaborating on the Next Big Release of the extension.
Unfortunately, we got sucked into Second System Syndrome, without having had the benefit of having fully built the first version ourselves. Basically, every rule of thumb that Joel Spolsky wrote about, we violated (unintentionally). We tried to rewrite the technical infrastructure, while simultaneously adding many ambitious new features. We were aiming for an Apple-style Grand Curtain Drop on a shiny new toy. Tempting, but foolish. While we were toiling away in secret, pressure was mounting for something, anything, to improve upon the actively-bitrotting public version. Eventually, a previous developer of Adblock, Wladimir Palant, returned to the scene and forked the project to create Adblock Plus.
One interesting side effect of an extremely long development cycle is that it allows for meandering diversions from the yellow brick road. The other developer, rue, found that Firefox extensions could alter the requested URL without having the alteration appear in the browser’s address bar. This held the potential for a great deal of mischief. On the more innocuous end, an April Fools Day prank might have reversed the URLs for NBC.com and ABC.com, or NFL/NHL, etc. On the less innocuous end, rue prototyped a module that would, randomly with small probability, replace Amazon referral codes. It would have been the Office Space penny-shaving scheme, more or less. I objected on grounds of skeeviness and/or illegality, but it was a moot point, since we never released a completed version. Adblock Plus ended up taking a bolder, more direct approach to monetization.
Anyways, while rue was busy poking at Firefox internals and de-obfuscating Gmail’s client-side code, I amused myself by devising an easter egg, and trying to conceal its presence in a non-obvious way. When it comes to such things, Firefox extensions have a distinct advantage over traditional software: they can put UI on a web server, so that a review of the extension’s source will see, at most, an innocuous-looking URL. While normal content served over the web is sandboxed, extensions can collude with a page to do things that a normal page cannot do. So I made a little recreation of the “Wake up, Neo…” scene from the Matrix.
Here is the easter egg in action; you’ll need to view it fullscreen to see the text:
If you have the sound up high enough, you’ll hear a clicky noise when each green character is “typed.” It’s actually the same noise used in the original movie! That’s called craftsmanship. 😉 One other subtle point: Adblock’s UI is in the lower-right corner, as a statusbar item. The modern convention of using toolbar logo buttons wasn’t yet established at that time.
Had anyone used a packet sniffer or a lucky guess to figure out the URL for the easter egg and visited it directly, they would have seen this instead:
The easter egg also altered Adblock’s entry in the Extensions list, changing the version from 0.6 to “The 1.0”, replacing the crosshair-style logo to an image of Agent Smith, and altering the description to include parodies of quotes from Agent Smith and Morpheus.
Final useless random historical fact: Adblock was the 10th extension added to AMO. I have no idea how many there are now, probably in the mid thousands.
LLVM’s documentation suggests that its
getelementptr instruction is one of the most commonly misunderstood aspects of the system. In my experience, GEP is only number two on the list; number one is LLVM’s GC support: what it provides, what it doesn’t provide, and how to use it effectively.
The first two are the easy ones: LLVM does support stack maps, read and write barriers, fully-precise collection, and relocating collectors. It does not require a shadow stack, though it does provide one, which is helpful for first-time GC clients to get started with. LLVM’s overall GC design supports concurrent collectors and mutators, but the implementation lacks a few key ingredients to minimize the performance impact of GC in such situations.
There are a few missing features on the implementation side which obviously don’t need major design overhauls to fix. The big three on my list are code emission at safe points, safe points on back edges, and support for register-allocation of GC roots.
The last one is sort of interesting, because I’ve seen several proposals to change the design of LLVM’s GC infrastructure in a major way to obtain it. The reasoning goes like this: we currently have
llvm.gcroot, which marks a stack slot address as being a GC root. But we want to store roots in machine registers, not stack slots, so we need a new intrinsic to annotate an SSA register as being a GC root as well. With such an intrinsic, LLVM would automatically store and load SSA registers to/from stack slots “behind the scenes.”
So, for example, suppose we start off by acquiring a GC-traced pointer:
%root = alloca (gcroot)
%p = … allocate a pointer …
store %p into %root
If we had a non-copying collector, we could get away with code like this:
%r = … maybe trigger GC…
… use %p …
But with LLVM’s current GC support, a copying collector must write this less-legible IR instead:
store %p into %root
%r = … maybe trigger GC…
%p2 = load %root
… use %p2 …
When repeated over and over, it definitely makes the IR harder to read. Wouldn’t it be nice if LLVM would automatically do all that stuff for you? Like this:
… (no alloca needed!)
%p = … allocate a pointer …
%r = … maybe trigger GC…
… use %p …
It’s a natural desire; the first time I encountered LLVM’s GC support, I wanted the same thing. But it’s a wolf in sheep’s clothing. Remember: the fundamental property of LLVM is that an SSA register is an immutable value. Not just “observably equivalent as far as my high-level program can see,” but equivalent under any observation possible at the IR level!
The problem is that under the hypothetical
gc.reg.root scheme, LLVM is free to do constant propagation before the rewrite to the verbose version, and if it does so, then the program’s semantics will have changed! The problem is that
%p has been transmuted from a constant to a variable by virtue of being marked as a register root. But in doing so, we’ve taken the key benefit of LLVM’s SSA representation—the power its gets thanks to immutable bindings—and thrown it right out the window!
Take another look at the “verbose” pseudo-IR listed above. The common interpretation of this snippet is “spill
%p into a stack slot, perform some operation, and then reload
%p from the stack slot.” This is reasoning about the IR according to its current implementation. But I’d argue that it’s better to think about the abstract semantics of the IR. Abstractly, we’re just encoding the fact that the value we think of as
%p might change at some point in the IR; beforehand, we called it
%p, and afterwards, we called it
%p2. Since phi nodes aren’t really viable for most frontends, the only way LLVM allows us to encode mutability is via memory – i.e.
alloca. But the real point isn’t to set up stack slots, it’s just to reflect pointers updates that the GC might be making. As long as LLVM can verify that the “stack slot” address doesn’t escape, it is free to back that mutable storage by symbolically addressed memory: namely, machine registers! It won’t do that analysis now, but it could. That’s the point: no major design tweaks are needed to get register allocation of GC roots, just a bit of extra analysis internal to LLVM.
A separate question is how much of a difference this would make in practice. Effectively, the lack of register allocation means that GC roots are stored in caller-saved machine registers; maybe that’s a performance-killer, or maybe it’s a wash. To the best of my knowledge, nobody has ever carefully looked at the impact of naïve versus varying degrees of sophisticated treatment of GC roots in LLVM-compiled code.
Stepping back even more, I think the reason there’s so much confusion about LLVM’s GC support is that there’s a philosophical impedance mismatch with the rest of the project. Elsewhere, LLVM tries hard to allow unsophisticated frontends to get equivalent code quality as a more sophisticated frontend. The canonical example is that with mem2reg, a frontend needn’t worry about SSA conversion or dominance frontiers or anything like that. In general, LLVM’s philosophy is to move as much work from the front-end to the common LLVM infrastructure as possible. But LLVM’s GC support is different. There’s no automatic transformation from “dead simple” to “efficient” use of GC primitives. Instead, it’s up to each client to do whatever analysis is needed for the most efficient use of the existing GC-related primitives. In practice, this means that clients need to do data-flow analysis if they want to reuse dead root slots. This is in stark contrast to the negligible level of Dragon-book sophistication otherwise expected of a front-end.
In terms of how to use LLVM’s GC support effectively, I’ve found it clearest to think of the process as a transformation between separate intermediate languages. The goal of the transformation is to add a minimal selection of root slots, along with just enough loads and stores to preserve correctness. The crucial invariant is that at each potential GC point, the live roots must reside in (abstract) memory. So: backwards analysis to determine liveness, and a forwards rewrite to add the loads and stores. Interestingly, this combination of a backwards analysis with a forwards program rewrite is not well-supported by existing dataflow tools or their underlying theoretical frameworks. Anyways, to implement the given invariant, you’ll need to know which types require tracing, and which operations may end up invoking the GC. You can sort of encode this information via metadata and pointer address spaces, if using LLVM IR as your source IR, but I think it’s easier to think of the two IRs as being completely separate.