Summarizing Garbage Collection

This post is intended to give a whirlwind[0] tour of what I think are some highlights of the last decade-ish of the research literature on garbage collection. If you’re interested and want to learn more, the Garbage Collection Handbook (via Amazon) is an indispensable resource. Ravenbrook’s Memory Management Reference is also quite nice!
There are four “elemental” algorithms for garbage collection: reference counting, mark/sweep, compacting, and copying. The latter three are referred to as tracing collectors.

Reference counting tracks how many copies of each pointer there are in the heap. When the heap contains zero copies of a pointer, the memory it points to can be reclaimed.

The three tracing techniques all start the garbage collection phase by iteratively following the pointers stored in registers, the stack, and globals (called roots) to determine the live subset of allocated data that the program can access (this is the “mark” part of mark/sweep). Allocations that couldn’t be reached from the program’s roots are dead and can be reclaimed, in one of several ways. The “sweep” part of mark/sweep conceptually calls free() for each dead allocation. Compaction takes live objects and moves them into the holes left by the dead. Copying collectors start with the heap split into two halves, with all live data in one of the halves. Instead of sweeping or compacting, each live object is moved into the spare half. When all live objects have been copied out, only dead objects remain in the original half, and it can be reused without further processing. 

Much of the garbage collection research of the last few decades can be seen as schemes to extend, combine, or refine these basic techniques.
Describing Garbage Collector Performance
On a uniprocessor, we have a simple mental model for how GC works: the program runs for a while, until it’s in danger of running out of memory. Then the program is paused while the garbage collector runs to reclaim memory, after which the program resumes. The sum of these pauses is how much slower the program gets overall, and corresponds to GC throughput.
A rule of thumb I find useful: when a paper says they improved throughput, what they really mean is they cut down on their collector’s unnecessary work.
In most real-life situations, though, throughput (raw overhead) is not the only metric that matters. You also want to preserve your program’s responsiveness (to mouse clicks, network packets, the audio or graphics subsystems, or whatever else), since pausing your program to run the GC for a long time will result in a frozen GUI, audio glitches, dropped packets/frames, etc. As is often the case in engineering, there’s a tradeoff between these two goals. Techniques that improve throughput usually degrade pause time, and vice versa. Put another way: reducing (worst case) pause time comes at the price of making the program run slower overall.
But max pause time by itself can be misleading! A series of short GC pauses with only a few mutator instruction between each pause will usually “feel” like one big pause. Mutator utilization better captures the notion of “responsiveness”. The minimum mutator utilization (MMU) is a number that says, at a particular timescale, what fraction of that time was spent executing non-GC code in the worst case. MMU is usually presented s a graph: the x-intercept is the largest pause time, and the y-intercept is the overall throughput; the shape of the graph in between tells you the mutator’s worst-case responsiveness at different timescales.
Screenshot 2016-09-30 19.21.53.png

MMU example graph. The purple stripes represent GC pauses; note the “sawtooth edge” of the graph

One counterintuitive consequence of the above definition is that MMU sometimes goes down (!) at larger timescales [1]. Suppose the GC introduces a single pause of 10ms. Then MMU at 10ms is zero, and MMU at 20ms is 50%, and MMU at 10s is 99.9%. If the GC introduces a second pause, 90 ms after the first one, then the MMU at 90ms is 88.8%, but the MMU at 100ms is only 80%, since there is a 100ms window which includes two 10ms pauses [2].
Another “gotcha” when evaluating MMU is that it’s very difficult to account for every last bit of overhead at sub-millisecond granularity. To the best of my knowledge, nobody has ever explicitly factored the effect of GC-introduced cache misses or minor page faults into their MMU calculations. Capturing such events is possible-but-hard; figuring out which are “the GC’s fault” is trickier. And it can be an issue: Click et al observed that, for some concurrent collectors, simply summing the “big” pauses led to reported overhead that was one sixth of the true figure!
I wrote up a little mutator utilization grapher that you can play around with if you’re curious.
Another fundamental property of collectors is space overhead. This takes many forms: interior and exterior fragmentation, metadata overhead, and reserved areas of the heap. As is often the case, space trades off with time: larger heaps can be collected less frequently, thus increasing throughput, but each collection of a larger heap takes longer, thus increasing pause time.

To recap: garbage collector designs encounter “classical” engineering tradeoffs in optimizing for average vs worst case performance. The GC community describes the tradeoff with networking-inspired terminology: throughput vs latency. MMU graphs capture this tradeoff and give a visual presentation of collector performance.
Tracing vs Refcounting
One really cool “observation” paper to come out of the memory management community is A Unified Theory of Garbage Collection by Bacon, Cheng, and Rajan. This is one of my favorite types of research results: bringing together two separate schools of thought with a mix of theoretical and practical insights that’s almost blindingly obvious… once you’ve been told what to look for!
On the theory side, they observe that tracing and refcounting are very nearly precise duals of each other. The former starts with known-live values and assumes everything else is dead until proven otherwise; it does so by incrementing a one-bit count until it is greater than zero. Reference counting starts with known-dead values and propagates the consequences of their death by decrementing word-sized reference counts until they reach zero. Tracing cannot reclaim individual objects without RC-flavored techniques, and reference counting cannot reclaim object cycles without tracing-flavored techniques.
On the practice side, they note that while the simplest implementations of tracing and reference counting have complementary properties, optimizations to either algorithm tend to result in systems that face similar trade-offs. For example, generational and incremental collectors that reduce average pause time need to use write barriers that raise mutator overhead, whereas techniques like deferred reference counting, which reduce the mutator overhead from barrier-like refcount operations, result in delayed collection and longer GC pauses! The paper also evaluates the storage reclamation in the Log Structured Filesystem, and how GC tradeoffs in the OS/filesystem context differ from a language runtime’s choices. Like I said, one heck of a paper!
The Alchemy of Garbage Collection
Okay, now we’re ready to start looking at specific refinements of the elemental algorithms. As mentioned above, generational collection is one of the most common refinements to basic tracing. The scheme is motivated by two insights. The first is that programs tend to allocate a lot of temporary data. Generational collectors take the heap and split off a nursery for new objects. When the nursery fills up, it will usually contain mostly dead temporary objects, which are “free” to reclaim with a copying collector. But what if an object in the main heap refers to something in the nursery? This is the second insight: older objects can only point to newer objects via mutation, which is comparatively rare (for older objects). To make sure that a young-generation collection doesn’t miss any objects, a remembered set notes which parts of the nursery appear in the main heap, and allows the nursery collection to proceed without having to scan objects from the main heap. See the Handbook for implementation details.
There are a variety of refinements to reference counting. The most basic is deferred reference counting, which reduces the overhead of RC by ignoring intermediate updates. (This is an instance of the very handy algorithmic trick of epochs, which pops up in many subfields of systems research). Whereas standard RC immediately frees objects (recursively) when their counts reach zero, deferred RC with a zero count table (ZCT) instead says, “Oh, I’ll get to this later…” and stores the object in the ZCT. When it comes time to actually reclaim objects, anything in the ZCT that is not reachable from the program’s roots is garbage. (Note that this means we’ve introduced tracing to our RC scheme!).
Ulterior reference counting is a variant of generational collection in which the mature space is managed using (deferred) reference counting instead of tracing. When the nursery fills up, the survivors are evacuated to the reference-counted mature space. Reference counts in the mature space are only computed at the end of a nursery collection. The underlying motivation is that tracing and RC have complementary properties which match the demographics of the nursery and old space, respectively. The URC evaluation, done in 2004, shows that their throughput matches a generational mark/sweep collector, while reducing max pause times from roughly 200ms to 50ms in heaps from 20 to 100 MB. URC also generally needed space overhead of less than 80% to avoid thrashing, whereas generational mark/sweep often needed 300% space overhead.
Shahriyar, Blackburn, and Frampton compare an optimized lazy reference counting system to non-generational mark-sweep, URC and the Immix collector (we’ll get to it soon). The “traditional” reference counting implementation degrades benchmark throughput by 30% on average versus mark-sweep, and their optimizations eliminate the disparity. They only compare throughput, not pause times, mutator utilization, floating garbage, or sensitivity to heap size. The paper provides some interesting statistics on object popularity: on their benchmarks, 99.35% of objects have reference counts that never exceed 3 bits, but the remaining 0.65% account for nearly 23% of reference counting operations.
Terei, Aiken, and Vitek had a different idea: why not let the programmer decide where to use tracing and refcounting within their own program? Their paper highlights memcached as a (very) realistic example of a program that challenges most GCs, because memcached servers must use nearly all the available physical memory without swapping or pausing [3]. Their multi-memory-management prototype showed that giving the programmer a fine-grained choice between two simple GC algorithms resulted in overall program performance that matched that of (unsafe) manual memory management with custom allocators. The M^3 prototype combines mark-sweep and a reference-counted heap. Swept objects in the traced heap are inspected for pointers into the RC heap, which must be decremented. Mutations to objects in the RC heap must update (insert into or remove from) a remembered set for the traced heap to use. No bound exists on the potential size of the remembered set; it is up to the programmer to choose a type partitioning that doesn’t result in too many cross-heap pointers.
Quick aside: another paper that proposed an extended interface between the programmer and garbage collector is the Data Structure Aware collector by Cohen and Petrank. The idea is to assume that the contents of container data structures are reachable until the programmer indicates otherwise. This allows them to avoid tracing memory that is unlikely to be reclaimable. In exchange for modifying about 8 lines of code, their technique reduced GC time between 11% and 76% for the various benchmarks they studied.
Hybrids and Heuristics
There are also hybrid schemes that don’t involve reference counting. Both of the following hybrid algorithms successfully combine the large-heap throughput of semispace collection with the small-heap robustness of mark-sweep and the reduced memory requirements of compaction.
Spoonhower, Blelloch, and Harper published a collector design that dynamically switches between copying and mark/sweep collection. Building on Bartlett’s mostly-copying collector, they divide the heap into fixed-size blocks, and use estimates of block residency to determine at collection time whether a given block should be evacuated (which combats fragmentation and is cheap for blocks containing few objects) or promoted in place, which avoids the cost of copying densely-populated blocks.
Their paper characterizes mark-sweep and semi-space collectors as implementing fixed policies for allocation and reclamation. Regardless of block residency, semi-space collectors evacuate all blocks, and mark-sweep promotes all blocks. Likewise, mark-sweep allocation via free lists will use space from all blocks (even heavily fragmented ones), and semi-space allocation with a bump pointer only allocates from completely unoccupied blocks. Their hybrid policy allows, for example, allocation from blocks that are at most 90% full, and evacuation for blocks that are at most 80% full.
Accurately measuring block residency requires a full-heap traversal, which would essentially double the GC’s overhead. Thus they produce an estimate of each block’s residency, by combining object age, previous residency measurements, and static configuration parameters.
Immix is another mark-sweep collector that uses copying to reduce fragmentation. The heap is divided into 32KB blocks, and blocks themselves are composed from 256 lines of 128 bytes each. Different threads allocate from separate blocks, using a bump-pointer allocator within spans of free lines.  When marking a block, Immix uses fragmentation (computed from residency statistics) to decide whether to mark/sweep or evacuate a block’s contents. (Compare/contrast: SBH evacuates sparse blocks; Immix evacuates fragmented blocks; the Metronome collector also defragments blocks, but does so post-sweep, using exact residency statistics, rather than heuristically during or before collection). Subsequent work has produced several variants of Immix, including precise and conservative reference-counting, that match the performance of the original mark/sweep implementation. There’s also an implementation in Rust.
Incremental Collection
An alternate way of looking at generational collectors is through the lens of incrementality. When the heap fills up, we must reclaim enough free memory to satisfy an allocation request. But a full collection of the heap is likely to reclaim far more free space than we need at the moment, and incur an unpleasantly large pause in doing so. For an interactive application, it would be nice to break up this big chunk of work into several smaller chunks. Collection algorithms which can collect a subset of the heap at a time are called incremental. If we squint a bit, generational collectors make the common case (nursery-only reclamation) incremental, but full-heap collections must be done atomically. There has of course been a great deal of work tackling the problem of making full-heap collection incremental.
The Memory-Constrained Copying collector (MC²) is a block-structured generational collector. It collects a single block at a time in order to avoid the thrashing behavior exhibited by simpler collectors under memory pressure. Each block in the old space has a logical number associated with it. While marking the old space, the collector constructs unidirectional remembered sets to allow separate collection of blocks in increasing order. Marking also computes the live volume in each block. Marking and copying are both done incrementally. Incremental copying is done when nursery collection happens, and operates on groups of blocks, chosen to restrict the size of their combined live data and remembered sets. Like many collectors that aim to limit pause times, MC² makes an effort to explicitly balance the rate of allocation and collection. Since the rate of old-generation collection is tied to the rate of nursery reclamation, which is in turn a function of young object survival rate and nursery size, MC² resizes the nursery after every collection. MC²’s evaluation showed that remembered sets stayed below 8% of the live data, and mark queue size didn’t exceed 0.7%. Throughput matched a generational mark-sweep collector, and mean pause times were reduced by an order of magnitude — from 264.68ms to 27.18ms. As an aside, the MC² evaluation includes data for pause time distributions and bounded mutator utilization, in addition to the usual throughput and max pause numbers. I wish every GC paper included such detailed performance data!
Incrementality is a restrained/restricted form of concurrency. Fundamentally, the work done by the garbage collector is being done in multiple steps, with the mutator running in between the steps. This means that the mutator can “see” — and modify — the objects that the collector is inspecting. In turn, we need more sophisticated invariants for the mutator and collector to correctly work together. In a non-incremental mark/sweep collector, there are two states of interest. White objects are dead-until-proven-live, and black objects are proven-live. Marking turns white objects black, and continues recursively until every pointed-to object is black. This establishes an invariant: black objects never point to white objects. At that point, sweeping frees white objects and un-marks black objects, turning them white.
But in an incremental (or concurrent) scheme, by definition the mutator will be running before marking establishes the aforementioned invariant. The danger is that the mutator’s action — for example by storing a reference to a white object into a field of a black object — can violate the invariant the collector needs to do its job. The solution is to introduce a new state: gray, which is an object that the collector has marked, but which might contain pointers to non-black objects. Gray essentially means “TODO.” Now, when the mutator stores a white reference into a black object, we can preserve the so-called strong tricolor invariant in one of two ways: either by marking the white object, or turning the black one back to gray.
As the name hints at, there is an alternative called the weak tricolor invariant. The basic intuition is that having a black object point to a white one is OK, as long as some other gray object exists that will eventually turn the white object black. The weak invariant is, in some sense, a dual of the strong one; preserving the strong invariant must block “creation” of white pointers in black objects, whereas the weak invariant focuses on deletions of white pointers from non-black objects.
The Compressor algorithm is a clean, modern compacting design. It makes clever use of mark bitmaps to enable incremental and parallel compaction. The basic problem is: once marking completes, the mark bitmap shows which objects are live and which are dead, but there’s no indication of where in the heap each live object should be moved to. Here is Compressor’s trick: first, divide the heap into small segments (on the order of 512 bytes). The volume of live data in each segment can be calculated in parallel by inspecting the mark bitmap. A parallel scan then quickly computes the starting point for the compacted data in each segment. Given this preprocessed offset table, the final position of each live object in each segment can be easily computed. In addition, the inverse mapping can be easily constructed, giving the from-space range that will map to each to-space segment. Unlike most other compaction algorithms, Compressor runs in one pass, without requiring extra per-object metadata such as forwarding pointers. One last trick: Compressor identifies dense blocks and doesn’t waste effort compacting them (just like MC²).
The concurrent variant of the Compressor is based on virtual memory protection. After concurrently computing the offset tables, the to-space pages are protected, and the mutators stopped to have their roots point to the protected to-space pages. When the mutators resume, the resulting VM traps trigger incremental concurrent compaction. In terms of the tricolor invariant, unprotected tospace pages are black as a result of compaction. Protected tospace pages are gray, and fromspace is white. The mutator is gray when it operates in fromspace, so it can deal unobstructed with white objects; when its roots are redirected to tospace, the mutator turns black.
Scalable Garbage Collection
I learned the other day that Google’s Compute Engine lets you provision memory for virtual machines independently of core count. So if your two core/four GB VM is running out of memory, you can turn it into a two core/16 GB VM, for about… four cents an hour. Pretty nifty! But this leads to an obvious concern: if we increase our heap sizes (and live data) fourfold, won’t our worst-case GC pause time also rise along with it? Stop-the-world pauses would take minutes to hours (!!!) on very large heaps. That’s obviously pretty bad; what can we do to avoid it? Can we limit the GC’s maximum pause time and space overhead, independent of heap size and mutator behavior? Felix Klock (who now works on Rust at Mozilla!) studied this in his super awesome dissertation.
The key thing is: we mustn’t ever need to traverse the full heap all at once. Reference counting can satisfy this property, but needs tracing to be complete, and also suffers from fragmentation, meaning space overhead is mutator-dependent and difficult to bound. (Incidentally, Firefox’s GC implementation has evolved from stop-the-world to incremental to generational to compacting; the latter step was entirely to reduce fragmentation).
So to avoid fragmentation and give reasonable space bounds, we’ll need to move objects.
Well, OK, but some objects are super popular — a significant fraction of the heap is just copies of a pointer to that object. Wouldn’t moving those objects mean we need to traverse most of the heap to update the pointers? Indeed, but there are two ways out of that predicament. One is to add a layer of indirection with a read barrier; this is what most real-time collectors do. Klock had a different insight: you can just elect not to move the popular objects! After all, if you’re worried about objects that might take up, say, 1/8th of your heap, well, there can’t be more than 8 of those! Defining popularity in terms of max heap size is no bueno, though: we mustn’t let the size of the heap dictate our worst-case workload. Instead, the trick is to divide the heap into fixed-size regions, and define the popularity threshold as a multiple of the region size, rather than (a fraction of) the heap size. But the insight remains the same: your program doesn’t execute in Lake Wobegon, and not all of the regions can be more popular than average.
Having divided up the heap into regions, we must be able to collect a single region independently of the others; this is how Klock achieves scalability. Simple to state, but the details get pretty hairy; one false move and we lose scalability. To collect a region, we have to know what objects in it are referenced by other regions — a remembered set. BUT WAIT! There are two kinds of remembered set representations, points-into and points-out-of; the former has worst-case quadratic space usage, so remembered sets must be points-out-of. BUT WAIT! Even that’s not enough; the aggregate of all remembered sets can still scale with heap size, which would mean that collecting a single region would again involve scanning O(heap) data. The solution is to have a points-into summary set for each non-popular region; summary set sizes are bounded by the popularity threshold, which is independent of heap size. Ah, BUT WAIT! to construct a summary set, we must scan the full-heap remembered set! Now we get into issues of scheduling: first, amortization: each pass over the remembered set tries to summarize a fixed fraction of the heap’s regions (abandoning the popular ones); and second, we can be a bit lazy: summarization only needs to happen when we’re running out of regions with usable summary sets, and can end early when “enough” regions have been summarized. BUT WAIT! If we end early, the mutator can invalidate the summaries we so lovingly crafted. We can keep the summary sets up to date with write barriers (careful to keep the work O(1) amortized!). BUT WAIT! If summary sets grow too large they must be discarded, which means that the rate of collection must roughly keep pace with the rate of program mutation. BUT it also mustn’t go “too fast” or we’d jeopardize the minimum mutator utilization guarantee that scalable collection promises.
Well, I haven’t even covered all the interesting details, like the mark stack and snapshotting, but I hope you get the idea of why I think Klock’s work is an undeservedly-obscure tour-de-force. Experimentally, Klock shows that regional collection’s throughput is only a bit lower than a generational collector, even though worst case pause times (and thus MMU) is greatly improved. The comparisons to other collectors illustrate many of the high-level tradeoffs between latency & throughput, and between guaranteed vs expected-case performance. For example, on one benchmark in a 2GB heap, the regional collector’s max pause time is 0.12s, while the G1 collector paused for 2.13s. But the G1 collector finished in 154s total, and the regional collector took 757s. The worst-case observed pause for the scalable collector on any benchmark was 240ms.
GC Scheduling
Klock’s work on regional garbage collection highlights that scheduling of garbage collection work versus mutator work is a fundamental issue. Such questions of scheduling appear in many places. The collector in Go 1.5 has a scheduling component (they refer to it as “pacing”) to balance heap growth and CPU usage. Chrome (that is, V8) explicitly tries to schedule garbage collection work during idle moments, so as to improve the browser’s responsiveness and reduce hiccups when playing animations (a soft-realtime task). And of course, much work on hard-realtime garbage collection centers around issues of scheduling and schedulability.
Hans Boehm published a tech report connecting scheduling to lazy reference counting. He observed that a consequence of deferred (“lazy”) RC is an increase in floating garbage, and presents an adversarial program that causes a lazy RC implementation to retain a quadratic amount of floating garbage. The solution he presents is to permit O(n) reclamation work to match an allocation of size n. This is strongly reminiscent of Klock’s observation that the collector must “keep pace” with the mutator.
Conservative Collection
OK, time for a quick detour on conservative collectors. I mentioned at the start of the post that garbage collection starts from roots: in particular, from registers and the stack. Often the compiler will emit data structures to help the runtime system figure out what registers and stack slots contain heap pointers, and which do not. Likewise, individual objects in the heap will be given a descriptor indicating which fields should or should not be traced. But sometimes people want to use GC with a compiler that doesn’t know anything about GC, like Clang or GCC. And it can be done, with a few caveats!
The core idea is to assume that anything that looks like a pointer is a pointer. This implies that conservative collectors can’t move objects, since they don’t know for sure whether they have an actual pointer or an integer that happens to look like a pointer. And even if it would be fun and easy, the collector shouldn’t go about tweaking the mutator’s hard-earned values.
One nifty trick is that conservative collectors can actually move objects in many circumstances. If the compiler provides just a tiny bit of information to the runtime about the layout of heap objects, you get a heap-precise design. Basically, pointers in registers or the stack are treated conservatively, but pointers within the heap can be positively identified, and thus moved/compacted/etc. Really, most of the issues and design decisions for non-conservative (aka fully-precise) collectors also apply to their conservative brethren. For example, Boehm-Demers-Weiser, the most well-known conservative collector, can run incrementally and generationally. And Ironclad C++ is an extension to Clang that supports heap-precise GC.
GC Case Study: Go
Early versions of Go used Boehm-Demers-Weiser. This brought two major problems. First, the conservative nature meant that Go programs on 32-bit machines would sometimes spuriously run out of memory. One widely-derided post on the golang-nuts mailing list suggested a workaround: stick to integer values below 10000. Heh. This situation improved with Go 1.1, which was heap-precise, and Go 1.3, which was fully-precise.
Second, multi-gigabyte heaps had multi-second stop-the-world pause times. With the release of Go 1.5, they implemented a concurrent mark-sweep collector, which reduced pause times from multi-second to a few milliseconds. For a more real-world take, Twitch published a blog post detailing how their observed pause times were reduced from multi-second with Go 1.2, to 200 ms with Go 1.5, to 100ms with Go 1.6, to 1ms with Go 1.7.
Note that Go’s garbage collector doesn’t deal with fragmentation. This may or may not be an acceptable engineering tradeoff for Go’s use cases.
On Bump Pointer Allocation
Many discussions about garbage collection eventually touch on the benefits of bump-pointer allocation. Intuitively, it seems like it should be much faster to just do an increment and conditional jump, rather than searching through a bunch of freelists. Case in point: the Hacker News discussion regarding the aforementioned Twitch blogpost. Surprisingly, this turns out to be a vastly smaller difference than most people would think. Measurement from Blackburn et al showed that allocation was < 10% of program runtime, and the advantage of bump-pointer was about 10% faster; therefore the overall throughput advantage is about 1% of program runtime. That’s about the same performance change to be had from passing extra unused environment variables to your program.
There’s at least one study comparing the throughput of malloc/free versus garbage collection. The result found was that if you have a lot of free memory to work with (about 3x your live data), tracing GC is just as fast as malloc/free, but at smaller heap sizes, you spend a lot of time re-inspecting your live data to determine what’s dead. In effect: when free memory is limited, GC appears to have a soft failure case (slowing down) that malloc/free avoids. But that’s not really the whole story. These benchmarks are fundamentally batch workloads that run for anywhere from seconds to hours, then exit. But on the web, both servers and browsers can run continuously for days, months, or years at a time [4]. On such long-term workloads, there’s another factor that comes into play: fragmentation. Any non-moving collector (malloc/free, reference counting, and mark/sweep) will hit some amount of fragmentation. As a failure mode, fragmentation’s costs manifest in two ways. First, there’s a “global” program slowdown from increased effective working set & decreased locality; second, if fragmentation gets bad enough, allocation will eventually fail, and your program will crash (or, uhm, corrupt data and then crash).
So if you want to not crash, you need to size your heap as some multiple of your live data. How much fragmentation do you need to account for? There are a few research results on the theory side. First, Robson proved in the 1970s that the theoretical worst-case is logarithmic in the size of the maximum allocation. That’s pretty atrocious; even supporting piddly 1MB arrays means potential overhead of 20x! More recently, Cohen and Petrank showed that compaction can drop the needed overhead in (one particular set of) realistic circumstances to… 3.5x. Amusingly similar to the numbers cited by Hertz and Berger!
As for practice, there is a smattering of research with empirical measurements of memory fragmentation. Johnstone and Wilson showed that on a selection of 8 programs that allocated 27 MB on average, Doug Lea’s malloc implementation limited fragmentation to 4% in the worst case [5]. This is a good result to know, but I’m not confident it can be usefully extrapolated to long-running web servers or browsers, which (A) allocate several orders magnitude more data, and (B) have nondeterministic allocation patterns, dependent on network and/or user input. Supporting this skepticism is a study from Bohra and Gabber measuring several malloc implementations on a long-running caching web proxy. They found that fragmentation varied by allocator from 30% to heap exhaustion (dlmalloc clocked in at 80% overhead). They also noted that the impetus of their study was noticing that their heap was several times larger than their live data. Bacon, Cheng, and Rajan echo this limitation in their summary of J&W. They also show that fragmentation in SPECjvm98 can reach nearly 40%. More recently, Jula and Rauchwerger benchmarked larger, more intensive workloads on a few different allocators, finding that (A) dlmalloc had 10.8% average and 23.5% worst cast fragmentation, and (B) faster allocation came at the expense of more fragmentation.
Real World Fragmentation
I was (and am!) curious about how much of a problem fragmentation is in practice. I’ve seen lots of statements acknowledging that fragmentation is a big enough problem to be worth avoiding (examples: Mozilla’s Stuart Parmenter in 2007, Huey, Coppeard & Nethercote 2011-2016, Facebook in 2011Aerospike in 2015suniphrase in 2015, Deep in 2016, Jonathan Blow in 2016, gperftools#756).
Most of the time, people don’t provide much in the way of hard data on what sort of fragmentation they’re seeing — not even heap sizes, much less allocation size distributions. When people do provide numbers, they tend to be eye-opening. Aerospike reported that “big-memory (192 GB RAM) server nodes were running out of memory and crashing again within days of being restarted.” In 2012, Kirill Korotaev exhibited a ~20 line program, distilled from a real app, that caused 20x (!) fragmentation overhead in glibc’s allocator, needing 1.8 GB for a 93 MB live set.
Avoiding the space overhead of fragmentation can have significant time costs, even for systems that you don’t usually think of as doing garbage collection. For example, Julia Evans hit a case where the Linux kernel spent 30% of her program’s runtime compacting pages. CloudFlare also has a blog post describing how a form of GC in the Linux kernel for network packet metadata caused hard-to-reproduce pauses of 30s for requests that usually finished in milliseconds. Finally, GCC also has custom infrastructure for GCing the compiler’s allocations; I haven’t measured its usage directly.
TLB Reach
Did you know that modern large-memory workloads can suffer double-digit performance overhead from virtual memory? (studies claiming (mean // max): (9 // 18)% overhead for 8 GB of RAM, (? // 16)% overhead for 16 GB, and (?6? // 53)% for 96 GB using 2 MB pages, and (18 // 40)% for 4 GB VMs on a 24 GB server) The middle study notes that traditional benchmark suites, like SPEC and Parsec, have small working sets that don’t stress the TLB — unlike modern servers! Also, these days you can buy servers with multiple terabytes (!!) of physical memory. There is active research on reducing TLB performance overhead, but it’s anyone’s guess when — or if — such schemes will be available on commodity hardware. On the plus side, virtualization doesn’t appear to significantly increase TLB overhead. Finally, here’s a case where Linux’s support for automatically using large pages caused a 10x performance regression.
Pushing the Envelope
We’re getting close to the end, I promise! Let’s wrap up with brief descriptions of a few interesting collectors that have pushed the boundaries of real-time and concurrent collection.
A fundamental problem faced by all garbage collectors is controlling fragmentation. Most non-real-time collectors take one of two approaches:
  1. Ignore the issue by not moving objects, hoping that the mutator doesn’t get unlucky and induce fragmentation that free-block-coalescing cannot handle; or
  2. Move objects, but in doing so, trade space- for time-problems (in a sense) and accept that the copying/defragmentation process may need to block the mutator while it clears out a fragmented section of the heap.
The Schism collector is one of a small set of real-time collectors that takes a different tack. It eliminates external fragmentation by using fixed size chunks for all allocation requests. Objects larger than the chunk size are represented with a linked list of chunks. Arrays are represented with two layers of indirection, as a fixed-size object header pointing to a variable-length spine which contains pointers to the fixed-size chunks holding the array elements. And in the common case where contiguous space is available, Schism can use it for arrays, thus avoiding the overhead of spine indirections. Object fragments and array data chunks are managed in a non-moving, concurrently-collected Immix-structured heap.
Array spines are the only variable-size allocations, and thus the only possible source of fragmentation. But here’s a cool trick: since the data blocks never move, the pointers in the spine never change, and the spine is thus immutable. This in turn means that when spines are copied, the mutator is free to use either copy. Furthermore, since the mutator is forced to indirect to the spine through a so-called sentinel object, updating the mutator takes only a single compare-and-swap.
The benefits provided by this scheme are:
  1. Bounded fragmentation: Schism will never need more than 31.04% extra space beyond that required for live data (this does not account for internal fragmentation).
  2. Constant time wait-free heap access for the mutator.
  3. Sub-millisecond pause times on a 40 MHz processor
  4. Excellent throughput for a real-time collector, only 1/3rd lost versus the desktop Hotspot 1.6 JVM.
Since Java doesn’t support interior pointers, neither does their collector, although I believe that interior pointers for object fields would be an easy extension. Support for interior pointers within arrays would be trickier but still doable, at least in certain circumstances.
To the best of my knowledge, Schism has never been publicly benchmarked on an x86 or ARM CPU, nor on a multi-gigabyte heap. But I’m very curious to know how it would do in those situations! One last cool bit: the core of Schism’s collection algorithm has been verified against the x86-TSO memory model in the Isabelle/HOL proof assistant.
Block-Free G1
Building on their earlier work on lock-free concurrent compaction, Erik Österlund and Welf Löwe added a syscall to let the GC introspect on the OS scheduler’s thread state, which allows them to perform critical operations — stack scanning and copying — without ever blocking mutator threads. Memory management for concurrent data structures with progress guarantees is a hard open problem, and this line of work makes a significant step towards a solution. Their modifications to OpenJDK’s G1 collector reduced 99.99th percentile latency from 174ms to 0.67ms for one of their benchmarks. Overall, average GC pause time dropped 8-fold, at the cost of 15% degraded throughput.
One important point they make is that GC algorithms which use virtual memory (mprotect) can’t be non-blocking, because the page tables can only be modified when the kernel holds a non-preemptible lock. I don’t have much intuition for the costs of page table modifications. It would be very interesting to see what impact that source of blocking would have on an otherwise non-blocking GC design. Of note: Pauseless, C4, Collie, and Concurrent Compressor all use mprotect. Shenandoah eschews mprotect to avoid read storms.
Shenandoah is an effort to produce an open-source, industrial strength low-pause-time garbage collector, as part of the OpenJDK project. Broadly speaking, it aims to be a fully-concurrent and -parallel mark-copy collector, using both read & write barriers to enable concurrent compaction. As far as I know, there haven’t been any academic publications on Shenandoah yet Christine Flood recently published a paper on Shenandoah; there’s also a smattering of blog posts and slide decks from presentations. Shenandoah is available in the default JVM distributed with Fedora 24. It would be very interesting to see a head-to-head comparison of Shenandoah with Österlund & Löwe’s OS-dependent G1 variant.
Pauseless and C4
In 2005, Azul Systems published a paper detailing the garbage collection algorithm, and associated support from their custom CPU and OS, that allowed them to obtain low pause times (21ms) on very large (50+ GB) heaps.  The state of the art until the early 2000’s was 10-20% maximum runtime overhead from software read barriers, with average-case overhead of roughly half that. But Azul’s hardware-supported read barrier has an average cost of one cycle(!), and overall mutator utilization was reported at 98%.

In contrast to Compressor, which protects (all) tospace pages to trigger evacuation, Pauseless protects fromspace pages (incrementally). Pauseless continuously runs a concurrent and parallel mark phase (like G1), which generates mark bits and computes forwarding information (on-the-side). Using residency statistics, Pauseless protects sparsely occupied pages. Such pages are evacuated, and the underlying physical memory immediately freed. There will still be references in the remainder of the heap pointing into evacuated pages; the mutator and collector cooperate to find and correct such pointers.

When the mutator loads a reference to an unmarked object, a hardware trap notifies the marker threads. When the mutator loads a reference to an object on a protected (and thus, likely evacuated!) page, the forwarding tables are queried to find the fresh copy, and the memory containing the original stale reference is updated (the paper refers to this as “self healing”). The collector finishes a GC cycle by traversing the entire heap to find any remaining pointers into fromspace, and fixing up the stragglers. Now, having established the invariant that all pointers in the heap point into tospace, the virtual memory for fromspace pages can be reused.

C4, the Continuously Concurrent Compacting Collector, adds generations to Pauseless. Unlike most generational designs, the nursery is not managed with a stop-the-world copying collector. Rather, both the nursery and mature space are essentially separate Pauseless heaps, with a precise card marking remembered set. They added new Linux system calls to support batched and non-TLB-flushing remappings, yielding order-of-magnitude speedups versus the stock kernel. The upshot is that C4’s throughput matches ParallelGC in Java while maintaining small-in-practice pauses.
On the subject of Azul’s port to x86, Cliff Click said:

Azul’s GC required a read-barrier (some instructions to be executed on every pointer read) – which cost something like 5% performance on an X86. In exchange max GC pause time is something in the low microsecond range

and noted that they didn’t make use of any novel x86 features:

No change to the X86, instead user-mode TLB handler from RedHat allows ptr-swizzeling, that plus some careful planning and the read barrier fell to 2 X86 ops – with acceptable runtime costs.

The core idea behind Collie, which builds on Pauseless and C4, is to use hardware transactional memory to atomically and incrementally compact objects and update all incoming pointers. Obviously this cannot be done for popular objects, so popular objects are temporarily pinned, with the mutator and collector collaboratively updating popular references. There are of course many more details covered in the paper.
Like the other collectors from Azul, Collie is an engineering marvel. It pulls out pretty much every trick in the book: read barriers, write barriers, non-global checkpoints, partial conservativism, virtual memory and HTM support. Like all good engineering, Collie makes careful tradeoffs, in exchange for desirable results: Collie delivers an absolutely staggering 80-90% MMU (!) at 1ms resolution (!!) for heap sizes up to 30 GB (!!!) when continuously collecting. One issue I wish the paper had covered more was the effective space overhead of their per-object remembered sets; it’s unclear whether the space overhead for their MMU benchmarks was 50%, or 400%, or something completely different.


[0] Seven thousand words is still a “whirlwind,” right?
[1] There’s one more quirk of mutator utilization, which is that it can be “inflated” by making the mutator less efficient. This can occur for two reasons. First, different language implementations have different “baseline” levels of generated-code quality, so the same GC implementation will appear to have a better MMU for a slower language. Second, the overhead of high-resolution measurement with tools like Pin can introduce similar effects. But in practice that’s not an issue most people need to worry about.
[2] To better match the human intuition that the mutator’s utilization doesn’t actually get worse at larger intervals, we can instead focus on bounded mutator utilization (BMU). Numerically, BMU at timescale x is MMU at any timescale x or larger. Graphically, BMU is the monotonically increasing graph you get from MMU with the “peaks” flattened out.
[3] It’s common to use memcached as a caching filter in tree-structured RPCs within a datacenter. This pattern of usage means that tail latency at nodes very quickly becomes median latency at the root of the tree, which is why it’s critical to bound worst-case latency at the nodes and not just settle for good median or average latency. When you want sub-millisecond level QoS, even the OS scheduler might need to be re-examined…
[4] Here’s a real story of a long-running server that hit a GC failure case. Interestingly, the problem wasn’t fragmentation, but rather a generational collector encountering object demographics that didn’t match the generational hypothesis.
[5] They also point out that there’s no single definition for how to measure fragmentation, and highlight a few potential definitions, each with various shortcomings.

3 thoughts on “Summarizing Garbage Collection

  1. Hi Reno, I intentionally left out interactions between GC and paging, primarily because in the last decade, the trend in many server environments (which often drive GC research) has been towards disabling demand paging and running everything in main memory. It’s not unreasonable these days to run servers with more than a terabyte of physical memory! Pretty much the only sane way to operate such a server is to never page to disk.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s