Summarizing Garbage Collection

This post is intended to give a whirlwind[0] tour of what I think are some highlights of the last decade-ish of the research literature on garbage collection. If you’re interested and want to learn more, the Garbage Collection Handbook (via Amazon) is an indispensable resource. Ravenbrook’s Memory Management Reference is also quite nice!
There are four “elemental” algorithms for garbage collection: reference counting, mark/sweep, compacting, and copying. The latter three are referred to as tracing collectors.

Reference counting tracks how many copies of each pointer there are in the heap. When the heap contains zero copies of a pointer, the memory it points to can be reclaimed.

The three tracing techniques all start the garbage collection phase by iteratively following the pointers stored in registers, the stack, and globals (called roots) to determine the live subset of allocated data that the program can access (this is the “mark” part of mark/sweep). Allocations that couldn’t be reached from the program’s roots are dead and can be reclaimed, in one of several ways. The “sweep” part of mark/sweep conceptually calls free() for each dead allocation. Compaction takes live objects and moves them into the holes left by the dead. Copying collectors start with the heap split into two halves, with all live data in one of the halves. Instead of sweeping or compacting, each live object is moved into the spare half. When all live objects have been copied out, only dead objects remain in the original half, and it can be reused without further processing. 
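
To make the tracing half concrete, here is a minimal mark/sweep sketch in Go. It works over a toy object graph rather than a real heap; the Obj type, the explicit root list, and the recursive mark are all simplifications for illustration.

package main

import "fmt"

// Obj is a toy heap object: a mark bit plus outgoing references.
type Obj struct {
	name   string
	marked bool
	refs   []*Obj
}

// mark recursively blackens every object reachable from the roots.
func mark(roots []*Obj) {
	for _, r := range roots {
		if r != nil && !r.marked {
			r.marked = true
			mark(r.refs)
		}
	}
}

// sweep "frees" every unmarked object and clears mark bits for the next cycle.
func sweep(heap []*Obj) (live []*Obj) {
	for _, o := range heap {
		if o.marked {
			o.marked = false // reset for the next collection
			live = append(live, o)
		} else {
			fmt.Println("reclaiming", o.name) // conceptually: free(o)
		}
	}
	return live
}

func main() {
	a := &Obj{name: "a"}
	b := &Obj{name: "b", refs: []*Obj{a}}
	c := &Obj{name: "c"} // unreachable: will be swept
	heap := []*Obj{a, b, c}
	roots := []*Obj{b}

	mark(roots)
	heap = sweep(heap)
	fmt.Println("live objects:", len(heap))
}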

Much of the garbage collection research of the last few decades can be seen as schemes to extend, combine, or refine these basic techniques.
Describing Garbage Collector Performance
On a uniprocessor, we have a simple mental model for how GC works: the program runs for a while, until it’s in danger of running out of memory. Then the program is paused while the garbage collector runs to reclaim memory, after which the program resumes. The sum of these pauses is how much slower the program gets overall, and corresponds to GC throughput.
A rule of thumb I find useful: when a paper says they improved throughput, what they really mean is they cut down on their collector’s unnecessary work.
In most real-life situations, though, throughput (raw overhead) is not the only metric that matters. You also want to preserve your program’s responsiveness (to mouse clicks, network packets, the audio or graphics subsystems, or whatever else), since pausing your program to run the GC for a long time will result in a frozen GUI, audio glitches, dropped packets/frames, etc. As is often the case in engineering, there’s a tradeoff between these two goals. Techniques that improve throughput usually degrade pause time, and vice versa. Put another way: reducing (worst case) pause time comes at the price of making the program run slower overall.
But max pause time by itself can be misleading! A series of short GC pauses with only a few mutator instructions between each pause will usually “feel” like one big pause. Mutator utilization better captures the notion of “responsiveness”. The minimum mutator utilization (MMU) is a number that says, at a particular timescale, what fraction of that time was spent executing non-GC code in the worst case. MMU is usually presented as a graph: the x-intercept is the largest pause time, and the y-intercept is the overall throughput; the shape of the graph in between tells you the mutator’s worst-case responsiveness at different timescales.

MMU example graph. The purple stripes represent GC pauses; note the “sawtooth edge” of the graph

One counterintuitive consequence of the above definition is that MMU sometimes goes down (!) at larger timescales [1]. Suppose the GC introduces a single pause of 10ms. Then MMU at 10ms is zero, MMU at 20ms is 50%, and MMU at 10s is 99.9%. If the GC introduces a second pause, starting 90 ms after the start of the first one, then the MMU at 90ms is 88.8%, but the MMU at 100ms is only 80%, since there is a 100ms window which includes two 10ms pauses [2].
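
If you want to check numbers like these yourself, here is a brute-force MMU calculator sketched in Go; the millisecond granularity and the two-pause schedule are assumptions chosen to mirror the example above.

package main

import "fmt"

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

func maxInt(a, b int) int {
	if a > b {
		return a
	}
	return b
}

// mmu returns the minimum mutator utilization for a window of windowMs
// milliseconds, given a trace traceMs long and GC pauses as [start, end)
// intervals in milliseconds.
func mmu(windowMs, traceMs int, pauses [][2]int) float64 {
	worst := 1.0
	for start := 0; start+windowMs <= traceMs; start++ {
		paused := 0
		for _, p := range pauses {
			lo := maxInt(start, p[0])
			hi := minInt(start+windowMs, p[1])
			if hi > lo {
				paused += hi - lo
			}
		}
		u := float64(windowMs-paused) / float64(windowMs)
		if u < worst {
			worst = u
		}
	}
	return worst
}

func main() {
	// Two 10ms pauses, the second starting 90ms after the start of the first,
	// in a 10-second trace.
	pauses := [][2]int{{0, 10}, {90, 100}}
	for _, w := range []int{10, 20, 90, 100, 10000} {
		fmt.Printf("MMU at %5dms window: %.3f\n", w, mmu(w, 10000, pauses))
	}
}
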
Another “gotcha” when evaluating MMU is that it’s very difficult to account for every last bit of overhead at sub-millisecond granularity. To the best of my knowledge, nobody has ever explicitly factored the effect of GC-introduced cache misses or minor page faults into their MMU calculations. Capturing such events is possible-but-hard; figuring out which are “the GC’s fault” is trickier. And it can be an issue: Click et al observed that, for some concurrent collectors, simply summing the “big” pauses led to reported overhead that was one sixth of the true figure!
I wrote up a little mutator utilization grapher that you can play around with if you’re curious.
Another fundamental property of collectors is space overhead. This takes many forms: internal and external fragmentation, metadata overhead, and reserved areas of the heap. As is often the case, space trades off with time: larger heaps can be collected less frequently, thus increasing throughput, but each collection of a larger heap takes longer, thus increasing pause time.

To recap: garbage collector designs encounter “classical” engineering tradeoffs in optimizing for average vs worst case performance. The GC community describes the tradeoff with networking-inspired terminology: throughput vs latency. MMU graphs capture this tradeoff and give a visual presentation of collector performance.
Tracing vs Refcounting
One really cool “observation” paper to come out of the memory management community is A Unified Theory of Garbage Collection by Bacon, Cheng, and Rajan. This is one of my favorite types of research results: bringing together two separate schools of thought with a mix of theoretical and practical insights that’s almost blindingly obvious… once you’ve been told what to look for!
On the theory side, they observe that tracing and refcounting are very nearly precise duals of each other. Tracing starts with known-live values and assumes everything else is dead until proven otherwise; it propagates liveness by raising a one-bit reference count (the mark bit) from zero to one. Reference counting starts with known-dead values and propagates the consequences of their death by decrementing word-sized reference counts until they reach zero. Tracing cannot reclaim individual objects without RC-flavored techniques, and reference counting cannot reclaim object cycles without tracing-flavored techniques.
On the practice side, they note that while the simplest implementations of tracing and reference counting have complementary properties, optimizations to either algorithm tend to result in systems that face similar trade-offs. For example, generational and incremental collectors that reduce average pause time need to use write barriers that raise mutator overhead, whereas techniques like deferred reference counting, which reduce the mutator overhead from barrier-like refcount operations, result in delayed collection and longer GC pauses! The paper also evaluates the storage reclamation in the Log Structured Filesystem, and how GC tradeoffs in the OS/filesystem context differ from a language runtime’s choices. Like I said, one heck of a paper!
The Alchemy of Garbage Collection
Okay, now we’re ready to start looking at specific refinements of the elemental algorithms. As mentioned above, generational collection is one of the most common refinements to basic tracing. The scheme is motivated by two insights. The first is that programs tend to allocate a lot of temporary data. Generational collectors take the heap and split off a nursery for new objects. When the nursery fills up, it will usually contain mostly dead temporary objects, which are “free” to reclaim with a copying collector. But what if an object in the main heap refers to something in the nursery? This is the second insight: older objects can only come to point to newer objects via mutation, which is comparatively rare (for older objects). To make sure that a young-generation collection doesn’t miss any objects, a remembered set records which locations in the main heap hold pointers into the nursery, and allows the nursery collection to proceed without having to scan the rest of the main heap. See the Handbook for implementation details.
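
As a sketch of the mechanism (in Go), a write barrier on pointer stores is what keeps the remembered set current; the Obj type, the single ref field, and the object-granularity remembered set are invented for illustration, and real collectors often use card tables or slot-level sets instead.

package main

import "fmt"

type Obj struct {
	id    int
	young bool // true if the object lives in the nursery
	ref   *Obj // a single pointer field, for simplicity
}

// remembered records main-heap objects that may hold pointers into the nursery.
var remembered = map[*Obj]bool{}

// writeRef is the write barrier: every pointer store goes through it. If an
// old object is made to point at a young one, remember the old object so a
// nursery collection can treat it as an extra root.
func writeRef(src, dst *Obj) {
	src.ref = dst
	if !src.young && dst != nil && dst.young {
		remembered[src] = true
	}
}

func main() {
	old := &Obj{id: 1}
	young := &Obj{id: 2, young: true}

	writeRef(old, young) // creates an old-to-young pointer: recorded

	// A nursery collection now scans the roots plus the remembered set,
	// instead of scanning the entire main heap.
	for src := range remembered {
		fmt.Printf("object %d in the main heap points into the nursery\n", src.id)
	}
}
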
There are a variety of refinements to reference counting. The most basic is deferred reference counting, which reduces the overhead of RC by ignoring intermediate updates. (This is an instance of the very handy algorithmic trick of epochs, which pops up in many subfields of systems research). Whereas standard RC immediately frees objects (recursively) when their counts reach zero, deferred RC with a zero count table (ZCT) instead says, “Oh, I’ll get to this later…” and stores the object in the ZCT. When it comes time to actually reclaim objects, anything in the ZCT that is not reachable from the program’s roots is garbage. (Note that this means we’ve introduced tracing to our RC scheme!).
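
In the same spirit, a minimal sketch of deferred RC with a ZCT, in Go; counting only heap references, the map-based ZCT, and the single reclamation pass are simplifications for illustration.

package main

import "fmt"

type Obj struct {
	id int
	rc int // counts references from the heap only; stack references are ignored
}

var zct = map[*Obj]bool{} // zero count table: candidates for reclamation

// decHeapRef is called when a heap reference to o is overwritten. Instead of
// freeing immediately when the count hits zero, park the object in the ZCT.
func decHeapRef(o *Obj) {
	o.rc--
	if o.rc == 0 {
		zct[o] = true
	}
}

// collect reclaims every ZCT entry that is not reachable from the roots.
// Note the tracing flavor: we consult the roots before trusting a zero count.
// (A real implementation would also decrement the counts of the reclaimed
// object's children, possibly adding them to the ZCT.)
func collect(roots []*Obj) {
	rooted := map[*Obj]bool{}
	for _, r := range roots {
		rooted[r] = true
	}
	for o := range zct {
		if !rooted[o] {
			fmt.Println("reclaiming object", o.id)
			delete(zct, o)
		}
		// rooted entries stay in the ZCT; they may still become garbage later
	}
}

func main() {
	a := &Obj{id: 1, rc: 1}
	b := &Obj{id: 2, rc: 1}
	decHeapRef(a)      // a's only heap reference goes away
	decHeapRef(b)      // so does b's
	collect([]*Obj{a}) // a is still referenced from the stack, so only b is reclaimed
}
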
Ulterior reference counting is a variant of generational collection in which the mature space is managed using (deferred) reference counting instead of tracing. When the nursery fills up, the survivors are evacuated to the reference-counted mature space. Reference counts in the mature space are only computed at the end of a nursery collection. The underlying motivation is that tracing and RC have complementary properties which match the demographics of the nursery and old space, respectively. The URC evaluation, done in 2004, shows that their throughput matches a generational mark/sweep collector, while reducing max pause times from roughly 200ms to 50ms in heaps from 20 to 100 MB. URC also generally needed space overhead of less than 80% to avoid thrashing, whereas generational mark/sweep often needed 300% space overhead.
Shahriyar, Blackburn, and Frampton compare an optimized lazy reference counting system to non-generational mark-sweep, URC and the Immix collector (we’ll get to it soon). The “traditional” reference counting implementation degrades benchmark throughput by 30% on average versus mark-sweep, and their optimizations eliminate the disparity. They only compare throughput, not pause times, mutator utilization, floating garbage, or sensitivity to heap size. The paper provides some interesting statistics on object popularity: on their benchmarks, 99.35% of objects have reference counts that fit in 3 bits, but the remaining 0.65% account for nearly 23% of reference counting operations.
Terei, Aiken, and Vitek had a different idea: why not let the programmer decide where to use tracing and refcounting within their own program? Their paper highlights memcached as a (very) realistic example of a program that challenges most GCs, because memcached servers must use nearly all the available physical memory without swapping or pausing [3]. Their multi-memory-management prototype showed that giving the programmer a fine-grained choice between two simple GC algorithms resulted in overall program performance that matched that of (unsafe) manual memory management with custom allocators. The M^3 prototype combines mark-sweep and a reference-counted heap. Swept objects in the traced heap are inspected for pointers into the RC heap, which must be decremented. Mutations to objects in the RC heap must update (insert into or remove from) a remembered set for the traced heap to use. No bound exists on the potential size of the remembered set; it is up to the programmer to choose a type partitioning that doesn’t result in too many cross-heap pointers.
Quick aside: another paper that proposed an extended interface between the programmer and garbage collector is the Data Structure Aware collector by Cohen and Petrank. The idea is to assume that the contents of container data structures are reachable until the programmer indicates otherwise. This allows them to avoid tracing memory that is unlikely to be reclaimable. In exchange for modifying about 8 lines of code, their technique reduced GC time by between 11% and 76% for the various benchmarks they studied.
Hybrids and Heuristics
There are also hybrid schemes that don’t involve reference counting. Both of the following hybrid algorithms successfully combine the large-heap throughput of semispace collection with the small-heap robustness of mark-sweep and the reduced memory requirements of compaction.
Spoonhower, Blelloch, and Harper published a collector design that dynamically switches between copying and mark/sweep collection. Building on Bartlett’s mostly-copying collector, they divide the heap into fixed-size blocks, and use estimates of block residency to determine at collection time whether a given block should be evacuated (which combats fragmentation and is cheap for blocks containing few objects) or promoted in place, which avoids the cost of copying densely-populated blocks.
Their paper characterizes mark-sweep and semi-space collectors as implementing fixed policies for allocation and reclamation. Regardless of block residency, semi-space collectors evacuate all blocks, and mark-sweep promotes all blocks. Likewise, mark-sweep allocation via free lists will use space from all blocks (even heavily fragmented ones), and semi-space allocation with a bump pointer only allocates from completely unoccupied blocks. Their hybrid policy allows, for example, allocation from blocks that are at most 90% full, and evacuation for blocks that are at most 80% full.
Accurately measuring block residency requires a full-heap traversal, which would essentially double the GC’s overhead. Thus they produce an estimate of each block’s residency, by combining object age, previous residency measurements, and static configuration parameters.
Immix is another mark-sweep collector that uses copying to reduce fragmentation. The heap is divided into 32KB blocks, and blocks themselves are composed from 256 lines of 128 bytes each. Different threads allocate from separate blocks, using a bump-pointer allocator within spans of free lines.  When marking a block, Immix uses fragmentation (computed from residency statistics) to decide whether to mark/sweep or evacuate a block’s contents. (Compare/contrast: SBH evacuates sparse blocks; Immix evacuates fragmented blocks; the Metronome collector also defragments blocks, but does so post-sweep, using exact residency statistics, rather than heuristically during or before collection). Subsequent work has produced several variants of Immix, including precise and conservative reference-counting, that match the performance of the original mark/sweep implementation. There’s also an implementation in Rust.
Incremental Collection
An alternate way of looking at generational collectors is through the lens of incrementality. When the heap fills up, we must reclaim enough free memory to satisfy an allocation request. But a full collection of the heap is likely to reclaim far more free space than we need at the moment, and incur an unpleasantly large pause in doing so. For an interactive application, it would be nice to break up this big chunk of work into several smaller chunks. Collection algorithms which can collect a subset of the heap at a time are called incremental. If we squint a bit, generational collectors make the common case (nursery-only reclamation) incremental, but full-heap collections must be done atomically. There has of course been a great deal of work tackling the problem of making full-heap collection incremental.
The Memory-Constrained Copying collector (MC²) is a block-structured generational collector. It collects a single block at a time in order to avoid the thrashing behavior exhibited by simpler collectors under memory pressure. Each block in the old space has a logical number associated with it. While marking the old space, the collector constructs unidirectional remembered sets to allow separate collection of blocks in increasing order. Marking also computes the live volume in each block. Marking and copying are both done incrementally. Incremental copying is done when nursery collection happens, and operates on groups of blocks, chosen to restrict the size of their combined live data and remembered sets. Like many collectors that aim to limit pause times, MC² makes an effort to explicitly balance the rate of allocation and collection. Since the rate of old-generation collection is tied to the rate of nursery reclamation, which is in turn a function of young object survival rate and nursery size, MC² resizes the nursery after every collection. MC²’s evaluation showed that remembered sets stayed below 8% of the live data, and mark queue size didn’t exceed 0.7%. Throughput matched a generational mark-sweep collector, and mean pause times were reduced by an order of magnitude — from 264.68ms to 27.18ms. As an aside, the MC² evaluation includes data for pause time distributions and bounded mutator utilization, in addition to the usual throughput and max pause numbers. I wish every GC paper included such detailed performance data!
Incrementality is a restrained/restricted form of concurrency. Fundamentally, the work done by the garbage collector is being done in multiple steps, with the mutator running in between the steps. This means that the mutator can “see” — and modify — the objects that the collector is inspecting. In turn, we need more sophisticated invariants for the mutator and collector to correctly work together. In a non-incremental mark/sweep collector, there are two states of interest. White objects are dead-until-proven-live, and black objects are proven-live. Marking turns white objects black, and continues recursively until every pointed-to object is black. This establishes an invariant: black objects never point to white objects. At that point, sweeping frees white objects and un-marks black objects, turning them white.
But in an incremental (or concurrent) scheme, by definition the mutator will be running before marking establishes the aforementioned invariant. The danger is that the mutator’s action — for example by storing a reference to a white object into a field of a black object — can violate the invariant the collector needs to do its job. The solution is to introduce a new state: gray, which is an object that the collector has marked, but which might contain pointers to non-black objects. Gray essentially means “TODO.” Now, when the mutator stores a white reference into a black object, we can preserve the so-called strong tricolor invariant in one of two ways: either by marking the white object, or turning the black one back to gray.
As the name hints at, there is an alternative called the weak tricolor invariant. The basic intuition is that having a black object point to a white one is OK, as long as some other gray object exists that will eventually turn the white object black. The weak invariant is, in some sense, a dual of the strong one; preserving the strong invariant must block “creation” of white pointers in black objects, whereas the weak invariant focuses on deletions of white pointers from non-black objects.
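
The two barrier flavors can be sketched like so (Go): an insertion (Dijkstra-style) barrier that preserves the strong invariant, and a deletion (Yuasa-style) barrier that preserves the weak one. The shade function stands in for pushing onto the collector's gray worklist, and the single pointer field per object is a simplification.

package main

import "fmt"

type Color int

const (
	White Color = iota // dead until proven live
	Gray               // marked, but children not yet scanned ("TODO")
	Black              // marked, and children scanned
)

type Obj struct {
	color Color
	field *Obj // a single pointer field, for simplicity
}

var grayList []*Obj // the collector's worklist

// shade makes a white object gray and queues it for scanning.
func shade(o *Obj) {
	if o != nil && o.color == White {
		o.color = Gray
		grayList = append(grayList, o)
	}
}

// insertionBarrier (Dijkstra-style) preserves the strong invariant: a black
// object must never point to a white one, so shade the new target.
// (The other option from the text: turn the black source back to gray.)
func insertionBarrier(src, newRef *Obj) {
	if src.color == Black {
		shade(newRef)
	}
	src.field = newRef
}

// deletionBarrier (Yuasa-style) preserves the weak invariant by shading the
// target being overwritten, so a white object cannot lose its last gray path.
func deletionBarrier(src, newRef *Obj) {
	shade(src.field)
	src.field = newRef
}

func main() {
	blackObj := &Obj{color: Black}
	whiteObj := &Obj{color: White}
	insertionBarrier(blackObj, whiteObj)
	fmt.Println("target shaded gray after insertion barrier:", whiteObj.color == Gray)

	other := &Obj{color: White}
	holder := &Obj{color: White, field: other}
	deletionBarrier(holder, nil)
	fmt.Println("old target shaded gray after deletion barrier:", other.color == Gray)
}
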
The Compressor algorithm is a clean, modern compacting design. It makes clever use of mark bitmaps to enable incremental and parallel compaction. The basic problem is: once marking completes, the mark bitmap shows which objects are live and which are dead, but there’s no indication of where in the heap each live object should be moved to. Here is Compressor’s trick: first, divide the heap into small segments (on the order of 512 bytes). The volume of live data in each segment can be calculated in parallel by inspecting the mark bitmap. A parallel scan then quickly computes the starting point for the compacted data in each segment. Given this preprocessed offset table, the final position of each live object in each segment can be easily computed. In addition, the inverse mapping can be easily constructed, giving the from-space range that will map to each to-space segment. Unlike most other compaction algorithms, Compressor runs in one pass, without requiring extra per-object metadata such as forwarding pointers. One last trick: Compressor identifies dense blocks and doesn’t waste effort compacting them (just like MC²).
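
The offset-table trick, sketched in Go; the word-granularity mark bitmap and segment size are illustrative, and the per-word address map below stands in for what the real collector computes on demand from the offset table and mark bits.

package main

import "fmt"

const wordsPerSegment = 64 // illustrative; the real segments are a few hundred bytes

// newAddresses computes, from the mark bitmap alone, where each live word
// lands after compaction: a per-segment live count, a prefix sum giving each
// segment's start in to-space, then an offset within the segment.
func newAddresses(mark []bool) map[int]int {
	nSeg := (len(mark) + wordsPerSegment - 1) / wordsPerSegment

	// Pass 1 (parallelizable per segment): live volume in each segment.
	liveInSeg := make([]int, nSeg)
	for i, m := range mark {
		if m {
			liveInSeg[i/wordsPerSegment]++
		}
	}

	// Pass 2: prefix sum gives the first to-space slot of each segment.
	segStart := make([]int, nSeg)
	sum := 0
	for s := 0; s < nSeg; s++ {
		segStart[s] = sum
		sum += liveInSeg[s]
	}

	// Final positions: segment start + number of earlier live words in the segment.
	newAddr := map[int]int{}
	offsetInSeg := make([]int, nSeg)
	for i, m := range mark {
		if m {
			s := i / wordsPerSegment
			newAddr[i] = segStart[s] + offsetInSeg[s]
			offsetInSeg[s]++
		}
	}
	return newAddr
}

func main() {
	mark := make([]bool, 256)
	mark[3], mark[70], mark[200] = true, true, true
	for old, nw := range newAddresses(mark) {
		fmt.Printf("live word %d moves to slot %d\n", old, nw)
	}
}
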
The concurrent variant of the Compressor is based on virtual memory protection. After concurrently computing the offset tables, the to-space pages are protected, and the mutators stopped to have their roots point to the protected to-space pages. When the mutators resume, the resulting VM traps trigger incremental concurrent compaction. In terms of the tricolor invariant, unprotected tospace pages are black as a result of compaction. Protected tospace pages are gray, and fromspace is white. The mutator is gray when it operates in fromspace, so it can deal unobstructed with white objects; when its roots are redirected to tospace, the mutator turns black.
Scalable Garbage Collection
I learned the other day that Google’s Compute Engine lets you provision memory for virtual machines independently of core count. So if your two core/four GB VM is running out of memory, you can turn it into a two core/16 GB VM, for about… four cents an hour. Pretty nifty! But this leads to an obvious concern: if we increase our heap sizes (and live data) fourfold, won’t our worst-case GC pause time also rise along with it? Stop-the-world pauses would take minutes to hours (!!!) on very large heaps. That’s obviously pretty bad; what can we do to avoid it? Can we limit the GC’s maximum pause time and space overhead, independent of heap size and mutator behavior? Felix Klock (who now works on Rust at Mozilla!) studied this in his super awesome dissertation.
The key thing is: we mustn’t ever need to traverse the full heap all at once. Reference counting can satisfy this property, but needs tracing to be complete, and also suffers from fragmentation, meaning space overhead is mutator-dependent and difficult to bound. (Incidentally, Firefox’s GC implementation has evolved from stop-the-world to incremental to generational to compacting; the latter step was entirely to reduce fragmentation).
So to avoid fragmentation and give reasonable space bounds, we’ll need to move objects.
Well, OK, but some objects are super popular — a significant fraction of the heap is just copies of a pointer to that object. Wouldn’t moving those objects mean we need to traverse most of the heap to update the pointers? Indeed, but there are two ways out of that predicament. One is to add a layer of indirection with a read barrier; this is what most real-time collectors do. Klock had a different insight: you can just elect not to move the popular objects! After all, if you’re worried about objects that might take up, say, 1/8th of your heap, well, there can’t be more than 8 of those! Defining popularity in terms of max heap size is no bueno, though: we mustn’t let the size of the heap dictate our worst-case workload. Instead, the trick is to divide the heap into fixed-size regions, and define the popularity threshold as a multiple of the region size, rather than (a fraction of) the heap size. But the insight remains the same: your program doesn’t execute in Lake Wobegon, and not all of the regions can be more popular than average.
Having divided up the heap into regions, we must be able to collect a single region independently of the others; this is how Klock achieves scalability. Simple to state, but the details get pretty hairy; one false move and we lose scalability. To collect a region, we have to know what objects in it are referenced by other regions — a remembered set. BUT WAIT! There are two kinds of remembered set representations, points-into and points-out-of; the former has worst-case quadratic space usage, so remembered sets must be points-out-of. BUT WAIT! Even that’s not enough; the aggregate of all remembered sets can still scale with heap size, which would mean that collecting a single region would again involve scanning O(heap) data. The solution is to have a points-into summary set for each non-popular region; summary set sizes are bounded by the popularity threshold, which is independent of heap size. Ah, BUT WAIT! to construct a summary set, we must scan the full-heap remembered set! Now we get into issues of scheduling: first, amortization: each pass over the remembered set tries to summarize a fixed fraction of the heap’s regions (abandoning the popular ones); and second, we can be a bit lazy: summarization only needs to happen when we’re running out of regions with usable summary sets, and can end early when “enough” regions have been summarized. BUT WAIT! If we end early, the mutator can invalidate the summaries we so lovingly crafted. We can keep the summary sets up to date with write barriers (careful to keep the work O(1) amortized!). BUT WAIT! If summary sets grow too large they must be discarded, which means that the rate of collection must roughly keep pace with the rate of program mutation. BUT it also mustn’t go “too fast” or we’d jeopardize the minimum mutator utilization guarantee that scalable collection promises.
Well, I haven’t even covered all the interesting details, like the mark stack and snapshotting, but I hope you get the idea of why I think Klock’s work is an undeservedly-obscure tour-de-force. Experimentally, Klock shows that regional collection’s throughput is only a bit lower than a generational collector’s, even though worst-case pause times (and thus MMU) are greatly improved. The comparisons to other collectors illustrate many of the high-level tradeoffs between latency & throughput, and between guaranteed vs expected-case performance. For example, on one benchmark in a 2GB heap, the regional collector’s max pause time is 0.12s, while the G1 collector paused for 2.13s. But the G1 collector finished in 154s total, and the regional collector took 757s. The worst-case observed pause for the scalable collector on any benchmark was 240ms.
GC Scheduling
Klock’s work on regional garbage collection highlights that scheduling of garbage collection work versus mutator work is a fundamental issue. Such questions of scheduling appear in many places. The collector in Go 1.5 has a scheduling component (they refer to it as “pacing”) to balance heap growth and CPU usage. Chrome (that is, V8) explicitly tries to schedule garbage collection work during idle moments, so as to improve the browser’s responsiveness and reduce hiccups when playing animations (a soft-realtime task). And of course, much work on hard-realtime garbage collection centers around issues of scheduling and schedulability.
Hans Boehm published a tech report connecting scheduling to lazy reference counting. He observed that a consequence of deferred (“lazy”) RC is an increase in floating garbage, and presents an adversarial program that causes a lazy RC implementation to retain a quadratic amount of floating garbage. The solution he presents is to permit O(n) reclamation work to match an allocation of size n. This is strongly reminiscent of Klock’s observation that the collector must “keep pace” with the mutator.
Conservative Collection
OK, time for a quick detour on conservative collectors. I mentioned at the start of the post that garbage collection starts from roots: in particular, from registers and the stack. Often the compiler will emit data structures to help the runtime system figure out which registers and stack slots contain heap pointers, and which do not. Likewise, individual objects in the heap will be given a descriptor indicating which fields should or should not be traced. But sometimes people want to use GC with a compiler that doesn’t know anything about GC, like Clang or GCC. And it can be done, with a few caveats!
The core idea is to assume that anything that looks like a pointer is a pointer. This implies that conservative collectors can’t move objects, since they don’t know for sure whether they have an actual pointer or an integer that happens to look like a pointer. And even if it would be fun and easy, the collector shouldn’t go about tweaking the mutator’s hard-earned values.
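
The conservative test amounts to a range check; a Go sketch, where the heap is modeled as a list of allocated [start, end) ranges and the candidate words are made-up integers standing in for register and stack contents:

package main

import "fmt"

// span is an allocated [start, end) range of addresses.
type span struct{ start, end uintptr }

// looksLikePointer reports whether a word value falls inside some allocation.
// A conservative collector must treat such a word as a pointer and keep the
// allocation alive, even if the word is really just an unlucky integer.
func looksLikePointer(word uintptr, heap []span) (span, bool) {
	for _, s := range heap {
		if word >= s.start && word < s.end {
			return s, true
		}
	}
	return span{}, false
}

func main() {
	heap := []span{{0x1000, 0x1040}, {0x2000, 0x2100}}
	// Pretend these words were found in registers or on the stack.
	for _, w := range []uintptr{0x1010, 0x2100, 0x5555} {
		if s, ok := looksLikePointer(w, heap); ok {
			fmt.Printf("%#x keeps allocation [%#x,%#x) alive\n", w, s.start, s.end)
		} else {
			fmt.Printf("%#x is not a pointer into the heap\n", w)
		}
	}
}
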
One nifty trick is that conservative collectors can actually move objects in many circumstances. If the compiler provides just a tiny bit of information to the runtime about the layout of heap objects, you get a heap-precise design. Basically, pointers in registers or the stack are treated conservatively, but pointers within the heap can be positively identified, and thus moved/compacted/etc. Really, most of the issues and design decisions for non-conservative (aka fully-precise) collectors also apply to their conservative brethren. For example, Boehm-Demers-Weiser, the most well-known conservative collector, can run incrementally and generationally. And Ironclad C++ is an extension to Clang that supports heap-precise GC.
GC Case Study: Go
Early versions of Go used Boehm-Demers-Weiser. This brought two major problems. First, the conservative nature meant that Go programs on 32-bit machines would sometimes spuriously run out of memory. One widely-derided post on the golang-nuts mailing list suggested a workaround: stick to integer values below 10000. Heh. This situation improved with Go 1.1, which was heap-precise, and Go 1.3, which was fully-precise.
Second, multi-gigabyte heaps had multi-second stop-the-world pause times. With the release of Go 1.5, they implemented a concurrent mark-sweep collector, which reduced pause times from multi-second to a few milliseconds. For a more real-world take, Twitch published a blog post detailing how their observed pause times were reduced from multi-second with Go 1.2, to 200 ms with Go 1.5, to 100ms with Go 1.6, to 1ms with Go 1.7.
Note that Go’s garbage collector doesn’t deal with fragmentation. This may or may not be an acceptable engineering tradeoff for Go’s use cases.
On Bump Pointer Allocation
Many discussions about garbage collection eventually touch on the benefits of bump-pointer allocation. Intuitively, it seems like it should be much faster to just do an increment and conditional jump, rather than searching through a bunch of freelists. Case in point: the Hacker News discussion regarding the aforementioned Twitch blogpost. Surprisingly, this turns out to be a vastly smaller difference than most people would think. Measurements from Blackburn et al showed that allocation was < 10% of program runtime, and that bump-pointer allocation was only about 10% faster than freelist allocation; therefore the overall throughput advantage is about 1% of program runtime. That’s about the same performance change to be had from passing extra unused environment variables to your program.
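
For a sense of what is being compared, here are the two allocation fast paths sketched in Go; the integer "addresses" and the single first-fit free list are simplifications for illustration.

package main

import "fmt"

// bumpAlloc: the entire fast path is an add plus an overflow check.
type bumpAlloc struct {
	next, limit int
}

func (b *bumpAlloc) alloc(size int) (int, bool) {
	if b.next+size > b.limit {
		return 0, false // out of space: time to collect
	}
	p := b.next
	b.next += size
	return p, true
}

// freeListAlloc: search a list of free blocks for one that fits (first fit).
type block struct{ addr, size int }

type freeListAlloc struct {
	free []block
}

func (f *freeListAlloc) alloc(size int) (int, bool) {
	for i, blk := range f.free {
		if blk.size >= size {
			f.free[i] = block{blk.addr + size, blk.size - size}
			return blk.addr, true
		}
	}
	return 0, false
}

func main() {
	b := &bumpAlloc{next: 0, limit: 1 << 20}
	fl := &freeListAlloc{free: []block{{0, 1 << 20}}}
	p1, _ := b.alloc(64)
	p2, _ := b.alloc(16)
	q1, _ := fl.alloc(64)
	q2, _ := fl.alloc(16)
	fmt.Println("bump:", p1, p2, "freelist:", q1, q2)
}
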
There’s at least one study comparing the throughput of malloc/free versus garbage collection. The result found was that if you have a lot of free memory to work with (about 3x your live data), tracing GC is just as fast as malloc/free, but at smaller heap sizes, you spend a lot of time re-inspecting your live data to determine what’s dead. In effect: when free memory is limited, GC appears to have a soft failure case (slowing down) that malloc/free avoids. But that’s not really the whole story. These benchmarks are fundamentally batch workloads that run for anywhere from seconds to hours, then exit. But on the web, both servers and browsers can run continuously for days, months, or years at a time [4]. On such long-term workloads, there’s another factor that comes into play: fragmentation. Any non-moving collector (malloc/free, reference counting, and mark/sweep) will hit some amount of fragmentation. As a failure mode, fragmentation’s costs manifest in two ways. First, there’s a “global” program slowdown from increased effective working set & decreased locality; second, if fragmentation gets bad enough, allocation will eventually fail, and your program will crash (or, uhm, corrupt data and then crash).
So if you want to not crash, you need to size your heap as some multiple of your live data. How much fragmentation do you need to account for? There are a few research results on the theory side. First, Robson proved in the 1970s that the theoretical worst-case is logarithmic in the size of the maximum allocation. That’s pretty atrocious; even supporting piddly 1MB arrays means potential overhead of 20x! More recently, Cohen and Petrank showed that compaction can drop the needed overhead in (one particular set of) realistic circumstances to… 3.5x. Amusingly similar to the numbers cited by Hertz and Berger!
As for practice, there is a smattering of research with empirical measurements of memory fragmentation. Johnstone and Wilson showed that on a selection of 8 programs that allocated 27 MB on average, Doug Lea’s malloc implementation limited fragmentation to 4% in the worst case [5]. This is a good result to know, but I’m not confident it can be usefully extrapolated to long-running web servers or browsers, which (A) allocate several orders of magnitude more data, and (B) have nondeterministic allocation patterns, dependent on network and/or user input. Supporting this skepticism is a study from Bohra and Gabber measuring several malloc implementations on a long-running caching web proxy. They found that fragmentation varied by allocator from 30% to heap exhaustion (dlmalloc clocked in at 80% overhead). They also noted that the impetus of their study was noticing that their heap was several times larger than their live data. Bacon, Cheng, and Rajan echo this limitation in their summary of J&W. They also show that fragmentation in SPECjvm98 can reach nearly 40%. More recently, Jula and Rauchwerger benchmarked larger, more intensive workloads on a few different allocators, finding that (A) dlmalloc had 10.8% average and 23.5% worst case fragmentation, and (B) faster allocation came at the expense of more fragmentation.
Real World Fragmentation
I was (and am!) curious about how much of a problem fragmentation is in practice. I’ve seen lots of statements acknowledging that fragmentation is a big enough problem to be worth avoiding (examples: Mozilla’s Stuart Parmenter in 2007, Huey, Coppeard & Nethercote 2011-2016, Facebook in 2011, Aerospike in 2015, suniphrase in 2015, Deep in 2016, Jonathan Blow in 2016, gperftools#756).
Most of the time, people don’t provide much in the way of hard data on what sort of fragmentation they’re seeing — not even heap sizes, much less allocation size distributions. When people do provide numbers, they tend to be eye-opening. Aerospike reported that “big-memory (192 GB RAM) server nodes were running out of memory and crashing again within days of being restarted.” In 2012, Kirill Korotaev exhibited a ~20 line program, distilled from a real app, that caused 20x (!) fragmentation overhead in glibc’s allocator, needing 1.8 GB for a 93 MB live set.
Avoiding the space overhead of fragmentation can have significant time costs, even for systems that you don’t usually think of as doing garbage collection. For example, Julia Evans hit a case where the Linux kernel spent 30% of her program’s runtime compacting pages. CloudFlare also has a blog post describing how a form of GC in the Linux kernel for network packet metadata caused hard-to-reproduce pauses of 30s for requests that usually finished in milliseconds. Finally, GCC also has custom infrastructure for GCing the compiler’s allocations; I haven’t measured its usage directly.
TLB Reach
Did you know that modern large-memory workloads can suffer double-digit performance overhead from virtual memory? (studies claiming (mean // max): (9 // 18)% overhead for 8 GB of RAM, (? // 16)% overhead for 16 GB, and (?6? // 53)% for 96 GB using 2 MB pages, and (18 // 40)% for 4 GB VMs on a 24 GB server) The middle study notes that traditional benchmark suites, like SPEC and Parsec, have small working sets that don’t stress the TLB — unlike modern servers! Also, these days you can buy servers with multiple terabytes (!!) of physical memory. There is active research on reducing TLB performance overhead, but it’s anyone’s guess when — or if — such schemes will be available on commodity hardware. On the plus side, virtualization doesn’t appear to significantly increase TLB overhead. Finally, here’s a case where Linux’s support for automatically using large pages caused a 10x performance regression.
Pushing the Envelope
We’re getting close to the end, I promise! Let’s wrap up with brief descriptions of a few interesting collectors that have pushed the boundaries of real-time and concurrent collection.
A fundamental problem faced by all garbage collectors is controlling fragmentation. Most non-real-time collectors take one of two approaches:
  1. Ignore the issue by not moving objects, hoping that the mutator doesn’t get unlucky and induce fragmentation that free-block-coalescing cannot handle; or
  2. Move objects, but in doing so, trade space- for time-problems (in a sense) and accept that the copying/defragmentation process may need to block the mutator while it clears out a fragmented section of the heap.
The Schism collector is one of a small set of real-time collectors that takes a different tack. It eliminates external fragmentation by using fixed size chunks for all allocation requests. Objects larger than the chunk size are represented with a linked list of chunks. Arrays are represented with two layers of indirection, as a fixed-size object header pointing to a variable-length spine which contains pointers to the fixed-size chunks holding the array elements. And in the common case where contiguous space is available, Schism can use it for arrays, thus avoiding the overhead of spine indirections. Object fragments and array data chunks are managed in a non-moving, concurrently-collected Immix-structured heap.
Array spines are the only variable-size allocations, and thus the only possible source of fragmentation. But here’s a cool trick: since the data blocks never move, the pointers in the spine never change, and the spine is thus immutable. This in turn means that when spines are copied, the mutator is free to use either copy. Furthermore, since the mutator is forced to indirect to the spine through a so-called sentinel object, updating the mutator takes only a single compare-and-swap.
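
The array layout, sketched in Go; the chunk size, the field names, and the use of Go slices for the spine and chunks are invented for illustration.

package main

import "fmt"

const chunkSize = 8 // elements per fixed-size chunk; illustrative

// chunkedArray is a Schism-style array: a fixed-size header pointing to a
// variable-length spine, whose entries point to fixed-size data chunks.
type chunkedArray struct {
	length int
	spine  [][]int // effectively immutable once built: data chunks never move
}

func newChunkedArray(length int) *chunkedArray {
	n := (length + chunkSize - 1) / chunkSize
	spine := make([][]int, n)
	for i := range spine {
		spine[i] = make([]int, chunkSize)
	}
	return &chunkedArray{length: length, spine: spine}
}

// get indirects through the spine: element i lives in chunk i/chunkSize.
func (a *chunkedArray) get(i int) int {
	return a.spine[i/chunkSize][i%chunkSize]
}

func (a *chunkedArray) set(i, v int) {
	a.spine[i/chunkSize][i%chunkSize] = v
}

func main() {
	a := newChunkedArray(20)
	a.set(13, 42)
	fmt.Println(a.get(13)) // 42
}
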
The benefits provided by this scheme are:
  1. Bounded fragmentation: Schism will never need more than 31.04% extra space beyond that required for live data (this does not account for internal fragmentation).
  2. Constant time wait-free heap access for the mutator.
  3. Sub-millisecond pause times on a 40 MHz processor
  4. Excellent throughput for a real-time collector, only 1/3rd lost versus the desktop Hotspot 1.6 JVM.
Since Java doesn’t support interior pointers, neither does their collector, although I believe that interior pointers for object fields would be an easy extension. Support for interior pointers within arrays would be trickier but still doable, at least in certain circumstances.
To the best of my knowledge, Schism has never been publicly benchmarked on an x86 or ARM CPU, nor on a multi-gigabyte heap. But I’m very curious to know how it would do in those situations! One last cool bit: the core of Schism’s collection algorithm has been verified against the x86-TSO memory model in the Isabelle/HOL proof assistant.
Block-Free G1
Building on their earlier work on lock-free concurrent compaction, Erik Österlund and Welf Löwe added a syscall to let the GC introspect on the OS scheduler’s thread state, which allows them to perform critical operations — stack scanning and copying — without ever blocking mutator threads. Memory management for concurrent data structures with progress guarantees is a hard open problem, and this line of work makes a significant step towards a solution. Their modifications to OpenJDK’s G1 collector reduced 99.99th percentile latency from 174ms to 0.67ms for one of their benchmarks. Overall, average GC pause time dropped 8-fold, at the cost of 15% degraded throughput.
One important point they make is that GC algorithms which use virtual memory (mprotect) can’t be non-blocking, because the page tables can only be modified when the kernel holds a non-preemptible lock. I don’t have much intuition for the costs of page table modifications. It would be very interesting to see what impact that source of blocking would have on an otherwise non-blocking GC design. Of note: Pauseless, C4, Collie, and Concurrent Compressor all use mprotect. Shenandoah eschews mprotect to avoid read storms.
Shenandoah is an effort to produce an open-source, industrial strength low-pause-time garbage collector, as part of the OpenJDK project. Broadly speaking, it aims to be a fully-concurrent and -parallel mark-copy collector, using both read & write barriers to enable concurrent compaction. Christine Flood recently published a paper on Shenandoah; there’s also a smattering of blog posts and slide decks from presentations. Shenandoah is available in the default JVM distributed with Fedora 24. It would be very interesting to see a head-to-head comparison of Shenandoah with Österlund & Löwe’s OS-dependent G1 variant.
Pauseless and C4
In 2005, Azul Systems published a paper detailing the garbage collection algorithm, and associated support from their custom CPU and OS, that allowed them to obtain low pause times (21ms) on very large (50+ GB) heaps. The state of the art until the early 2000s was 10-20% maximum runtime overhead from software read barriers, with average-case overhead of roughly half that. But Azul’s hardware-supported read barrier has an average cost of one cycle(!), and overall mutator utilization was reported at 98%.

In contrast to Compressor, which protects (all) tospace pages to trigger evacuation, Pauseless protects fromspace pages (incrementally). Pauseless continuously runs a concurrent and parallel mark phase (like G1), which generates mark bits and computes forwarding information (on-the-side). Using residency statistics, Pauseless protects sparsely occupied pages. Such pages are evacuated, and the underlying physical memory immediately freed. There will still be references in the remainder of the heap pointing into evacuated pages; the mutator and collector cooperate to find and correct such pointers.

When the mutator loads a reference to an unmarked object, a hardware trap notifies the marker threads. When the mutator loads a reference to an object on a protected (and thus, likely evacuated!) page, the forwarding tables are queried to find the fresh copy, and the memory containing the original stale reference is updated (the paper refers to this as “self healing”). The collector finishes a GC cycle by traversing the entire heap to find any remaining pointers into fromspace, and fixing up the stragglers. Now, having established the invariant that all pointers in the heap point into tospace, the virtual memory for fromspace pages can be reused.

C4, the Continuously Concurrent Compacting Collector, adds generations to Pauseless. Unlike most generational designs, the nursery is not managed with a stop-the-world copying collector. Rather, both the nursery and mature space are essentially separate Pauseless heaps, with a precise card marking remembered set. They added new Linux system calls to support batched and non-TLB-flushing remappings, yielding order-of-magnitude speedups versus the stock kernel. The upshot is that C4’s throughput matches ParallelGC in Java while maintaining small-in-practice pauses.
On the subject of Azul’s port to x86, Cliff Click said:

Azul’s GC required a read-barrier (some instructions to be executed on every pointer read) – which cost something like 5% performance on an X86. In exchange max GC pause time is something in the low microsecond range

and noted that they didn’t make use of any novel x86 features:

No change to the X86, instead user-mode TLB handler from RedHat allows ptr-swizzeling, that plus some careful planning and the read barrier fell to 2 X86 ops – with acceptable runtime costs.

The core idea behind Collie, which builds on Pauseless and C4, is to use hardware transactional memory to atomically and incrementally compact objects and update all incoming pointers. Obviously this cannot be done for popular objects, so popular objects are temporarily pinned, with the mutator and collector collaboratively updating popular references. There are of course many more details covered in the paper.
Like the other collectors from Azul, Collie is an engineering marvel. It pulls out pretty much every trick in the book: read barriers, write barriers, non-global checkpoints, partial conservativism, virtual memory and HTM support. Like all good engineering, Collie makes careful tradeoffs, in exchange for desirable results: Collie delivers an absolutely staggering 80-90% MMU (!) at 1ms resolution (!!) for heap sizes up to 30 GB (!!!) when continuously collecting. One issue I wish the paper had covered more was the effective space overhead of their per-object remembered sets; it’s unclear whether the space overhead for their MMU benchmarks was 50%, or 400%, or something completely different.


[0] Seven thousand words is still a “whirlwind,” right?
[1] There’s one more quirk of mutator utilization, which is that it can be “inflated” by making the mutator less efficient. This can occur for two reasons. First, different language implementations have different “baseline” levels of generated-code quality, so the same GC implementation will appear to have a better MMU for a slower language. Second, the overhead of high-resolution measurement with tools like Pin can introduce similar effects. But in practice that’s not an issue most people need to worry about.
[2] To better match the human intuition that the mutator’s utilization doesn’t actually get worse at larger intervals, we can instead focus on bounded mutator utilization (BMU). Numerically, BMU at timescale x is the worst-case MMU at any timescale x or larger. Graphically, BMU is the monotonically increasing graph you get from MMU with the “peaks” flattened out.
[3] It’s common to use memcached as a caching filter in tree-structured RPCs within a datacenter. This pattern of usage means that tail latency at nodes very quickly becomes median latency at the root of the tree, which is why it’s critical to bound worst-case latency at the nodes and not just settle for good median or average latency. When you want sub-millisecond level QoS, even the OS scheduler might need to be re-examined…
[4] Here’s a real story of a long-running server that hit a GC failure case. Interestingly, the problem wasn’t fragmentation, but rather a generational collector encountering object demographics that didn’t match the generational hypothesis.
[5] They also point out that there’s no single definition for how to measure fragmentation, and highlight a few potential definitions, each with various shortcomings.

Adblock History and an Unhatched Easter Egg

The first browser I ever used was Netscape Navigator, and that allegiance continued over the years with the Mozilla Suite, then Camino, then Phoenix/Firebird/Firefox. Thinking about Camino makes me both happy and sad. It was far ahead of its time in terms of user experience, but it stagnated due to lack of attention, as its developers were plucked away to work on Safari or Phoenix. In 2002, it was the browser of 2005; but in 2008, it was still the browser of 2005.

Anyways, I digress. Around 2004, when I was in high school, I was a user of Firefox nightly builds. This was before the existence of extension versioning or AMO; the closest equivalent to AMO was mozdev, but that’s a bit like comparing the App Store to github. The mozillazine forums were the watering hole for users, extension developers, and browser core devs alike. At one point, a nightly build broke my favorite plugin, Adblock. After a shockingly long time (two whole days!) went by without a fix by the extension’s author, I decided to poke around and see if I could hack together a fix myself. Well, it worked; after a few more such tweaks, I wound up on an IRC channel with the project’s then-current maintainer/developer, who called himself “rue.” We spent most of the next year collaborating on the Next Big Release of the extension.

Unfortunately, we got sucked into Second System Syndrome, without having had the benefit of having fully built the first version ourselves. Basically, every rule of thumb that Joel Spolsky wrote about, we violated (unintentionally). We tried to rewrite the technical infrastructure, while simultaneously adding many ambitious new features. We were aiming for an Apple-style Grand Curtain Drop on a shiny new toy. Tempting, but foolish. While we were toiling away in secret, pressure was mounting for something, anything, to improve upon the actively-bitrotting public version. Eventually, a previous developer of Adblock, Wladimir Palant, returned to the scene and forked the project to create Adblock Plus.

One interesting side effect of an extremely long development cycle is that it allows for meandering diversions from the yellow brick road. The other developer, rue, found that Firefox extensions could alter the requested URL without having the alteration appear in the browser’s address bar. This held the potential for a great deal of mischief. On the more innocuous end, an April Fools Day prank might have reversed the URLs for and, or NFL/NHL, etc.  On the less innocuous end, rue prototyped a module that would, randomly with small probability, replace Amazon referral codes. It would have been the Office Space penny-shaving scheme, more or less. I objected on grounds of skeeviness and/or illegality, but it was a moot point, since we never released a completed version.  Adblock Plus ended up taking a bolder, more direct approach to monetization.

Anyways, while rue was busy poking at Firefox internals and de-obfuscating Gmail’s client-side code, I amused myself by devising an easter egg, and trying to conceal its presence in a non-obvious way. When it comes to such things, Firefox extensions have a distinct advantage over traditional software: they can put UI on a web server, so that a review of the extension’s source will see, at most, an innocuous-looking URL. While normal content served over the web is sandboxed, extensions can collude with a page to do things that a normal page cannot do. So I made a little recreation of the “Wake up, Neo…” scene from the Matrix.

Here is the easter egg in action; you’ll need to view it fullscreen to see the text:

If you have the sound up high enough, you’ll hear a clicky noise when each green character is “typed.” It’s actually the same noise used in the original movie! That’s called craftsmanship.😉  One other subtle point: Adblock’s UI is in the lower-right corner, as a statusbar item. The modern convention of using toolbar logo buttons wasn’t yet established at that time.

Had anyone used a packet sniffer or a lucky guess to figure out the URL for the easter egg and visited it directly, they would have seen this instead:


The easter egg also altered Adblock’s entry in the Extensions list, changing the version from 0.6 to “The 1.0”, replacing the crosshair-style logo with an image of Agent Smith, and altering the description to include parodies of quotes from Agent Smith and Morpheus.


Final useless random historical fact: Adblock was the 10th extension added to AMO. I have no idea how many there are now, probably in the mid thousands.

Email Spelunking

Going through some old emails, it’s been interesting to dip into the not-so-distant past. Unfortunately sent messages were very sparsely archived, so I have only the incoming half of most email conversations. And sadly, all my email before 2002 appears to have been lost or corrupted. Oh well.

A few bits and pieces:

  • Xbox Live beta emails in Oct 2002 – “You still could get in!”
  • iTunes Music Store launches on Apr 29 2003: “Your music. Your way.”
  • Emails about the beta for the (recently defunct) Unison client by Panic.
  • Emails from, before they switched to (I joined in June 2005 – soon to be ten years!).   As late as August 2005, your Wall was a single mutable text field, so emails said “X has changed your wall.” By September, the terminology was changed to “X has written on your wall” as the list-of-posts model was adopted. Techie bonus: mail was sent with a user agent header of “ZuckMail”. Also, Wirehog.
  • At one point, I purchased the domain ILBEN.COM… so that I could write my email address as   …………… facepalm
  • Registered on June 10, 2002.
  • I seemed to have sent a lot of email to random internet folks:
    • Adam Iser, about how much I liked Adium
    • Dan Gillmor, about using Disk Copy to make disk images from DVDs
    • Nick Zitzmann, about adding Applescript support to Battlegram (a chat client for
    • Damien Gallop, about using RDC instead of VirtualPC…???
    • Rob Griffiths, about featuring LaunchBar as a Product of the Week on
    • I distinctly remember writing to Andy Ihnatko but those emails appear to have been lost to the sands of time.
    • Norbert Heger, about his experiences as an independent developer producing LaunchBar, to which he responded: 

      Success as a shareware developer depends on so many different factors, so it’s a bit difficult to give a general answer. You need a good idea, a good implementation, good PR and marketing, good support, etc. Failing in just one of these areas can make a huge difference in terms of success. The only advice I can give is to try it yourself.

  • Correspondence with Nigel McFarlane. Extremely bittersweet for me: he bravely and generously took a chance on me to write a few pages and get my name in a physical book… and committed suicide three months after it was published. Even though I barely knew him, the sense of loss and pain hasn’t diminished in the last ten years.
  • A handful of emails from my uncle, mostly followups to spirited arguments we had over things like whether the world would be better off if every television set vanished into the ether. In retrospect, seeing what I wrote makes me cringe – I was an ass, and being fifteen isn’t much of an excuse. I didn’t notice/appreciate it at the time, but with twelve years of elevated perspective, there was a great deal of wisdom (of the life-experience variety) shining through his words. But he understood what I did not:

    There are, unfortunately, some things that take years to be put into perspective, and even being extraordinarily bright provides no shortcut to such insights nor any guarantee of happiness.

On LLVM’s GC Infrastructure

LLVM’s documentation suggests that its getelementptr instruction is one of the most commonly misunderstood aspects of the system. In my experience, GEP is only number two on the list; number one is LLVM’s GC support: what it provides, what it doesn’t provide, and how to use it effectively.

The first two are the easy ones: LLVM does support stack maps, read and write barriers, fully-precise collection, and relocating collectors. It does not require a shadow stack, though it does provide one, which is helpful for first-time GC clients to get started with. LLVM’s overall GC design supports concurrent collectors and mutators, but the implementation lacks a few key ingredients to minimize the performance impact of GC in such situations.

There are a few missing features on the implementation side which obviously don’t need major design overhauls to fix. The big three on my list are code emission at safe points, safe points on back edges, and support for register-allocation of GC roots.

The last one is sort of interesting, because I’ve seen several proposals to change the design of LLVM’s GC infrastructure in a major way to obtain it. The reasoning goes like this: we currently have llvm.gcroot, which marks a stack slot address as being a GC root. But we want to store roots in machine registers, not stack slots, so we need a new intrinsic to annotate an SSA register as being a GC root as well. With such an intrinsic, LLVM would automatically store and load SSA registers to/from stack slots “behind the scenes.”

So, for example, suppose we start off by acquiring a GC-traced pointer:

%root = alloca (gcroot)
%p = … allocate a pointer …
store %p into %root

If we had a non-copying collector, we could get away with code like this:

%r = … maybe trigger GC…
… use %p …

But with LLVM’s current GC support, a copying collector must write this less-legible IR instead:

store %p into %root
%r = … maybe trigger GC…
%p2 = load %root
… use %p2 …

When repeated over and over, it definitely makes the IR harder to read. Wouldn’t it be nice if LLVM would automatically do all that stuff for you? Like this:

… (no alloca needed!)
%p = … allocate a pointer …
call llvm.gc.reg.root(%p)
%r = … maybe trigger GC…
… use %p …

It’s a natural desire; the first time I encountered LLVM’s GC support, I wanted the same thing. But it’s a wolf in sheep’s clothing. Remember: the fundamental property of LLVM is that an SSA register is an immutable value. Not just “observably equivalent as far as my high-level program can see,” but equivalent under any observation possible at the IR level!

The problem is that under the hypothetical gc.reg.root scheme, LLVM is free to do constant propagation before the rewrite to the verbose version, and if it does so, the program’s semantics will have changed! Marking %p as a register root transmutes it from a constant to a variable. But in doing so, we’ve taken the key benefit of LLVM’s SSA representation (the power it gets thanks to immutable bindings) and thrown it right out the window!


Take another look at the “verbose” pseudo-IR listed above. The common interpretation of this snippet is “spill %p into a stack slot, perform some operation, and then reload %p from the stack slot.” This is reasoning about the IR according to its current implementation. But I’d argue that it’s better to think about the abstract semantics of the IR. Abstractly, we’re just encoding the fact that the value we think of as %p might change at some point in the IR; beforehand, we called it %p, and afterwards, we called it %p2. Since phi nodes aren’t really viable for most frontends, the only way LLVM allows us to encode mutability is via memory – i.e. alloca. But the real point isn’t to set up stack slots; it’s to reflect the pointer updates that the GC might be making. As long as LLVM can verify that the “stack slot” address doesn’t escape, it is free to back that mutable storage by symbolically addressed memory: namely, machine registers! It won’t do that analysis now, but it could. That’s the point: no major design tweaks are needed to get register allocation of GC roots, just a bit of extra analysis internal to LLVM.

A separate question is how much of a difference this would make in practice. Effectively, the lack of register allocation means that GC roots behave like values kept in caller-saved machine registers: they must be spilled to the stack across every potential GC point. Maybe that’s a performance-killer, or maybe it’s a wash. To the best of my knowledge, nobody has ever carefully looked at the impact of naïve versus varying degrees of sophisticated treatment of GC roots in LLVM-compiled code.


Stepping back even more, I think the reason there’s so much confusion about LLVM’s GC support is that there’s a philosophical impedance mismatch with the rest of the project. Elsewhere, LLVM tries hard to let an unsophisticated frontend get code quality equivalent to that of a more sophisticated frontend. The canonical example is that with mem2reg, a frontend needn’t worry about SSA conversion or dominance frontiers or anything like that. In general, LLVM’s philosophy is to move as much work as possible from the frontend to the common LLVM infrastructure. But LLVM’s GC support is different. There’s no automatic transformation from “dead simple” to “efficient” use of GC primitives. Instead, it’s up to each client to do whatever analysis is needed for the most efficient use of the existing GC-related primitives. In practice, this means that clients need to do data-flow analysis if they want to reuse dead root slots. This is in stark contrast to the negligible level of Dragon-book sophistication otherwise expected of a frontend.


In terms of how to use LLVM’s GC support effectively, I’ve found it clearest to think of the process as a transformation between separate intermediate languages. The goal of the transformation is to add a minimal selection of root slots, along with just enough loads and stores to preserve correctness. The crucial invariant is that at each potential GC point, the live roots must reside in (abstract) memory. So: backwards analysis to determine liveness, and a forwards rewrite to add the loads and stores. Interestingly, this combination of a backwards analysis with a forwards program rewrite is not well-supported by existing dataflow tools or their underlying theoretical frameworks. Anyways, to implement the given invariant, you’ll need to know which types require tracing, and which operations may end up invoking the GC. You can sort of encode this information via metadata and pointer address spaces, if using LLVM IR as your source IR, but I think it’s easier to think of the two IRs as being completely separate.
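To make that invariant concrete, here is a minimal sketch of what the target IR might look like around a single GC point. This is not code from any particular frontend: the runtime functions @allocate and @maybe_collect and the GC strategy name "mygc" are hypothetical, and the exact load/store syntax varies between LLVM versions.

define i8* @example() gc "mygc" {
entry:
  %root = alloca i8*                              ; abstract mutable memory for one root
  call void @llvm.gcroot(i8** %root, i8* null)    ; mark the slot as holding a traced pointer
  %p = call i8* @allocate()                       ; acquire a GC-traced pointer
  store i8* %p, i8** %root                        ; the live root resides in memory...
  call void @maybe_collect()                      ; ...across the potential GC point
  %p2 = load i8*, i8** %root                      ; reload afterwards; the GC may have moved it
  ret i8* %p2
}

declare void @llvm.gcroot(i8** %ptrloc, i8* %metadata)
declare i8* @allocate()
declare void @maybe_collect()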

Date-Preserving Patch Folding with Mercurial Queues

Mercurial Queues are a powerful tool for organizing changes into logically coherent commits. My current workflow often goes something like this:

  1. Make some changes implementing feature A (as a patch).
  2. Generate a series of patches moving on to feature B.
  3. Realize, after working on feature B, that feature A needs a small tweak, or a better commit message.

The relevant mq commands will get the job done in a jiffy. However, there’s a slight downside: after all that qpushing and qpopping, the patches in the series will lose their original dates, which means that qfinish will produce a tightly clustered series of patches that don’t actually reflect the true chronology of their development.

Fortunately, Mercurial provides the tools needed to preserve a meaningful chronological order. We’ll exploit hg log’s configurable formatting to make it generate a shell script that’ll set the dates for our patch series automatically. A pretty little hack, no? Here are the relevant incantations, assuming you want to modify a patch named P with revision number N:

  1. hg log -r N:tip --template "hg qref -d '{date|date}'; hg qpush;\n" > fix-dates.sh (the script name fix-dates.sh is arbitrary; a sample of the generated script appears after this list).
  2. Run hg qgoto P, then make whatever changes you need.
  3. Run bash fix-dates.sh, which will restore the original dates for all patches in the queue following the modified one.
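For example, if three patches sit at or above P in the applied series, the generated fix-dates.sh would contain one line per patch, along these lines (the dates shown here are invented for illustration):

hg qref -d 'Mon Jun 13 10:12:01 2011 -0400'; hg qpush;
hg qref -d 'Tue Jun 14 09:33:15 2011 -0400'; hg qpush;
hg qref -d 'Wed Jun 15 18:05:42 2011 -0400'; hg qpush;

Each line re-stamps the currently topmost patch with its original date and pushes the next one; the final hg qpush simply reports that the series is already fully applied.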

Agrajag Rediscovered

Going through the files on an old family computer, I discovered a long-lost email exchange with Douglas Adams:

Subject: Re: Agrajag
Date: Fri, 11 Jun 1999 17:52:48 -0700
From: "Douglas Adams" <askdna@TDV.COM>
To: Martin Karel <karel@BELLATLANTIC.NET>

Irene –
I’m pretty bad at planning what’s going to happen next week. Alex is exactly right.

Douglas Adams
The Digital Village

Please – help save the last 650 mountain gorillas

> Hi,
> Please settle a debate between me and my brother. I held forth that
> you must have already had Agrajag in mind when you wrote the bit about
> the pot of flowers that, when falling, said "Oh no, not again!" But
> Alex thinks that you probably just randomly stuck that line in there,
> and came up with Agrajag later and it just happened to fit. If I’m
> right, then we are very impressed with your degree of planning ahead,
> seeing as how it was four years between the publishing dates of
> "Hitchhiker’s Guide" and "Life, the Universe, and Everything."
>            – Irene Karel,
>               an adoring fan

Perhaps worth noting: the saved text file was 1337 bytes.

Firefox Add-On Compatibility

It looks like Firefox is moving to a compatible-by-default model for extensions. I find this interesting, since I proposed this back in 2004 to no avail.

To be clear, I don’t mean to say that compatible-by-default would have been the right choice in 2004… Predictions about the past are almost as hard as predictions about the future, and at least two major factors have changed since then. First, the extension ecosystem is more stable now than it was in the Firefox 1.0 timeframe. And second, Firefox’s rapid release cycle exacerbates spurious compatibility problems, making this proposal significantly more attractive.

The Lessons of Lucasfilm’s Habitat

We were initially our own worst enemies in this undertaking, victims of a way of thinking to which we engineers are dangerously susceptible. This way of thinking is characterized by the conceit that all things may be planned in advance and then directly implemented according to the plan’s detailed specification. For persons schooled in the design and construction of systems based on simple, well-defined and well-understood foundation principles, this is a natural attitude to have.

I wonder what Intel is up to?

“A proposed long term solution: a new functional language that features implicit parallelism, dependent typing, and an effects type system”

“We are also working with an ISV on a compiler for a parallel functional language
– Strict, but not call by value for implicit parallelism (maybe with lightweight annotations)
– Arrays and array comprehensions for data parallelism
– Effects system to contain impure features
– Atomic for state updates”

“For five years Intel’s Programming Systems Lab (PSL) has been collaborating with an external partner on a new functional programming language designed for productivity on many-core processors. While the language is not yet public, this talk outlines motivations behind the language and describes our experiences in implementing it using a variety of functional languages. The reference interpreter is written in Haskell and compiled with GHC while PSL’s performance implementation is written in SML and compiled with Mlton. We have also generated Scheme code compiled with PLT Scheme as part of a prototyping effort.”

A Small Example Illustrating Why Unification-Based Type Inference Is Not Always The Most User-Friendly Choice

let base = 10
let bitsPerDigit = logBase 2.0 (fromIntegral base)
let activeBits = fromIntegral $
  ceiling (bitsPerDigit * (Prelude.length "123"))

There's a missing fromIntegral above, which causes GHC to underline logBase and complain that it has no instance for (Floating Int). This is, on the face of things, completely mystifying. Both parameters are “obviously” floating! A second error points to ceiling, saying no instance for (RealFrac Int).

The second error is the more suggestive one, since the problem is that the type of * causes GHC to equate the type of bitsPerDigit with that of (Prelude.length "123"), i.e. Int. But it’s more than a little strange to force the programmer to reason about their code in the same order that the type checker does!
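For reference, one way to placate the type checker is to convert in the other direction: turn the Int length into a floating-point value before the multiplication, so that bitsPerDigit is never forced to be an Int. A sketch of the fixed binding:

-- Converting the length (an Int) to a Floating type before (*) keeps
-- bitsPerDigit a Double, so the Floating and RealFrac constraints are satisfied.
let activeBits = fromIntegral $
  ceiling (bitsPerDigit * fromIntegral (Prelude.length "123"))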

Comment on Comments


if (vars.empty()) { 
  // Store null env pointer if environment is empty 
      /* isVolatile= */ false);



if (vars.empty()) {

Citation Chains

Here’s a fun game to play on Google Scholar:

  1. Search for a relatively old paper in your field of choice. It will, presumably, be the top result.
  2. Note the number N in the “Cited by N” link under the top result’s abstract. Also note the year of publication.
  3. Click the “Cited by N” link.
  4. GOTO 2.

For example, here’s what results from searching for Luca Cardelli’s paper A polymorphic lambda-calculus with Type : Type:

Paper Name | Year | Cited by | Citations per year
A polymorphic lambda-calculus with Type: Type | 1986 | 72 | 4
The calculus of constructions | 1986 | 1053 | 44
A framework for defining logics | 1993 | 1155 | 68
Proof-carrying code | 1997 | 1873 | 145
Xen and the art of virtualization | 2003 | 2375 | 340
Globus toolkit version 4: Software for service-oriented systems | 2006 | 1179 | 295
A taxonomy of data grids for distributed data sharing, management, and processing | 2006 | 175 | 88
A toolkit for modelling and simulating data Grids: an extension to GridSim | 2008 | 38 | 20
Service and utility oriented distributed computing systems […] | 2008 | 11 | 4


It’s sort of nifty to see the “arch” of citations, and citations/year, over time.

Other interesting searches:

  • program slicing
  • featherweight

Not very interesting:

  • grain elevators

Coroutine Performance

I was curious this afternoon about how fast (or slow) coroutines can be. So I wrote a small microbenchmark in C, using libcoro’s handwritten-assembly implementation of stack switching for symmetric coroutines.

The benchmark builds a complete binary tree of configurable size, then visits it twice. First, a conventional recursively-written function takes the tree root and a function pointer, and calls the function pointer to process each value stored in the tree.

The benchmark next re-visits the tree with the iteration encapsulated by a Lua-style asymmetric coroutine class, which yield()s tree node values back to the main routine.
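To make the setup concrete, here is a rough sketch of the conventional traversal (this is not the benchmark’s actual code; the node and visit_fn names are invented). The coroutine version runs the same recursion on the coroutine’s own stack and replaces the callback invocation with a yield() back to the consumer.

#include <stddef.h>

/* A binary tree node holding one value. */
typedef struct node {
    int value;
    struct node *left, *right;
} node;

/* Callback invoked once per visited value. */
typedef void (*visit_fn)(int value);

/* Conventional in-order traversal: one ordinary call/return per node.
 * In the coroutine variant, f(n->value) becomes a stack switch that
 * yields the value to the main routine and later resumes here. */
static void visit_recursive(const node *n, visit_fn f) {
    if (n == NULL) return;
    visit_recursive(n->left, f);
    f(n->value);
    visit_recursive(n->right, f);
}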

One interesting operational consequence of doing the recursive visiting inside the coroutine is that the main program stack remains at constant depth for every processed node. This constant-stack-space property is also an effect of CPS transformations – not surprising, since both continuations and coroutines are general control abstractions.

Anyways, I found that when compiled with -O2 (gcc 4.4.3) and run on a Core 2, the overhead of using resume()/yield() compared to call/return was about 110 cycles per yield(). That cost is on par with loading a single word from RAM. Stack switching is an impressively efficient way of implementing a feature that lends significant flexibility and power to a language!

ANTLR Grammar Tip: LL(*) and Left Factoring

Suppose you wish to have ANTLR recognize non-degenerate tuples of expressions, like (x, y) or (f(g), a, b) but not (f) or (). Trailing commas are not allowed. The following formulation of such a rule will likely elicit a complaint (“rule tuple has non-LL(*) decision due to recursive rule invocations reachable from alts 1,2; …”):

tuple : '(' (expr ',')+ expr ')' ;

ANTLR’s error message seems unhelpful at first glance, but it’s trying to point you in the right direction. What ANTLR is saying is that after it reaches a comma, it doesn’t know whether it should expect expr ')' or expr ',', and parsing expr could require looking at an arbitrary number of tokens. ANTLR recommends left-factoring, but the structure of the rule as written doesn’t make it clear what needs to be factored. Re-writing the rule to make it closer to ANTLR’s view makes it clearer what the problem is:

tuple_start : '(' tuple_continue;
tuple_continue
    :    ( expr ',' tuple_continue
    |      expr ')');

Now it’s more obvious where the left-factoring needs to be applied, and how to do it by writing expr once, leaving only the choice of seeing a comma or a close-paren:

tuple_start : '(' tuple_continue;
tuple_continue
    :    expr ( ',' tuple_continue
              | ')'  );

The rewritten expansion corresponds to an alternative structuring of the tuple rule that will satisfy ANTLR:

tuple : '(' expr (',' expr)+ ')';

The ANTLR wiki has a page on removing global backtracking from ANTLR grammars, which has a nice presentation of how to do left-factoring when the opportunity is obvious.