On LLVM’s GC Infrastructure

LLVM’s documentation suggests that its getelementptr instruction is one of the most commonly misunderstood aspects of the system. In my experience, GEP is only number two on the list; number one is LLVM’s GC support: what it provides, what it doesn’t provide, and how to use it effectively.

The first two are the easy ones: LLVM does support stack maps, read and write barriers, fully-precise collection, and relocating collectors. It does not require a shadow stack, though it does provide one, which is helpful for first-time GC clients to get started with. LLVM’s overall GC design supports concurrent collectors and mutators, but the implementation lacks a few key ingredients to minimize the performance impact of GC in such situations.

There are a few missing features on the implementation side which obviously don’t need major design overhauls to fix. The big three on my list are code emission at safe points, safe points on back edges, and support for register-allocation of GC roots.

The last one is sort of interesting, because I’ve seen several proposals to change the design of LLVM’s GC infrastructure in a major way to obtain it. The reasoning goes like this: we currently have llvm.gcroot, which marks a stack slot address as being a GC root. But we want to store roots in machine registers, not stack slots, so we need a new intrinsic to annotate an SSA register as being a GC root as well. With such an intrinsic, LLVM would automatically store and load SSA registers to/from stack slots “behind the scenes.”

So, for example, suppose we start off by acquiring a GC-traced pointer:

%root = alloca (gcroot)
%p = … allocate a pointer …
store %p into %root

If we had a non-copying collector, we could get away with code like this:

%r = … maybe trigger GC…
… use %p …

But with LLVM’s current GC support, a copying collector must write this less-legible IR instead:

store %p into %root
%r = … maybe trigger GC…
%p2 = load %root
… use %p2 …

When repeated over and over, it definitely makes the IR harder to read. Wouldn’t it be nice if LLVM would automatically do all that stuff for you? Like this:

… (no alloca needed!)
%p = … allocate a pointer …
call llvm.gc.reg.root(%p)
…
%r = … maybe trigger GC…
… use %p …

It’s a natural desire; the first time I encountered LLVM’s GC support, I wanted the same thing. But it’s a wolf in sheep’s clothing. Remember: the fundamental property of LLVM is that an SSA register is an immutable value. Not just “observably equivalent as far as my high-level program can see,” but equivalent under any observation possible at the IR level!

The problem is that under the hypothetical gc.reg.root scheme, LLVM is free to do constant propagation before the rewrite to the verbose version, and if it does so, then the program’s semantics will have changed! The problem is that %p has been transmuted from a constant to a variable by virtue of being marked as a register root. But in doing so, we’ve taken the key benefit of LLVM’s SSA representation—the power its gets thanks to immutable bindings—and thrown it right out the window!

 

Take another look at the “verbose” pseudo-IR listed above. The common interpretation of this snippet is “spill %p into a stack slot, perform some operation, and then reload %p from the stack slot.”  This is reasoning about the IR according to its current implementation. But I’d argue that it’s better to think about the abstract semantics of the IR. Abstractly, we’re just encoding the fact that the value we think of as %p might change at some point in the IR; beforehand, we called it %p, and afterwards, we called it %p2. Since phi nodes aren’t really viable for most frontends, the only way LLVM allows us to encode mutability is via memory – i.e. alloca.  But the real point isn’t to set up stack slots, it’s just to reflect pointers updates that the GC might be making. As long as LLVM can verify that the “stack slot” address doesn’t escape, it is free to back that mutable storage by symbolically addressed memory: namely, machine registers! It won’t do that analysis now, but it could. That’s the point: no major design tweaks are needed to get register allocation of GC roots, just a bit of extra analysis internal to LLVM.

A separate question is how much of a difference this would make in practice. Effectively, the lack of register allocation means that GC roots are stored in caller-saved machine registers; maybe that’s a performance-killer, or maybe it’s a wash. To the best of my knowledge, nobody has ever carefully looked at the impact of naïve versus varying degrees of sophisticated treatment of GC roots in LLVM-compiled code.

 

Stepping back even more, I think the reason there’s so much confusion about LLVM’s GC support is that there’s a philosophical impedance mismatch with the rest of the project. Elsewhere, LLVM tries hard to allow unsophisticated frontends to get equivalent code quality as a more sophisticated frontend. The canonical example is that with mem2reg, a frontend needn’t worry about SSA conversion or dominance frontiers or anything like that. In general, LLVM’s philosophy is to move as much work from the front-end to the common LLVM infrastructure as possible.  But LLVM’s GC support is different. There’s no automatic transformation from “dead simple” to “efficient” use of GC primitives. Instead, it’s up to each client to do whatever analysis is needed for the most efficient use of the existing GC-related primitives. In practice, this means that clients need to do data-flow analysis if they want to reuse dead root slots. This is in stark contrast to the negligible level of Dragon-book sophistication otherwise expected of a front-end.

 

In terms of how to use LLVM’s GC support effectively, I’ve found it clearest to think of the process as a transformation between separate intermediate languages. The goal of the transformation is to add a minimal selection of root slots, along with just enough loads and stores to preserve correctness. The crucial invariant is that at each potential GC point, the live roots must reside in (abstract) memory. So: backwards analysis to determine liveness, and a forwards rewrite to add the loads and stores. Interestingly, this combination of a backwards analysis with a forwards program rewrite is not well-supported by existing dataflow tools or their underlying theoretical frameworks. Anyways, to implement the given invariant, you’ll need to know which types require tracing, and which operations may end up invoking the GC. You can sort of encode this information via metadata and pointer address spaces, if using LLVM IR as your source IR, but I think it’s easier to think of the two IRs as being completely separate.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s