ELY/rona

The population of a country shifts over time. Here’s how the population distribution of the United States changed from 2000 to 2010:

There’s a noticeable population bulge from Boomers (born ’46-64, age 36-54 in 2000). For more up-to-date numbers, the Census Bureau helpfully provides a projection estimating the distribution for 2020.

The CDC also provides statistics about yearly influenza burden, with impact bracketed by age group.

Confluenza

We can combine the two data sets to determine not only how many lives are lost to influenza, but how many expected life-years (ELY) are lost. As before, a government agency has relevant data. Finally, we might ask the value of a year of life (often called a Quality-Adjusted Life Year, or QALY); this figure varies by country, but a reasonable ballpark is $100,000.

We can combine these data sources to estimate the “cost” of a disease. For the yearly influenza season, the numbers come out like so:

Flu::
across pop at age  3, est.    81 deaths (IFR = 0.00399%)
across pop at age  7, est.   106 deaths (IFR = 0.00529%)
across pop at age 12, est.   157 deaths (IFR = 0.00759%)
across pop at age 17, est.   219 deaths (IFR = 0.0104%)
across pop at age 22, est.   289 deaths (IFR = 0.0132%)
across pop at age 27, est.   377 deaths (IFR = 0.016%)
across pop at age 32, est.   433 deaths (IFR = 0.0189%)
across pop at age 37, est.   541 deaths (IFR = 0.0247%)
across pop at age 42, est.   638 deaths (IFR = 0.0313%)
across pop at age 47, est.   760 deaths (IFR = 0.0378%)
across pop at age 52, est.   911 deaths (IFR = 0.0444%)
across pop at age 57, est.  1110 deaths (IFR = 0.051%)
across pop at age 62, est.  2076 deaths (IFR = 0.0988%)
across pop at age 67, est.  3218 deaths (IFR = 0.178%)
across pop at age 72, est.  4516 deaths (IFR = 0.306%)
across pop at age 77, est.  5383 deaths (IFR = 0.535%)
across pop at age 82, est.  5974 deaths (IFR = 0.918%)
across pop at age 87, est. 11404 deaths (IFR = 2.89%)
across pop at age 92, est. 14633 deaths (IFR = 7.25%)
total est loss of 52826 lives, 0.575e6 ELYs lost ($57.48 B)
est whole-population IFR: 0.16%

This is mostly a sanity check, and it looks reasonable: within a factor of two or so of a regular flu season (just barely worse than 2014-2015; about twice as bad as 2015-2016). For what it’s worth, two aspects of the model go beyond direct cross-correlation of the data. First, the model assumes a flat 10% infection rate; in reality, rates usually vary between 10% and 20% across age brackets. Second, the relatively coarse granularity of the CDC’s data, used directly, would produce sharp discontinuities in mortality rates between brackets. Another CDC data set provides finer-grained information for people aged 65+, showing that each decade of age roughly triples the risk of dying from flu. The model uses piecewise linear interpolation to estimate mortality at specific ages.
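For concreteness, here is a minimal sketch (in Python) of the kind of arithmetic the model performs. The bracket figures below are illustrative stand-ins, not the actual Census/CDC inputs, and the real model interpolates IFR between brackets rather than using them raw.

# Sketch of the ELY/cost computation described above; the bracket data is
# made up for illustration, not the real Census/CDC numbers.
QALY_DOLLARS = 100_000          # ballpark value of one life-year
INFECTION_RATE = 0.10           # flat 10% infection rate, as in the model

# (age, population, IFR, expected remaining life-years) per bracket
brackets = [
    (62, 10_500_000, 0.000988, 21.0),
    (67,  9_000_000, 0.00178,  17.4),
    (72,  7_400_000, 0.00306,  14.0),
]

total_deaths = total_elys = 0.0
for age, pop, ifr, remaining_years in brackets:
    deaths = pop * INFECTION_RATE * ifr
    total_deaths += deaths
    total_elys += deaths * remaining_years

print(f"est. {total_deaths:,.0f} deaths, {total_elys / 1e6:.3f}e6 ELYs "
      f"(${total_elys * QALY_DOLLARS / 1e9:.2f} B)")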

COVIDity

By changing the fatality rate model to use numbers from South Korea (as of 2020/03/18), we can estimate the likely impact of one full year’s worth of coronavirus circulation. The model does not account for hospital capacity limits (which would raise the effective IFR and thereby the number of deaths), the lack of population/vaccine-derived immunity (which would raise the population infection rate), or the effect of social distancing (which we assume reduces the population infection rate and thereby the number of deaths).

Coronavirus::
across pop at age  3, est. 0 deaths (IFR = 0.0%)
across pop at age  7, est. 0 deaths (IFR = 0.0%)
across pop at age 12, est. 0 deaths (IFR = 0.0%)
across pop at age 17, est. 0 deaths (IFR = 0.0%)
across pop at age 22, est.   218 deaths (IFR = 0.01%)
across pop at age 27, est.  1175 deaths (IFR = 0.05%)
across pop at age 32, est.  1838 deaths (IFR = 0.08%)
across pop at age 37, est.  1753 deaths (IFR = 0.08%)
across pop at age 42, est.  3677 deaths (IFR = 0.18%)
across pop at age 47, est.  3618 deaths (IFR = 0.18%)
across pop at age 52, est.  7793 deaths (IFR = 0.38%)
across pop at age 57, est.  8273 deaths (IFR = 0.38%)
across pop at age 62, est. 28785 deaths (IFR = 1.37%)
across pop at age 67, est. 24701 deaths (IFR = 1.37%)
across pop at age 72, est. 35716 deaths (IFR = 2.42%)
across pop at age 77, est. 52979 deaths (IFR = 5.27%)
across pop at age 82, est. 44709 deaths (IFR = 6.87%)
across pop at age 87, est. 33357 deaths (IFR = 8.46%)
across pop at age 92, est. 18677 deaths (IFR = 9.26%)
total est loss of 267269 lives, 3.44e6 ELYs lost ($344.2 B)
est whole-population IFR: 0.81%

I find these numbers believable: roughly the logarithmic midpoint between a standard flu season and predictions of millions dead. The difference in impact on “lives” versus “life-years” is moderate but noticeable (roughly 5.1x vs 6.0x). This reflects the disproportionate effect that coronavirus appears to have on those in their 70s, relative to the flu. As usual, a chart makes the numbers more striking:

The differences are due to the measured fatality rates. Due to the small tallies, the linear scale obscures what’s happening in younger cohorts. Using a logarithmic scale reveals the data:

We can also look at the IFR directly, without scaling by subpopulation sizes. This reveals some artifacts of the estimated IFR. For example, the South Korean data is segmented by decade, so each decade at or below 60 gets the same data, visible as a stair-step pattern. Estimated rates for higher decades were linearly interpolated to produce a more realistic curve. My takeaway from this chart is that the estimated IFR for age 62 should be nudged down.

Contamsion

Stepping back: One sketchy thing going on here is the blurring of “infections” versus “cases”. Should we be concerned that the numbers for coronavirus are biased by preferentially testing those with severe cases?

The CDC estimates that for influenza, only one third of symptomatic illnesses result in medical visits, and only 5% or so of medical visits are severe enough to result in hospitalizations. The ratio of deaths to hospitalizations is about 7.6%; the ratio of deaths to infections is about 0.13%, and the ratio of hospitalizations to infections is about 1.65%.

South Korea has done 274,000 tests finding 8,565 confirmed cases, of which 1.7% are now serious, critical, or dead. This may suggest that their testing has been broad enough to not be massively distorted by oversampling severe cases. (I don’t know offhand how many serious cases eventually recovered).

In contrast, Italy has had 35,713 confirmed cases, of which 14.6% are currently serious, critical, or dead.

Balancing Act

Leaders in the USA and other countries are deciding how to balance public health against economic pain. China’s numbers suggest that economic stasis can slow the infection. At what point does economic harm take precedence? The two are closely linked, especially in places with weak social safety nets; recessions bring lost jobs, homelessness, stress, anxiety, depression, etc. When health insurance is linked to employment, a recession means that society must pay for a pound of cure instead of an ounce of prevention.

The analysis above suggests that the “cost” of the flu (in life-years, not hospital bills or other expenditures) is about $60 billion/year in the United States, and the projected “cost” of the coronavirus would be on the order of $300 billion. These are not trivial numbers, to be sure, but for context, tax revenue for fiscal year 2017 was $3.32 trillion. On this basis, it might make sense to endure economic turmoil equating to a few weeks’ worth of tax revenue, but not much more than that.

(edit 2020-03-19: Two UC Berkeley economists suggest having the government be payer of last resort to avoid mass unemployment. They estimate a hit to GDP of 7.5% per quarter, i.e. about half a trillion dollars per month, due to the direct economic fallout of social distancing, and estimate that the government’s preventative costs would be half that. Ray Dalio thinks that corporate losses will top $4 trillion.)

(edit 2020-03-20: To clarify, the point of this last section is not to suggest that an amelioration should be rejected because it is strictly more costly than one particular alternative projection. Obviously, there is room for reality to diverge from the above model by at least an order of magnitude. The point is to acknowledge the uncomfortable traffic between dollars and life-years, and use it as one lens through which to see the world.)

Dessert

I admit, it’s a bit bananas to compare apples and oranges like this. But… fruit salad is delicious.

Two Lovely Talks from POPL 2020 about Real-World Bugs

I recently watched two videos from POPL and wanted to help make their ideas more widely known. Ten minutes of text to help you decide whether to watch eighty minutes of video, what a deal!

Incorrectness Logic by Peter O’Hearn (paper, talk)

The core idea: create a program logic for finding bugs rather than preventing them.

The technical details of this work are beautiful in their own right, but I wanted to explain the broader picture of what (I think) it’s all about, and why it’s exciting. I suppose the target audience for the first half of this blog post is people who get why their compiler produces error messages, but aren’t so keen on inference rules.

One thing I really like about this work is that it’s the output of a virtuous cycle between theory and practice. O’Hearn (and others) worked on some promising theories in academia, then went to Facebook to build tools for working programmers. Seeing ways the tools might be improved, they went back in search of new theory to make the tools better. This sort of back-and-forth seems to produce great results when it happens, but unfortunately it’s rarer than one might hope.

Background

A type system splits the universe of all possible programs, accepting some and rejecting others. We’d like to reject programs that do bad things and accept programs that don’t do bad things. Most type systems for general purpose languages have an extra category: programs that don’t do bad things but are rejected anyways. This is an inevitable consequence of trying to analyze Turing-complete programs with the guarantee that the type system will rule out every possible form of the particular badness it cares about.

A more sophisticated type system might rule out more bad programs, or accept more good ones, or both. One big cost of this power is that the programmer’s understanding of the system must also grow. When a program is rejected, the programmer first faces the burden of translating from the language of the type error they’ve been given into their own mental model of what the program might do, and then determining whether the error is spurious or not. Spurious errors—those due to limitations in the type system itself—can be avoided by rewriting code in some equivalent form that the type system can do a better job of reasoning about. For “real” errors, the behavior of the code itself has to change.

This disconnect, and the burden it places on the programmer, effectively limits how sophisticated a widely used type system can be.

I mentioned three categories before, but there’s really a fourth as well: programs which do bad things but are accepted because the purview of the type system is (intentionally) limited by the above dynamic. Some of these cases, like array bounds checking, will be caught by the language runtime; others, like SQL injection attacks, probably won’t be. Distributed systems are common, but type systems to make them robust… not so much.

We can try to eliminate this fourth category. Doing so leads to the realm of proof systems and dependent types and so on. At the end of this path, types become formal specifications of what it means for our program to be correct, and our type system is a proof checker. But for most code, it’s not clear what a complete formal specification would even look like (what’s the correctness condition for Slack, or Facebook’s news feed, or a video game?). The overwhelmingly common case is partial correctness, and therefore analyses that catch some bugs and not others.

O’Hearn’s Incorrectness Logic

Peter O’Hearn’s paper on Incorrectness Logic recognizes this situation, and turns it on its head. Most prior work (on type systems, program analysis, verification, etc) focuses on proving the absence of bugs, and is thereby shackled by the burden of false reports. O’Hearn instead examines the theoretical underpinnings of a system focused on proving the presence of bugs.

This rearrangement turns out to have several pleasing symmetries with traditional approaches to program analysis. It also results in a number of useful properties. Instead of a system limited to conveying messages of the form “I wasn’t able to prove that your code is safe in all circumstances” we can be much more concrete, telling the programmer “I was able to prove that your code is unsafe in the following circumstance.” No more false reports; no more struggling to reverse-engineer a bug from a type error.

There have, of course, been decades of tools and techniques focused on finding bugs. Those tools and techniques were by necessity concrete in their analysis, in order to find specific categories of bugs. Different tools used different techniques to find different bugs. The appeal of O’Hearn’s work is that it’s a (more) generic framework for doing the sort of reasoning that results in evidence of bugs.

There are three exciting consequences here (that I can see, and certainly more that I can’t!). First, this new logic can serve as an engine driving a pluggable definition of badness, which means that powerful bug finding tools should become cheaper and easier to build. As O’Hearn puts it,

“Incorrectness is an abstraction that a programmer or tool engineer decides upon to help in engineering concerns for program construction. Logic provides a means to specify these assumptions, and then to perform sound reasoning based on them, but it does not set the assumptions.”

The paper gives an example of an analysis for detecting flaky tests and characterizing why they aren’t reproducible. Such a tool would be widely useful in industry!

Second, the shift in perspective opens the door to considering much more sophisticated forms of reasoning. Rather than the hapless programmer being forced to intuit or reverse-engineer a problematic scenario from the failure to prove correctness, incorrectness logic can directly describe program states that are guaranteed to be problematic. And because the machinery needed to find those states stays hidden from the programmer, it can be chosen on the basis of its power rather than the simplicity of its mental model.

Third, soundness is a binary property, but unsoundness is a spectrum. O’Hearn describes a tool used at Facebook, and how by dialing back a simple parameter, the tool ran ~2.75x faster while still finding 97% of the bugs from the slower run. “Variable effort for variable results” is often an appealing property.

There’s still a considerable amount of work left to do, of course, but incorrectness logic strikes me as a plausible foundation for a new generation of powerful bug finding tools. Fingers crossed!

What is a Secure Programming Language? by Cifuentes & Bierman (paper, talk)

Cristina Cifuentes presented one of the keynotes at POPL this year. Her talk provides a great introductory overview to some of the challenges facing those who want to address (some of) the industry’s security woes with programming language technology. If you’re not familiar with this area, you’ll probably get a lot out of the talk! It does a nice job of providing enough detail—including code snippets—to make things concrete without going too deep into the weeds on any one area.

Cifuentes and her coauthor Gavin Bierman observe that “secure” is a nebulous term even within the narrow domain of PLs. It’s not uncommon for a research paper to be intuition-driven: identify some property that can be used as part of an attack, develop a mitigation for that property, and call it a day. This paper instead takes a data-driven approach, categorizing bugs from public vulnerability databases and looking at how mainstream languages succeed or fail in preventing existing bug classes. Their definition of “mainstream” is also data driven: those languages with long-term spots in the TIOBE top 10, which were Java, C, C++, Python, C#, PHP, JavaScript, and Ruby.

The categorization of bug classes is a bit fuzzier; they consider buffer overflows, injection attacks (such as XSS and SQL injection), and information leakage, such as sensitive information making its way into a log file. Other classes—like use-after-free errors, path traversal bugs, cryptographic issues, and access control errors—are not considered, because it’s unclear to the authors how those categories can be addressed with linguistic abstractions. The authors give brief coverage of the larger universe of security; a slide nods towards practical and theoretical work on multi-language system building, including compartmentalization, multi-language runtimes, and linking types.

The talk ends with an exhortation to improve the state of affairs for the 18.5 million extant programmers dealing with imperfect languages: “It’s time to introduce security abstractions into our language design.”

The POPL audience then put forth a few big-picture questions but time limits precluded any real back-and-forth. Paraphrased:

  1. The talk gives a list of different security vulnerability categories to worry about. Does it leave out anything important?
  2. The talk suggests solving security flaws with new language abstractions. Do we need to build separate abstractions/mechanisms to address each individual problem? What about other approaches, such as formal verification? Or capability-based security?
  3. What happens if we don’t improve things, if we keep on using C?

I hope the talk (and this writeup) might catalyze further discussion. Here are some of my thoughts:

  • Consider one specific sub-category of flaw: SQL injections. Languages like Python and PHP don’t prevent SQL injection attacks, but the real story is a bit subtler, in an interesting way. These languages provide library support for avoiding SQL injections, but have no way of forcing the use of safe versus unsafe functions (see the small sketch after this list). Do we understand why programmers choose insecure approaches when secure ones are right there?
  • It’s tempting to say “Oh, the key is that we lack linguistic abstractions; library-based conventions are too easy to bypass.” But I don’t think that captures the whole story, especially for things like information flow. Linguistic abstractions for information flow security—whether we’re talking static labels, tagged values, faceted values, or whatever else—provide mechanisms to enforce some policy. But the choice of policy to enforce is left free.

    Compare and contrast with something like GC. Language support for automatic memory management forcibly eliminates (most of) the errors of non-automatic memory management. But language support for information flow does not forcibly eliminate information flow errors; it provides infrastructure that can be used to eliminate such errors. Or phrased more starkly: infrastructure that, when used properly, can eliminate such errors. A subtle but important difference!

    Enforcing security impacts program behavior. If you take a C program, delete all the free() calls, and run it with the BDW GC, you will remove several failure modes (use after free, double free, etc) and add no new linguistic failure modes. In contrast, applying information flow security to a program absolutely risks adding new failure modes. If the logging policy is accidentally too restrictive, your logs will be missing crucial data. If you don’t declassify the output bit of the password checker, you can’t log in! Making the IFC policy too loose will allow bad behavior; making it too tight will prevent good behavior. Empirically, the world has a strong revealed preference towards not preventing good behavior.

    It’s worth considering not only whether secure abstractions exist, but what properties they need to have to make adoption feasible.
  • Bugs in the real world often have multiple contributing factors, which aren’t always accurately captured by the CWE classifications.

    For example, CVE-2014-0574 was labeled with CWE-94 (Code Injection), but its root cause was a double-free error, which was not associated with the vulnerability, thereby under-counting the prevalence of temporal memory safety issues. Likewise, CVE-2014-1739 was labeled CWE-200 (Information Exposure) although the root cause was C’s lack of required initialization, i.e. type unsafety. Other CWEs are highly correlated: Code Injection tends to have a lot of overlap with Improper Input Validation, but usually each bug gets put in just one of those two buckets.

    As with many classification efforts, it doesn’t take much digging to uncover fractal complexity.

    The CVE/CWE data incorporates a certain amount of inherent subjectivity, and the labels it uses may not be the labels you would use. To be clear, I don’t think these labeling issues affect the paper’s argument. After all, the paper explicitly ignores half the dataset! But I do think it’s generally worth being aware of whatever gap, big or small, exists between expectations and reality.
  • Sometimes researchers get the luxury of using precise internal jargon. Terms like “minimum-weight spanning tree” or “pullback functor” leave little room for misinterpretation. Other words risk a collision of terminology that already has meaning in the audience’s mind. “Secure” is one such word. Cifuentes & Bierman give a precise definition for what they mean when they say “secure programming language” but it’s a bit quirky. For example, because it doesn’t cover use-after-free bugs, and because return-oriented programming attacks don’t actually inject new code, their definition wouldn’t eliminate unintended arbitrary remote code execution. I don’t think that state of affairs matches what most people mean when they hear “secure programming language.”

    What do people mean by “secure”? If you ask ten people to define what they want, expect eleven different answers.
  • Okay, sure, defining “secure” is hard. Let’s forget about that and ask a different question: what properties should a definition have?

    For example, let’s rewind to 2017 in an alternate universe. In this new old reality, your favorite web browser has a full-stack formally verified JavaScript implementation, seL4 style, down to the x86 ISA, runtime included. No type safety violations, no heap feng shui, no CVEs, no nothing. Everyone agrees this implementation is secure.

    Then Spectre happens, and oops, untrusted JS can read the kernel’s heap.

    Now… what actually changed? Not the code, for sure. Not the proofs. The definition of “secure”? Not in any glaringly obvious way: before Spectre, everyone agreed that being unable to read kernel memory was an important part of being secure, and believed that the implementation satisfied that property. And in some sense, they were right one day and wrong the next.

    Maybe “secure” has more to do with observers/attackers of a codebase rather than the codebase itself. Or perhaps “secure” cannot be a property of code, but only of systems: it must bring together properties of language, compiler, CPU, etc. Put another way, security can only be defined relative to a threat model, and what Spectre changed is that the CPU became part of the threat model. One unintuitive consequence of this view is that a trusted and formally verified compiler would have to be considered part of the threat model. The systems view says that “secure” cannot be modularized, and there’s no such thing as a trusted computing base.

    (As an aside, I don’t believe that these ideas about security are novel, but I’m unsure of the proper citations. Maybe Fred Schneider? Pointers from those more familiar with the security literature would be greatly appreciated!)
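Returning to the first bullet above, here is a minimal sketch using Python’s sqlite3 module. The point is that the safe, parameterized form sits right next to the unsafe one, and nothing in the language forces you to pick it.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin')")

name = "alice' OR '1'='1"   # attacker-controlled input

# Vulnerable: string interpolation lets the input rewrite the query.
rows = conn.execute(f"SELECT role FROM users WHERE name = '{name}'").fetchall()
print("interpolated:", rows)    # returns a row despite the bogus name

# Safe: a parameterized query treats the input purely as data, never as SQL.
rows = conn.execute("SELECT role FROM users WHERE name = ?", (name,)).fetchall()
print("parameterized:", rows)   # returns nothing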

The talk’s authors are aware of most or all of these concerns. After writing this up, I saw that an earlier iteration covered a broader range of topics, including capabilities, cryptography, access control, the pros and cons of formal verification, and the library-versus-language question. Streamlining allowed deeper coverage of the most critical pieces. In the end, security is a big hairy problem, and even an hour-long talk can only capture a small piece of the overall picture.

In summary: security matters, and data-driven decisionmaking is a powerful technique. Maybe the combination can help crack a tough nut.

Undergrad Compilers from the Hive Mind

John Regehr recently solicited advice for what an introductory compilers class should cover in 2020. To save you the trouble of reading through the whole thread, I’ll quote/summarize a bit before throwing in my two cents.

  • Many people wanted less parsing/lexing and more about the middle end; several others (including John) noted with some chagrin that parsing is the most widely applicable knowledge conveyed in this sort of class.
  • Several people wanted to highlight the gap between theory and practice in the design & structure of compilers. Traditionally, compilers have been batch processes, whereas IDEs want compilers to be an interactive service that they can query incrementally. Also, error messages tend to be an afterthought in the classic presentation of compilers, but projects like Clang, Elm, and Rust have found big wins in going out of their way to give better error messages. The implementation impact can be big: for example, Rust does speculative parsing by snapshotting the compiler state to give better fix-it hints. There’s also been a burst of research in the last decade on improving error messages for programs with inferred types, but it’s just hard to determine what code is wrong when an inconsistency is found.
  • Sam Tobin-Hochstadt pointed out that “Your students are more likely to hack on Babel than on GCC.”; Will Crichton pointed to a course focusing on DSLs rather than general-purpose languages.
  • Several people suggested covering JIT compilers, both for the different assumptions involved and also for the feats that can be accomplished with e.g. Truffle/Graal/Sulong.
  • Rohan Padhye pointed to Berkeley’s compilers class, with some really cool infrastructure for compiling a subset of Python 3 to RISC-V.
  • A few people suggested lightweight correctness/testing methods: differential testing of a compiler vs an interpreter; certifying the output of compiler passes; checking invariants (like type correctness) on multiple IRs.
  • A few people advocated for incremental construction of compilers to make them more approachable, and also for teaching compilers as the functional composition of passes.
  • Geoff Hill pointed to Robby Findler’s course, which does something unusual: it starts from an assembly-like IR, and builds the compiler backwards (!) towards higher-level languages.
  • Matthew Fernandez suggested covering type inference and Reflections on Trusting Trust.
  • Joe Groff wanted to highlight the deep similarity between compilers and interpreters.
  • Talia Ringer noted how much fun it was as a student to be given an open-ended programming assignment; Joe Gibbs Politz confirmed that giving such assignments is strenuous but rewarding.
  • Celeste Hollenbeck noted the appeal of focusing on the basics over esoterica. Compilers for the masses, not compilers for compiler folk. On a perhaps related note, @type_monkey recalled the pleasure of being given historical context when learning PLs.

Parsing’s definitely not the coolest part of compilation but it is important. I’d want students to understand the following about parsing:

  • Various approaches exist: hand-rolled, generated, combinators, PEGs. Focus on parser generators, and give coverage to the others with an eye towards the tradeoffs involved. Err towards broad coverage, because faced with a parsing task, they should understand both what they can do and what others might do.
  • The approaches come with tradeoffs about up-front work versus automated checking/guarantees (much like type systems). Parser generators sometimes complain for silly-seeming reasons (do I care if my grammar is k=1 vs k=2?) but can also catch actual unintended ambiguity in grammars. PEGs and combinators eliminate annoying complaints about grammar structure, but sometimes by silently choosing the “wrong” parse tree.
  • Having a specification distinct from implementation is appealing for parsers just as with any other domain.
  • The choice between LL/LALR/LR is (in practice) mostly an implementation detail that affects how the input grammar needs to be structured.
  • LL parsers correspond directly to the sort of recursive descent parser they might write by hand. I sometimes find it handy to reason about grammar conflicts by thinking about what “goes wrong” in the corresponding recursive descent code. (“Oh, when the parser sees symbol X, it doesn’t know if it should start parsing production Y or production Z”). LR parsers are recursive ascent, which I (still) find to be a trickier mental model when debugging grammar conflicts. I’d want students to get some exposure to shift/reduce vs reduce/reduce errors but I’m not sure quite how much.
  • Parsers for programming languages don’t need to be 100% strict in what language they accept; it’s often better to shift work to the rest of the compiler. (Most(ly) useful for non-hand-rolled approaches). For example, instead of encoding operator precedence in the structure of the grammar itself, the generated parser can build flat lists of operator expressions which then get structured with a separate operator precedence pass (see the sketch after this list). Another example is that by unifying certain lexical categories, such as number and identifier, errors for incorrect programs come from the compiler rather than the parser. A third example is the mini-syntax for strings, with escapes and interpolations and such.
  • I think it’s illustrative to show how e.g. ANTLR’s EBNF-style syntax sugar for grammars, which I find quite natural, can be desugared. Definitely a “yo dawg” moment.
  • Parsing a single well-defined language is (mostly) a solved problem with strong (and very pretty!) underlying theories, but parsing as a whole is not completely solved. Example: embedding/composing languages is common, especially on the web, but support in existing tools is not great so it’s mostly done by hand.
  • Unparsing (both serialization in general, and also faithful pretty-printing) has a much larger gap between theory and practice. Automated formatters are common for Go and Rust and C++, but there’s not much research/theory or tooling support for doing it well.
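As promised above, here is a sketch of the “flat list, then precedence pass” idea in Python. The token-list format is an assumption for illustration, not taken from any particular parser generator.

# Precedence pass over a flat list like [1, '+', 2, '*', 3, '-', 4], the kind
# of thing a deliberately loose grammar might hand back for operator expressions.
PREC = {'+': 1, '-': 1, '*': 2, '/': 2}

def climb(tokens, pos, min_prec):
    """Precedence climbing: returns (tree, next position)."""
    lhs, pos = tokens[pos], pos + 1
    while pos < len(tokens) and PREC[tokens[pos]] >= min_prec:
        op = tokens[pos]
        rhs, pos = climb(tokens, pos + 1, PREC[op] + 1)   # +1 gives left associativity
        lhs = (op, lhs, rhs)
    return lhs, pos

tree, _ = climb([1, '+', 2, '*', 3, '-', 4], 0, 1)
print(tree)   # ('-', ('+', 1, ('*', 2, 3)), 4)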

For the rest of the compiler, the pass-based architecture with a few judiciously chosen IRs seems like an obvious choice. A few more specific thoughts:

  • The details of abstract interpretation and Futamura projections are for sure advanced material, but I think intro students should get some intuition of the connection between compilation & interpretation, and how static analysis approximates dynamic semantics. It doesn’t have to be much: take a concrete snippet of code and show how the same operations are happening, just at different phases/times (see the sketch after this list).
  • Show how changing (properties of) the input language can make the compiler’s job significantly easier or harder. Potential examples: the difficulty of parallelization and vectorization for Fortran vs C vs NESL, or modularity and dependency management in C vs Go, or error messages for a language with local vs global type inference.
  • One concern that looms larger than in the past is latency, especially in the distributed systems context. The most obvious connection is with the runtime and GC, but the compiler can also play a role; many languages now feature generators, coroutines, or async/await, with compiler support.
  • Optimizations are fun, especially “heroic” optimizations, but they’re also a nice confluence of several camps/viewpoints: “let’s do it cuz it’s cool” versus “is the complexity worth it” versus “is the compile time hit worth it” versus “does this make the performance model more fragile”. And, of course, they’re a great starting point for exploring the ways in which modern architectures differ from historical assumptions.
  • I think it’s interesting that optimizations can have negative runtime cost, by reducing the workload on the remainder of the compiler (similar story with GC: the overhead of GC barriers can be more than offset by reducing the GC’s workload). Chez enforced this “pay their own way” rule.
  • Ultimately, a big question is what not to cover in an intro class. I’d go so far as to defer most “backend” issues, including data flow analysis (!), to an advanced class. Likewise, ignore pretty much all details related to concurrency and parallelism. It’s a tough balance: give enough information to be enticing, without going too deep in any one area or spreading coverage too thin. I’d also avoid CYK/Earley/GLR/derivative-based parsers and most details of LR parsing’s internals. Coverage of lexing probably depends on what they’ve seen in theory of computation.
  • In my undergrad software engineering class, we ended up using ANTLR to parse a very ad-hoc DSL. In retrospect, the experience was valuable precisely because it was less clear-cut how to separate the input into lexical vs syntactic categories. Not super actionable, I suppose, but to the extent that students will go on to do parsing, they probably won’t be given a clean language design to start from.
  • An intro compilers class shouldn’t get too into the weeds on “adjacent” topics like type checking, but it would be nice to convey some of the range of what can be done with static analysis: substructural typing in Rust, definite assignment in Java/Swift, Extended Static Checking, maybe a whiff of SMT-based refinement type checking (Refined TypeScript?). Another nice example for “things you can do with static analysis but not so much with runtime checks” could be static enforcement of constant timing (e.g. for cryptographic code). Students should also understand that it’s become more common to separate type checking from compilation. For example, JavaScript has tools that do compilation without typechecking (Babel, webpack), checking without compilation (ESLint), and both together (TypeScript).
  • Likewise, this is too far afield to spend much time on, but it might be worth taking a few minutes from a lecture to give a high level overview of how compilers integrate with modern software engineering practice and infrastructure: CI/CD, coverage-guided fuzzing, hermetic builds, massively parallel builds, distributed tracing, etc.
  • Bootstrapping is a fun puzzle, and meshes well with Reflections on Trusting Trust.
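And here is the sketch promised in the bullet about compilation vs interpretation: a toy expression language in which the same addition happens either at run time (in the interpreter) or at compile time (in a constant-folding pass).

# Toy illustration: expressions are nested tuples like ('+', ('+', 1, 2), 'x').
def interpret(expr, env):
    if isinstance(expr, tuple):
        _, a, b = expr                       # only '+' in this toy language
        return interpret(a, env) + interpret(b, env)
    return env.get(expr, expr)               # variable or literal

def fold(expr):
    """'Compile' by doing the additions that don't depend on run-time input."""
    if isinstance(expr, tuple):
        op, a, b = expr
        a, b = fold(a), fold(b)
        if isinstance(a, int) and isinstance(b, int):
            return a + b                     # this add happens at compile time
        return (op, a, b)
    return expr

expr = ('+', ('+', 1, 2), 'x')
print(interpret(expr, {'x': 10}))   # 13: both adds happen at run time
print(fold(expr))                   # ('+', 3, 'x'): one add already done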

Kami is Graph Contraction

A while back I downloaded an iOS game called KAMI 2. It’s a fun, relaxing puzzler with many nice touches — a very well-crafted app. Playing the game looks like this:


The goal is to get the screen to show just one color (any color) using a limited number of moves. Each move changes a connected region’s color. In the screenshot above, there are 8 moves left.

What I realized, eventually, is that the core mechanic is graph manipulation. Each solid region is a node with an associated color. Edges connect adjacent regions. Changing a region’s color corresponds to changing the associated vertex’s color, followed by edge contractions to eliminate adjacent nodes with the same color. A monochromatic screen corresponds to a singleton graph.
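Here’s a rough sketch of that mechanic in Python. The adjacency-set representation is my own stand-in, not necessarily what the OpenCV script mentioned below produces.

# The core move: recolor a region, then contract edges to any neighbors that
# now share its color. Assumes adjacent regions never start with the same
# color, which holds in KAMI since same-colored neighbors are already merged.
def recolor(adj, colors, region, new_color):
    colors[region] = new_color
    for nbr in list(adj[region]):
        if colors[nbr] == new_color:                  # merge nbr into region
            adj[region] |= adj[nbr] - {region, nbr}
            for other in adj[nbr] - {region}:
                adj[other].discard(nbr)
                adj[other].add(region)
            adj[region].discard(nbr)
            del adj[nbr], colors[nbr]

# Three regions in a row: red - blue - red. Recoloring the middle one red
# contracts the whole board down to a single node.
adj = {0: {1}, 1: {0, 2}, 2: {1}}
colors = {0: 'red', 1: 'blue', 2: 'red'}
recolor(adj, colors, 1, 'red')
print(adj, colors)   # {1: set()} {1: 'red'}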

It turns out that the general problem of deciding whether you can turn one graph into another one via edge contractions is NP-hard. However, that doesn’t directly apply to KAMI, since the singleton graph is a trivial target, and we’re furthermore interested in limiting the number of moves made.

I hacked up a Python script using OpenCV to generate graphs from screenshots of KAMI:

 

For most of the early (easy) boards, you can simply find the node that connects to the largest number of other-colored regions. But the graph above is a counterexample to the greedy application of that rule: the upper-left black patch connects to four red regions, and the 2nd biggest black patch connects to five white triangles, whereas the correct move is to connect the two big green triangles via the beige node.

I’m pretty sure that random selection of nodes to contract (and colors to contract with) yields exponential running time. Needless to say, exhaustive search is infeasible. I haven’t yet come up with a good heuristic to guide the search process in balancing between greedy exploration and minima-avoidance. I expect that a traditional optimization algorithm like simulated annealing or hill climbing would be very effective with a bit of tuning, but I’d prefer a non-randomized approach.

Even though I haven’t yet cracked this nut, I thought the problem was pretty enough to write up!

 

Summarizing Garbage Collection

This post is intended to give a whirlwind[0] tour of what I think are some highlights of the last decade-ish of the research literature on garbage collection. If you’re interested and want to learn more, the Garbage Collection Handbook (via Amazon) is an indispensable resource. Ravenbrook’s Memory Management Reference is also quite nice!
 
Background
There are four “elemental” algorithms for garbage collection: reference counting, mark/sweep, compacting, and copying. The latter three are referred to as tracing collectors.
 

Reference counting tracks how many copies of each pointer there are in the heap. When the heap contains zero copies of a pointer, the memory it points to can be reclaimed.

The three tracing techniques all start the garbage collection phase by iteratively following the pointers stored in registers, the stack, and globals (called roots) to determine the live subset of allocated data that the program can access (this is the “mark” part of mark/sweep). Allocations that couldn’t be reached from the program’s roots are dead and can be reclaimed, in one of several ways. The “sweep” part of mark/sweep conceptually calls free() for each dead allocation. Compaction takes live objects and moves them into the holes left by the dead. Copying collectors start with the heap split into two halves, with all live data in one of the halves. Instead of sweeping or compacting, each live object is moved into the spare half. When all live objects have been copied out, only dead objects remain in the original half, and it can be reused without further processing. 
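A toy version in Python may help make the “mark” and “sweep” phases concrete (real collectors, of course, work over raw memory rather than a list of objects):

# Toy mark/sweep over a heap of objects that hold references to each other.
class Obj:
    def __init__(self, name):
        self.name, self.refs, self.marked = name, [], False

def mark(roots):
    stack = list(roots)
    while stack:                       # "mark": trace everything reachable from roots
        obj = stack.pop()
        if not obj.marked:
            obj.marked = True
            stack.extend(obj.refs)

def sweep(heap):
    live = []
    for obj in heap:                   # "sweep": conceptually free() each dead object
        if obj.marked:
            obj.marked = False         # reset the mark for the next cycle
            live.append(obj)
    return live

a, b, c = Obj('a'), Obj('b'), Obj('c')
a.refs.append(b)                       # c is unreachable from the roots
heap = [a, b, c]
mark(roots=[a])
print([o.name for o in sweep(heap)])   # ['a', 'b']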

Much of the garbage collection research of the last few decades can be seen as schemes to extend, combine, or refine these basic techniques.
 
Describing Garbage Collector Performance
 
On a uniprocessor, we have a simple mental model for how GC works: the program runs for a while, until it’s in danger of running out of memory. Then the program is paused while the garbage collector runs to reclaim memory, after which the program resumes. The sum of these pauses is how much slower the program gets overall, and corresponds to GC throughput.
A rule of thumb I find useful: when a paper says they improved throughput, what they really mean is they cut down on their collector’s unnecessary work.
 
In most real-life situations, though, throughput (raw overhead) is not the only metric that matters. You also want to preserve your program’s responsiveness (to mouse clicks, network packets, the audio or graphics subsystems, or whatever else), since pausing your program to run the GC for a long time will result in a frozen GUI, audio glitches, dropped packets/frames, etc. As is often the case in engineering, there’s a tradeoff between these two goals. Techniques that improve throughput usually degrade pause time, and vice versa. Put another way: reducing (worst case) pause time comes at the price of making the program run slower overall.
 
But max pause time by itself can be misleading! A series of short GC pauses with only a few mutator instructions between each pause will usually “feel” like one big pause. Mutator utilization better captures the notion of “responsiveness”. The minimum mutator utilization (MMU) is a number that says, at a particular timescale, what fraction of that time was spent executing non-GC code in the worst case. MMU is usually presented as a graph: the x-intercept is the largest pause time, and the y-intercept is the overall throughput; the shape of the graph in between tells you the mutator’s worst-case responsiveness at different timescales.


MMU example graph. The purple stripes represent GC pauses; note the “sawtooth edge” of the graph

One counterintuitive consequence of the above definition is that MMU sometimes goes down (!) at larger timescales [1]. Suppose the GC introduces a single pause of 10ms. Then MMU at 10ms is zero, and MMU at 20ms is 50%, and MMU at 10s is 99.9%. If the GC introduces a second pause, 90 ms after the first one, then the MMU at 90ms is 88.8%, but the MMU at 100ms is only 80%, since there is a 100ms window which includes two 10ms pauses [2].
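Here’s a small sketch of how the worst case for one window size can be computed from a trace of pause intervals; this is just the core idea, not the grapher mentioned below.

# Worst-case mutator utilization at a given window size, from a list of GC
# pause intervals (start, end) in milliseconds over a trace of length `horizon`.
def mmu(pauses, window, horizon):
    def pause_time_in(lo, hi):
        return sum(max(0.0, min(hi, e) - max(lo, s)) for s, e in pauses)
    # The worst window starts or ends at a pause boundary (or at the trace edges).
    edges = [0.0, float(horizon)] + [float(t) for p in pauses for t in p]
    worst = 0.0
    for t in edges:
        for lo in (t, t - window):
            if 0.0 <= lo and lo + window <= horizon:
                worst = max(worst, pause_time_in(lo, lo + window))
    return 1.0 - worst / window

pauses = [(0, 10), (90, 100)]          # the two 10ms pauses from the example
for w in (10, 20, 90, 100):
    print(w, round(mmu(pauses, w, horizon=10_000), 3))
# 10 -> 0.0, 20 -> 0.5, 90 -> 0.889, 100 -> 0.8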
 
Another “gotcha” when evaluating MMU is that it’s very difficult to account for every last bit of overhead at sub-millisecond granularity. To the best of my knowledge, nobody has ever explicitly factored the effect of GC-introduced cache misses or minor page faults into their MMU calculations. Capturing such events is possible-but-hard; figuring out which are “the GC’s fault” is trickier. And it can be an issue: Click et al observed that, for some concurrent collectors, simply summing the “big” pauses led to reported overhead that was one sixth of the true figure!
 
I wrote up a little mutator utilization grapher that you can play around with if you’re curious.
 
Another fundamental property of collectors is space overhead. This takes many forms: interior and exterior fragmentation, metadata overhead, and reserved areas of the heap. As is often the case, space trades off with time: larger heaps can be collected less frequently, thus increasing throughput, but each collection of a larger heap takes longer, thus increasing pause time.

 
To recap: garbage collector designs encounter “classical” engineering tradeoffs in optimizing for average vs worst case performance. The GC community describes the tradeoff with networking-inspired terminology: throughput vs latency. MMU graphs capture this tradeoff and give a visual presentation of collector performance.
 
Tracing vs Refcounting
One really cool “observation” paper to come out of the memory management community is A Unified Theory of Garbage Collection by Bacon, Cheng, and Rajan. This is one of my favorite types of research results: bringing together two separate schools of thought with a mix of theoretical and practical insights that’s almost blindingly obvious… once you’ve been told what to look for!
 
On the theory side, they observe that tracing and refcounting are very nearly precise duals of each other. The former starts with known-live values and assumes everything else is dead until proven otherwise; it does so by incrementing a one-bit count until it is greater than zero. Reference counting starts with known-dead values and propagates the consequences of their death by decrementing word-sized reference counts until they reach zero. Tracing cannot reclaim individual objects without RC-flavored techniques, and reference counting cannot reclaim object cycles without tracing-flavored techniques.
 
On the practice side, they note that while the simplest implementations of tracing and reference counting have complementary properties, optimizations to either algorithm tend to result in systems that face similar trade-offs. For example, generational and incremental collectors that reduce average pause time need to use write barriers that raise mutator overhead, whereas techniques like deferred reference counting, which reduce the mutator overhead from barrier-like refcount operations, result in delayed collection and longer GC pauses! The paper also evaluates the storage reclamation in the Log Structured Filesystem, and how GC tradeoffs in the OS/filesystem context differ from a language runtime’s choices. Like I said, one heck of a paper!
 
The Alchemy of Garbage Collection
Okay, now we’re ready to start looking at specific refinements of the elemental algorithms. As mentioned above, generational collection is one of the most common refinements to basic tracing. The scheme is motivated by two insights. The first is that programs tend to allocate a lot of temporary data. Generational collectors take the heap and split off a nursery for new objects. When the nursery fills up, it will usually contain mostly dead temporary objects, which are “free” to reclaim with a copying collector. But what if an object in the main heap refers to something in the nursery? This is the second insight: older objects can only point to newer objects via mutation, which is comparatively rare (for older objects). To make sure that a young-generation collection doesn’t miss any objects, a remembered set records the locations in the main heap that point into the nursery, and allows the nursery collection to proceed without having to scan objects from the main heap. See the Handbook for implementation details.
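A sketch of the write barrier that keeps such a remembered set up to date (the object representation and generation tags here are invented for illustration):

OLD, YOUNG = "old", "young"

class Obj:
    def __init__(self, generation):
        self.generation, self.fields = generation, {}

remembered_set = set()

def write_field(obj, field, value):
    """Write barrier: record stores that create old-to-young pointers."""
    if obj.generation == OLD and value.generation == YOUNG:
        remembered_set.add((id(obj), field))   # a slot in the main heap to rescan
    obj.fields[field] = value

old_obj, nursery_obj = Obj(OLD), Obj(YOUNG)
write_field(old_obj, "child", nursery_obj)
print(remembered_set)   # the nursery collection treats this slot as an extra root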
 
There are a variety of refinements to reference counting. The most basic is deferred reference counting, which reduces the overhead of RC by ignoring intermediate updates. (This is an instance of the very handy algorithmic trick of epochs, which pops up in many subfields of systems research). Whereas standard RC immediately frees objects (recursively) when their counts reach zero, deferred RC with a zero count table (ZCT) instead says, “Oh, I’ll get to this later…” and stores the object in the ZCT. When it comes time to actually reclaim objects, anything in the ZCT that is not reachable from the program’s roots is garbage. (Note that this means we’ve introduced tracing to our RC scheme!).
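A sketch of the deferral idea, again with an invented object representation:

# Deferred RC with a zero count table (ZCT): a decrement that reaches zero
# doesn't free anything immediately; the object just goes into the ZCT, and a
# later pass frees whatever in the ZCT isn't reachable from the roots.
class Obj:
    def __init__(self):
        self.rc, self.refs = 0, []

zct = set()

def dec_ref(obj):
    obj.rc -= 1
    if obj.rc == 0:
        zct.add(obj)                    # "Oh, I'll get to this later..."

def reclaim(roots):
    reachable, stack = set(), list(roots)
    while stack:                        # note that tracing has snuck into our RC scheme
        o = stack.pop()
        if o not in reachable:
            reachable.add(o)
            stack.extend(o.refs)
    garbage = zct - reachable
    zct.difference_update(garbage)
    return garbage                      # these can actually be freed

a, b = Obj(), Obj()
a.rc, b.rc = 1, 1                       # one reference to each, somewhere in the heap
dec_ref(b)                              # b's count hits zero: into the ZCT it goes
print(len(reclaim(roots=[a])))          # 1: b isn't reachable, so it's garbage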
 
Ulterior reference counting is a variant of generational collection in which the mature space is managed using (deferred) reference counting instead of tracing. When the nursery fills up, the survivors are evacuated to the reference-counted mature space. Reference counts in the mature space are only computed at the end of a nursery collection. The underlying motivation is that tracing and RC have complementary properties which match the demographics of the nursery and old space, respectively. The URC evaluation, done in 2004, shows that their throughput matches a generational mark/sweep collector, while reducing max pause times from roughly 200ms to 50ms in heaps from 20 to 100 MB. URC also generally needed space overhead of less than 80% to avoid thrashing, whereas generational mark/sweep often needed 300% space overhead.
 
Shahriyar, Blackburn, and Frampton compare an optimized lazy reference counting system to non-generational mark-sweep, URC and the Immix collector (we’ll get to it soon). The “traditional” reference counting implementation degrades benchmark throughput by 30% on average versus mark-sweep, and their optimizations eliminate the disparity. They only compare throughput, not pause times, mutator utilization, floating garbage, or sensitivity to heap size. The paper provides some interesting statistics on object popularity: on their benchmarks, 99.35% of objects have reference counts that never exceed 3 bits, but the remaining 0.65% account for nearly 23% of reference counting operations.
 
Terei, Aiken, and Vitek had a different idea: why not let the programmer decide where to use tracing and refcounting within their own program? Their paper highlights memcached as a (very) realistic example of a program that challenges most GCs, because memcached servers must use nearly all the available physical memory without swapping or pausing [3]. Their multi-memory-management prototype showed that giving the programmer a fine-grained choice between two simple GC algorithms resulted in overall program performance that matched that of (unsafe) manual memory management with custom allocators. The M^3 prototype combines mark-sweep and a reference-counted heap. Swept objects in the traced heap are inspected for pointers into the RC heap, which must be decremented. Mutations to objects in the RC heap must update (insert into or remove from) a remembered set for the traced heap to use. No bound exists on the potential size of the remembered set; it is up to the programmer to choose a type partitioning that doesn’t result in too many cross-heap pointers.
 
Quick aside: another paper that proposed an extended interface between the programmer and garbage collector is the Data Structure Aware collector by Cohen and Petrank. The idea is to assume that the contents of container data structures are reachable until the programmer indicates otherwise. This allows them to avoid tracing memory that is unlikely to be reclaimable. In exchange for modifying about 8 lines of code, their technique reduced GC time between 11% and 76% for the various benchmarks they studied.
 
Hybrids and Heuristics
There are also hybrid schemes that don’t involve reference counting. Both of the following hybrid algorithms successfully combine the large-heap throughput of semispace collection with the small-heap robustness of mark-sweep and the reduced memory requirements of compaction.
 
Spoonhower, Blelloch, and Harper published a collector design that dynamically switches between copying and mark/sweep collection. Building on Bartlett’s mostly-copying collector, they divide the heap into fixed-size blocks, and use estimates of block residency to determine at collection time whether a given block should be evacuated (which combats fragmentation and is cheap for blocks containing few objects) or promoted in place, which avoids the cost of copying densely-populated blocks.
 
Their paper characterizes mark-sweep and semi-space collectors as implementing fixed policies for allocation and reclamation. Regardless of block residency, semi-space collectors evacuate all blocks, and mark-sweep promotes all blocks. Likewise, mark-sweep allocation via free lists will use space from all blocks (even heavily fragmented ones), and semi-space allocation with a bump pointer only allocates from completely unoccupied blocks. Their hybrid policy allows, for example, allocation from blocks that are at most 90% full, and evacuation for blocks that are at most 80% full.
 
Accurately measuring block residency requires a full-heap traversal, which would essentially double the GC’s overhead. Thus they produce an estimate of each block’s residency, by combining object age, previous residency measurements, and static configuration parameters.
 
Immix is another mark-sweep collector that uses copying to reduce fragmentation. The heap is divided into 32KB blocks, and blocks themselves are composed from 256 lines of 128 bytes each. Different threads allocate from separate blocks, using a bump-pointer allocator within spans of free lines.  When marking a block, Immix uses fragmentation (computed from residency statistics) to decide whether to mark/sweep or evacuate a block’s contents. (Compare/contrast: SBH evacuates sparse blocks; Immix evacuates fragmented blocks; the Metronome collector also defragments blocks, but does so post-sweep, using exact residency statistics, rather than heuristically during or before collection). Subsequent work has produced several variants of Immix, including precise and conservative reference-counting, that match the performance of the original mark/sweep implementation. There’s also an implementation in Rust.
 
Incremental Collection
An alternate way of looking at generational collectors is through the lens of incrementality. When the heap fills up, we must reclaim enough free memory to satisfy an allocation request. But a full collection of the heap is likely to reclaim far more free space than we need at the moment, and incur an unpleasantly large pause in doing so. For an interactive application, it would be nice to break up this big chunk of work into several smaller chunks. Collection algorithms which can collect a subset of the heap at a time are called incremental. If we squint a bit, generational collectors make the common case (nursery-only reclamation) incremental, but full-heap collections must be done atomically. There has of course been a great deal of work tackling the problem of making full-heap collection incremental.
 
The Memory-Constrained Copying collector (MC²) is a block-structured generational collector. It collects a single block at a time in order to avoid the thrashing behavior exhibited by simpler collectors under memory pressure. Each block in the old space has a logical number associated with it. While marking the old space, the collector constructs unidirectional remembered sets to allow separate collection of blocks in increasing order. Marking also computes the live volume in each block. Marking and copying are both done incrementally. Incremental copying is done when nursery collection happens, and operates on groups of blocks, chosen to restrict the size of their combined live data and remembered sets. Like many collectors that aim to limit pause times, MC² makes an effort to explicitly balance the rate of allocation and collection. Since the rate of old-generation collection is tied to the rate of nursery reclamation, which is in turn a function of young object survival rate and nursery size, MC² resizes the nursery after every collection. MC²’s evaluation showed that remembered sets stayed below 8% of the live data, and mark queue size didn’t exceed 0.7%. Throughput matched a generational mark-sweep collector, and mean pause times were reduced by an order of magnitude — from 264.68ms to 27.18ms. As an aside, the MC² evaluation includes data for pause time distributions and bounded mutator utilization, in addition to the usual throughput and max pause numbers. I wish every GC paper included such detailed performance data!
 
 
Incrementality is a restrained/restricted form of concurrency. Fundamentally, the work done by the garbage collector is being done in multiple steps, with the mutator running in between the steps. This means that the mutator can “see” — and modify — the objects that the collector is inspecting. In turn, we need more sophisticated invariants for the mutator and collector to correctly work together. In a non-incremental mark/sweep collector, there are two states of interest. White objects are dead-until-proven-live, and black objects are proven-live. Marking turns white objects black, and continues recursively until every pointed-to object is black. This establishes an invariant: black objects never point to white objects. At that point, sweeping frees white objects and un-marks black objects, turning them white.
 
But in an incremental (or concurrent) scheme, by definition the mutator will be running before marking establishes the aforementioned invariant. The danger is that the mutator’s action — for example by storing a reference to a white object into a field of a black object — can violate the invariant the collector needs to do its job. The solution is to introduce a new state: gray, which is an object that the collector has marked, but which might contain pointers to non-black objects. Gray essentially means “TODO.” Now, when the mutator stores a white reference into a black object, we can preserve the so-called strong tricolor invariant in one of two ways: either by marking the white object, or turning the black one back to gray.
 
As the name hints at, there is an alternative called the weak tricolor invariant. The basic intuition is that having a black object point to a white one is OK, as long as some other gray object exists that will eventually turn the white object black. The weak invariant is, in some sense, a dual of the strong one; preserving the strong invariant must block “creation” of white pointers in black objects, whereas the weak invariant focuses on deletions of white pointers from non-black objects.
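As an illustration, here’s a sketch of the first option for preserving the strong invariant: a Dijkstra-style insertion barrier that shades the stored white object gray (the object representation is invented for illustration).

WHITE, GRAY, BLACK = "white", "gray", "black"

class Obj:
    def __init__(self):
        self.color, self.fields = WHITE, {}

gray_worklist = []                      # the collector's "TODO" list

def write_barrier(obj, field, value):
    # Storing a white pointer into a black object would break the strong
    # invariant, so shade the white object gray (the other option would be
    # to revert obj itself to gray).
    if obj.color == BLACK and value.color == WHITE:
        value.color = GRAY
        gray_worklist.append(value)
    obj.fields[field] = value

src, dst = Obj(), Obj()
src.color = BLACK                       # the collector already finished scanning src
write_barrier(src, "next", dst)
print(dst.color, len(gray_worklist))    # gray 1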
 
Compressor
The Compressor algorithm is a clean, modern compacting design. It makes clever use of mark bitmaps to enable incremental and parallel compaction. The basic problem is: once marking completes, the mark bitmap shows which objects are live and which are dead, but there’s no indication of where in the heap each live object should be moved to. Here is Compressor’s trick: first, divide the heap into small segments (on the order of 512 bytes). The volume of live data in each segment can be calculated in parallel by inspecting the mark bitmap. A parallel scan then quickly computes the starting point for the compacted data in each segment. Given this preprocessed offset table, the final position of each live object in each segment can be easily computed. In addition, the inverse mapping can be easily constructed, giving the from-space range that will map to each to-space segment. Unlike most other compaction algorithms, Compressor runs in one pass, without requiring extra per-object metadata such as forwarding pointers. One last trick: Compressor identifies dense blocks and doesn’t waste effort compacting them (just like MC²).
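A sketch of that preprocessing step, with heap words standing in for objects and an invented segment size:

# Per-segment live counts from the mark bitmap, prefix-summed so each segment
# knows where its compacted data starts in to-space.
WORD = 8
SEGMENT_WORDS = 64                       # the real Compressor uses ~512-byte segments

def offset_table(mark_bitmap):
    """mark_bitmap[i] is True if the i-th heap word is live."""
    live_per_segment = [
        sum(mark_bitmap[i:i + SEGMENT_WORDS])
        for i in range(0, len(mark_bitmap), SEGMENT_WORDS)
    ]
    offsets, total = [], 0
    for live_words in live_per_segment:  # prefix sum = to-space start of each segment
        offsets.append(total * WORD)
        total += live_words
    return offsets

def new_address(word_index, mark_bitmap, offsets):
    """Final to-space offset of a live word: its segment's start + live words before it."""
    seg = word_index // SEGMENT_WORDS
    preceding = sum(mark_bitmap[seg * SEGMENT_WORDS:word_index])
    return offsets[seg] + preceding * WORD

bitmap = [False] * 128
bitmap[3] = bitmap[70] = True
offsets = offset_table(bitmap)
print(new_address(70, bitmap, offsets))   # 8: one live word (word 3) precedes it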
 
The concurrent variant of the Compressor is based on virtual memory protection. After concurrently computing the offset tables, the to-space pages are protected, and the mutators stopped to have their roots point to the protected to-space pages. When the mutators resume, the resulting VM traps trigger incremental concurrent compaction. In terms of the tricolor invariant, unprotected to-space pages are black as a result of compaction. Protected to-space pages are gray, and from-space is white. The mutator is gray when it operates in from-space, so it can deal unobstructed with white objects; when its roots are redirected to to-space, the mutator turns black.
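As a sketch of the mechanism only (not Compressor’s actual code), the trap-driven part could look roughly like this on POSIX, with addr_is_in_tospace and compact_into_page as hypothetical helpers and a 4 KB page size assumed. A real implementation also has to worry about async-signal-safety, races between threads faulting on the same page, and updating references as it copies:

#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <sys/mman.h>

#define PAGE_SIZE 4096

/* Hypothetical helpers: fill one to-space page by copying its live objects
 * from from-space, using the precomputed offset tables. */
extern int  addr_is_in_tospace(void *addr);
extern void compact_into_page(void *page_start);

static void trap_handler(int sig, siginfo_t *info, void *ctx) {
    (void)sig; (void)ctx;
    if (!addr_is_in_tospace(info->si_addr)) {
        /* A genuine crash, not a GC trap: restore default behavior. */
        signal(SIGSEGV, SIG_DFL);
        raise(SIGSEGV);
        return;
    }
    void *page = (void *)((uintptr_t)info->si_addr & ~(uintptr_t)(PAGE_SIZE - 1));
    compact_into_page(page);                            /* fill in the page     */
    mprotect(page, PAGE_SIZE, PROT_READ | PROT_WRITE);  /* unprotect: now black */
}

void install_compressor_trap_handler(void) {
    struct sigaction sa = {0};
    sa.sa_sigaction = trap_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);
}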
 
Scalable Garbage Collection
I learned the other day that Google’s Compute Engine lets you provision memory for virtual machines independently of core count. So if your two core/four GB VM is running out of memory, you can turn it into a two core/16 GB VM, for about… four cents an hour. Pretty nifty! But this leads to an obvious concern: if we increase our heap sizes (and live data) fourfold, won’t our worst-case GC pause time also rise along with it? Stop-the-world pauses would take minutes to hours (!!!) on very large heaps. That’s obviously pretty bad; what can we do to avoid it? Can we limit the GC’s maximum pause time and space overhead, independent of heap size and mutator behavior? Felix Klock (who now works on Rust at Mozilla!) studied this in his super awesome dissertation.
 
The key thing is: we mustn’t ever need to traverse the full heap all at once. Reference counting can satisfy this property, but needs tracing to be complete, and also suffers from fragmentation, meaning space overhead is mutator-dependent and difficult to bound. (Incidentally, Firefox’s GC implementation has evolved from stop-the-world to incremental to generational to compacting; the latter step was entirely to reduce fragmentation).
 
So to avoid fragmentation and give reasonable space bounds, we’ll need to move objects.
 
Well, OK, but some objects are super popular — a significant fraction of the heap is just copies of a pointer to that object. Wouldn’t moving those objects mean we need to traverse most of the heap to update the pointers? Indeed, but there are two ways out of that predicament. One is to add a layer of indirection with a read barrier; this is what most real-time collectors do. Klock had a different insight: you can just elect not to move the popular objects! After all, if you’re worried about objects that might take up, say, 1/8th of your heap, well, there can’t be more than 8 of those! Defining popularity in terms of max heap size is no bueno, though: we mustn’t let the size of the heap dictate our worst-case workload. Instead, the trick is to divide the heap into fixed-size regions, and define the popularity threshold as a multiple of the region size, rather than (a fraction of) the heap size. But the insight remains the same: your program doesn’t execute in Lake Wobegon, and not all of the regions can be more popular than average.
 
Having divided up the heap into regions, we must be able to collect a single region independently of the others; this is how Klock achieves scalability. Simple to state, but the details get pretty hairy; one false move and we lose scalability. To collect a region, we have to know what objects in it are referenced by other regions — a remembered set. BUT WAIT! There are two kinds of remembered set representations, points-into and points-out-of; the former has worst-case quadratic space usage, so remembered sets must be points-out-of. BUT WAIT! Even that’s not enough; the aggregate of all remembered sets can still scale with heap size, which would mean that collecting a single region would again involve scanning O(heap) data. The solution is to have a points-into summary set for each non-popular region; summary set sizes are bounded by the popularity threshold, which is independent of heap size. Ah, BUT WAIT! to construct a summary set, we must scan the full-heap remembered set! Now we get into issues of scheduling: first, amortization: each pass over the remembered set tries to summarize a fixed fraction of the heap’s regions (abandoning the popular ones); and second, we can be a bit lazy: summarization only needs to happen when we’re running out of regions with usable summary sets, and can end early when “enough” regions have been summarized. BUT WAIT! If we end early, the mutator can invalidate the summaries we so lovingly crafted. We can keep the summary sets up to date with write barriers (careful to keep the work O(1) amortized!). BUT WAIT! If summary sets grow too large they must be discarded, which means that the rate of collection must roughly keep pace with the rate of program mutation. BUT it also mustn’t go “too fast” or we’d jeopardize the minimum mutator utilization guarantee that scalable collection promises.
 
Well, I haven’t even covered all the interesting details, like the mark stack and snapshotting, but I hope you get the idea of why I think Klock’s work is an undeservedly-obscure tour-de-force. Experimentally, Klock shows that regional collection’s throughput is only a bit lower than a generational collector’s, even though worst-case pause times (and thus MMU) are greatly improved. The comparisons to other collectors illustrate many of the high-level tradeoffs between latency & throughput, and between guaranteed vs expected-case performance. For example, on one benchmark in a 2GB heap, the regional collector’s max pause time is 0.12s, while the G1 collector paused for 2.13s. But the G1 collector finished in 154s total, and the regional collector took 757s. The worst-case observed pause for the scalable collector on any benchmark was 240ms.
 
GC Scheduling
 
Klock’s work on regional garbage collection highlights that scheduling of garbage collection work versus mutator work is a fundamental issue. Such questions of scheduling appear in many places. The collector in Go 1.5 has a scheduling component (they refer to it as “pacing”) to balance heap growth and CPU usage. Chrome (that is, V8) explicitly tries to schedule garbage collection work during idle moments, so as to improve the browser’s responsiveness and reduce hiccups when playing animations (a soft-realtime task). And of course, much work on hard-realtime garbage collection centers around issues of scheduling and schedulability.
 
Hans Boehm published a tech report connecting scheduling to lazy reference counting. He observed that a consequence of deferred (“lazy”) RC is an increase in floating garbage, and presented an adversarial program that causes a lazy RC implementation to retain a quadratic amount of floating garbage. The solution he proposed is to permit O(n) reclamation work to match an allocation of size n. This is strongly reminiscent of Klock’s observation that the collector must “keep pace” with the mutator.
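Here is a toy sketch of that pacing rule. Everything in it (the object layout, the deferred list, the factor of two) is invented for illustration rather than taken from Boehm’s report, and child-pointer processing is elided:

#include <stddef.h>
#include <stdlib.h>

typedef struct Obj {
    size_t       refcount;
    size_t       size;
    struct Obj  *next_deferred;   /* intrusive list of pending frees */
    /* ... payload and outgoing pointers would follow ... */
} Obj;

static Obj *deferred_head = NULL;

/* Instead of recursively freeing on the spot, queue the dead object. */
void dec_ref(Obj *o) {
    if (o != NULL && --o->refcount == 0) {
        o->next_deferred = deferred_head;
        deferred_head = o;
    }
}

/* Process queued frees, but only up to `budget` bytes of them. */
static void reclaim_some(size_t budget) {
    while (deferred_head != NULL && budget > 0) {
        Obj *o = deferred_head;
        deferred_head = o->next_deferred;
        /* A real implementation would dec_ref each of o's children here,
         * possibly queueing more deferred work. */
        budget = (o->size >= budget) ? 0 : budget - o->size;
        free(o);
    }
}

/* Pacing: an allocation of n bytes pays for O(n) bytes of reclamation, so
 * floating garbage can't pile up quadratically. */
Obj *rc_alloc(size_t payload_bytes) {
    reclaim_some(2 * payload_bytes);   /* the constant is a tuning knob */
    Obj *o = calloc(1, sizeof(Obj) + payload_bytes);
    if (o == NULL) return NULL;
    o->refcount = 1;
    o->size = sizeof(Obj) + payload_bytes;
    return o;
}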
 
Conservative Collection
OK, time for a quick detour on conservative collectors. I mentioned at the start of the post that garbage collection starts from roots: in particular, from registers and the stack. Often the compiler will emit data structures to help the runtime system figure out which registers and stack slots contain heap pointers, and which do not. Likewise, individual objects in the heap will be given a descriptor indicating which fields should or should not be traced. But sometimes people want to use GC with a compiler that doesn’t know anything about GC, like Clang or GCC. And it can be done, with a few caveats!
 
The core idea is to assume that anything that looks like a pointer is a pointer. This implies that conservative collectors can’t move objects, since they don’t know for sure whether they have an actual pointer or an integer that happens to look like a pointer. And even if it would be fun and easy, the collector shouldn’t go about tweaking the mutator’s hard-earned values.
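The heart of a conservative root scan is something like the sketch below, where heap_contains and mark_object stand in for the real collector’s machinery; the Boehm-Demers-Weiser collector’s actual checks are considerably more careful about heap bounds, block headers, and alignment:

#include <stddef.h>
#include <stdint.h>

extern int  heap_contains(uintptr_t addr);   /* hypothetical: plausible object address? */
extern void mark_object(uintptr_t addr);     /* hypothetical: treat it as live */

void scan_range_conservatively(const uintptr_t *lo, const uintptr_t *hi) {
    for (const uintptr_t *p = lo; p < hi; p++) {
        uintptr_t word = *p;
        /* If the word happens to look like a pointer into the heap, we must
         * assume it is one.  We may keep some dead objects alive (an integer
         * that aliases a heap address), and we can never move the
         * "pointed-to" object, because rewriting `word` might corrupt an
         * innocent integer. */
        if (heap_contains(word)) {
            mark_object(word);
        }
    }
}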
 
One nifty trick is that conservative collectors can actually move objects in many circumstances. If the compiler provides just a tiny bit of information to the runtime about the layout of heap objects, you get a heap-precise design. Basically, pointers in registers or the stack are treated conservatively, but pointers within the heap can be positively identified, and thus moved/compacted/etc. Really, most of the issues and design decisions for non-conservative (aka fully-precise) collectors also apply to their conservative brethren. For example, Boehm-Demers-Weiser, the most well-known conservative collector, can run incrementally and generationally. And Ironclad C++ is an extension to Clang that supports heap-precise GC.
 
GC Case Study: Go
Early versions of Go used Boehm-Demers-Weiser. This brought two major problems. First, the conservative nature meant that Go programs on 32-bit machines would sometimes spuriously run out of memory. One widely-derided post on the golang-nuts mailing list suggested a workaround: stick to integer values below 10000. Heh. This situation improved with Go 1.1, which was heap-precise, and Go 1.3, which was fully-precise.
 
Second, multi-gigabyte heaps had multi-second stop-the-world pause times. With the release of Go 1.5, they implemented a concurrent mark-sweep collector, which reduced pause times from multi-second to a few milliseconds. For a more real-world take, Twitch published a blog post detailing how their observed pause times were reduced from multi-second with Go 1.2, to 200 ms with Go 1.5, to 100ms with Go 1.6, to 1ms with Go 1.7.
 
Note that Go’s garbage collector doesn’t deal with fragmentation. This may or may not be an acceptable engineering tradeoff for Go’s use cases.
 
On Bump Pointer Allocation
Many discussions about garbage collection eventually touch on the benefits of bump-pointer allocation. Intuitively, it seems like it should be much faster to just do an increment and conditional jump, rather than searching through a bunch of freelists. Case in point: the Hacker News discussion regarding the aforementioned Twitch blogpost. Surprisingly, this turns out to be a vastly smaller difference than most people would think. Measurements from Blackburn et al. showed that allocation was less than 10% of program runtime, and that bump-pointer allocation was only about 10% faster than freelist allocation; the overall throughput advantage therefore works out to about 1% of program runtime. That’s about the same performance change to be had from passing extra unused environment variables to your program.
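For reference, the bump-pointer fast path in question is tiny; something like this sketch, where alloc_ptr, alloc_limit, and slow_path_allocate are hypothetical pieces of per-thread allocator state:

#include <stddef.h>
#include <stdint.h>

extern uint8_t *alloc_ptr, *alloc_limit;      /* current bump region      */
extern void *slow_path_allocate(size_t n);    /* refill from the collector */

/* Fast path: one subtraction, one compare, one add. */
static inline void *bump_alloc(size_t n) {
    uint8_t *result = alloc_ptr;
    if ((size_t)(alloc_limit - result) < n) {
        return slow_path_allocate(n);         /* out of space in this buffer */
    }
    alloc_ptr = result + n;
    return result;
}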
 
Fragmentation
There’s at least one study comparing the throughput of malloc/free versus garbage collection. The finding was that if you have a lot of free memory to work with (about 3x your live data), tracing GC is just as fast as malloc/free, but at smaller heap sizes, you spend a lot of time re-inspecting your live data to determine what’s dead. In effect: when free memory is limited, GC appears to have a soft failure case (slowing down) that malloc/free avoids. But that’s not really the whole story. These benchmarks are fundamentally batch workloads that run for anywhere from seconds to hours, then exit. But on the web, both servers and browsers can run continuously for days, months, or years at a time [4]. On such long-term workloads, there’s another factor that comes into play: fragmentation. Any non-moving collector (malloc/free, reference counting, and mark/sweep) will hit some amount of fragmentation. As a failure mode, fragmentation’s costs manifest in two ways. First, there’s a “global” program slowdown from increased effective working set & decreased locality; second, if fragmentation gets bad enough, allocation will eventually fail, and your program will crash (or, uhm, corrupt data and then crash).
 
So if you want to not crash, you need to size your heap as some multiple of your live data. How much fragmentation do you need to account for? There are a few research results on the theory side. First, Robson proved in the 1970s that the theoretical worst-case is logarithmic in the size of the maximum allocation. That’s pretty atrocious; even supporting piddly 1MB arrays means potential overhead of 20x, since log2 of 1 MB is 20! More recently, Cohen and Petrank showed that compaction can drop the needed overhead in (one particular set of) realistic circumstances to… 3.5x. Amusingly similar to the numbers cited by Hertz and Berger!
 
As for practice, there is a smattering of research with empirical measurements of memory fragmentation. Johnstone and Wilson showed that on a selection of 8 programs that allocated 27 MB on average, Doug Lea’s malloc implementation limited fragmentation to 4% in the worst case [5]. This is a good result to know, but I’m not confident it can be usefully extrapolated to long-running web servers or browsers, which (A) allocate several orders of magnitude more data, and (B) have nondeterministic allocation patterns, dependent on network and/or user input. Supporting this skepticism is a study from Bohra and Gabber measuring several malloc implementations on a long-running caching web proxy. They found that fragmentation varied by allocator from 30% to heap exhaustion (dlmalloc clocked in at 80% overhead). They also noted that the impetus of their study was noticing that their heap was several times larger than their live data. Bacon, Cheng, and Rajan echo this limitation in their summary of J&W. They also show that fragmentation in SPECjvm98 can reach nearly 40%. More recently, Jula and Rauchwerger benchmarked larger, more intensive workloads on a few different allocators, finding that (A) dlmalloc had 10.8% average and 23.5% worst case fragmentation, and (B) faster allocation came at the expense of more fragmentation.
 
Real World Fragmentation
I was (and am!) curious about how much of a problem fragmentation is in practice. I’ve seen lots of statements acknowledging that fragmentation is a big enough problem to be worth avoiding (examples: Mozilla’s Stuart Parmenter in 2007, Huey, Coppeard & Nethercote 2011-2016, Facebook in 2011, Aerospike in 2015, suniphrase in 2015, Deep in 2016, Jonathan Blow in 2016, gperftools#756).
 
Most of the time, people don’t provide much in the way of hard data on what sort of fragmentation they’re seeing — not even heap sizes, much less allocation size distributions. When people do provide numbers, they tend to be eye-opening. Aerospike reported that “big-memory (192 GB RAM) server nodes were running out of memory and crashing again within days of being restarted.” In 2012, Kirill Korotaev exhibited a ~20 line program, distilled from a real app, that caused 20x (!) fragmentation overhead in glibc’s allocator, needing 1.8 GB for a 93 MB live set.
 
Avoiding the space overhead of fragmentation can have significant time costs, even for systems that you don’t usually think of as doing garbage collection. For example, Julia Evans hit a case where the Linux kernel spent 30% of her program’s runtime compacting pages. CloudFlare also has a blog post describing how a form of GC in the Linux kernel for network packet metadata caused hard-to-reproduce pauses of 30s for requests that usually finished in milliseconds. Finally, GCC also has custom infrastructure for GCing the compiler’s allocations; I haven’t measured its usage directly.
 
TLB Reach
Did you know that modern large-memory workloads can suffer double-digit performance overhead from virtual memory? (studies claiming (mean // max): (9 // 18)% overhead for 8 GB of RAM, (? // 16)% overhead for 16 GB, and (?6? // 53)% for 96 GB using 2 MB pages, and (18 // 40)% for 4 GB VMs on a 24 GB server) The middle study notes that traditional benchmark suites, like SPEC and Parsec, have small working sets that don’t stress the TLB — unlike modern servers! Also, these days you can buy servers with multiple terabytes (!!) of physical memory. There is active research on reducing TLB performance overhead, but it’s anyone’s guess when — or if — such schemes will be available on commodity hardware. On the plus side, virtualization doesn’t appear to significantly increase TLB overhead. Finally, here’s a case where Linux’s support for automatically using large pages caused a 10x performance regression.
 
 
Pushing the Envelope
 
We’re getting close to the end, I promise! Let’s wrap up with brief descriptions of a few interesting collectors that have pushed the boundaries of real-time and concurrent collection.
 
Schism
A fundamental problem faced by all garbage collectors is controlling fragmentation. Most non-real-time collectors take one of two approaches:
 
  1. Ignore the issue by not moving objects, hoping that the mutator doesn’t get unlucky and induce fragmentation that free-block-coalescing cannot handle; or
  2. Move objects, but in doing so, trade space- for time-problems (in a sense) and accept that the copying/defragmentation process may need to block the mutator while it clears out a fragmented section of the heap.
The Schism collector is one of a small set of real-time collectors that takes a different tack. It eliminates external fragmentation by using fixed size chunks for all allocation requests. Objects larger than the chunk size are represented with a linked list of chunks. Arrays are represented with two layers of indirection, as a fixed-size object header pointing to a variable-length spine which contains pointers to the fixed-size chunks holding the array elements. And in the common case where contiguous space is available, Schism can use it for arrays, thus avoiding the overhead of spine indirections. Object fragments and array data chunks are managed in a non-moving, concurrently-collected Immix-structured heap.
 
Array spines are the only variable-size allocations, and thus the only possible source of fragmentation. But here’s a cool trick: since the data blocks never move, the pointers in the spine never change, and the spine is thus immutable. This in turn means that when spines are copied, the mutator is free to use either copy. Furthermore, since the mutator is forced to indirect to the spine through a so-called sentinel object, updating the mutator takes only a single compare-and-swap.
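A rough sketch of that representation, with invented names and field layout (the paper’s actual object model differs, and the contiguous-array fast path is omitted):

#include <stddef.h>
#include <stdint.h>

#define CHUNK_ELEMS 8

typedef struct {
    int64_t elems[CHUNK_ELEMS];
} Chunk;

typedef struct {
    size_t  length;
    Chunk  *chunks[];         /* immutable once built: the chunks never move */
} Spine;

typedef struct {
    Spine *spine;             /* the sentinel's single mutable field; replacing
                                 the spine is one compare-and-swap on this slot */
} Sentinel;

typedef struct {
    Sentinel *sentinel;       /* array header reaches the spine via the sentinel */
} ArrayHeader;

int64_t array_get(ArrayHeader *a, size_t i) {
    Spine *s = a->sentinel->spine;            /* either copy of the spine works */
    return s->chunks[i / CHUNK_ELEMS]->elems[i % CHUNK_ELEMS];
}

void array_set(ArrayHeader *a, size_t i, int64_t v) {
    Spine *s = a->sentinel->spine;
    s->chunks[i / CHUNK_ELEMS]->elems[i % CHUNK_ELEMS] = v;
}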
The benefits provided by this scheme are:
 
  1. Bounded fragmentation: Schism will never need more than 31.04% extra space beyond that required for live data (this does not account for internal fragmentation).
  2. Constant time wait-free heap access for the mutator.
  3. Sub-millisecond pause times on a 40 MHz processor
  4. Excellent throughput for a real-time collector, only 1/3rd lost versus the desktop Hotspot 1.6 JVM.
Since Java doesn’t support interior pointers, neither does their collector, although I believe that interior pointers for object fields would be an easy extension. Support for interior pointers within arrays would be trickier but still doable, at least in certain circumstances.
 
To the best of my knowledge, Schism has never been publicly benchmarked on an x86 or ARM CPU, nor on a multi-gigabyte heap. But I’m very curious to know how it would do in those situations! One last cool bit: the core of Schism’s collection algorithm has been verified against the x86-TSO memory model in the Isabelle/HOL proof assistant.
 
Block-Free G1
Building on their earlier work on lock-free concurrent compaction, Erik Österlund and Welf Löwe added a syscall to let the GC introspect on the OS scheduler’s thread state, which allows them to perform critical operations — stack scanning and copying — without ever blocking mutator threads. Memory management for concurrent data structures with progress guarantees is a hard open problem, and this line of work makes a significant step towards a solution. Their modifications to OpenJDK’s G1 collector reduced 99.99th percentile latency from 174ms to 0.67ms for one of their benchmarks. Overall, average GC pause time dropped 8-fold, at the cost of 15% degraded throughput.
 
One important point they make is that GC algorithms which use virtual memory (mprotect) can’t be non-blocking, because the page tables can only be modified when the kernel holds a non-preemptible lock. I don’t have much intuition for the costs of page table modifications. It would be very interesting to see what impact that source of blocking would have on an otherwise non-blocking GC design. Of note: Pauseless, C4, Collie, and Concurrent Compressor all use mprotect. Shenandoah eschews mprotect to avoid read storms.
 
Shenandoah 
Shenandoah is an effort to produce an open-source, industrial strength low-pause-time garbage collector, as part of the OpenJDK project. Broadly speaking, it aims to be a fully-concurrent and -parallel mark-copy collector, using both read & write barriers to enable concurrent compaction. There haven’t been many academic publications on Shenandoah yet, though Christine Flood recently published a paper on it; there’s also a smattering of blog posts and slide decks from presentations. Shenandoah is available in the default JVM distributed with Fedora 24. It would be very interesting to see a head-to-head comparison of Shenandoah with Österlund & Löwe’s OS-dependent G1 variant.
 
Pauseless and C4
In 2005, Azul Systems published a paper detailing the garbage collection algorithm, and associated support from their custom CPU and OS, that allowed them to obtain low pause times (21ms) on very large (50+ GB) heaps. The state of the art until the early 2000s was 10-20% maximum runtime overhead from software read barriers, with average-case overhead of roughly half that. But Azul’s hardware-supported read barrier has an average cost of one cycle (!), and overall mutator utilization was reported at 98%.
 

In contrast to Compressor, which protects (all) tospace pages to trigger evacuation, Pauseless protects fromspace pages (incrementally). Pauseless continuously runs a concurrent and parallel mark phase (like G1), which generates mark bits and computes forwarding information (on-the-side). Using residency statistics, Pauseless protects sparsely occupied pages. Such pages are evacuated, and the underlying physical memory immediately freed. There will still be references in the remainder of the heap pointing into evacuated pages; the mutator and collector cooperate to find and correct such pointers.

When the mutator loads a reference to an unmarked object, a hardware trap notifies the marker threads. When the mutator loads a reference to an object on a protected (and thus, likely evacuated!) page, the forwarding tables are queried to find the fresh copy, and the memory containing the original stale reference is updated (the paper refers to this as “self healing”). The collector finishes a GC cycle by traversing the entire heap to find any remaining pointers into fromspace, and fixing up the stragglers. Now, having established the invariant that all pointers in the heap point into tospace, the virtual memory for fromspace pages can be reused.
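In software-flavored pseudocode, the loaded-value barrier behaves roughly like the sketch below; is_protected_page, lookup_forwarded, needs_marking, and enqueue_for_marking are hypothetical stand-ins for what Azul actually does with hardware traps and on-the-side forwarding tables:

#include <stdbool.h>
#include <stddef.h>

typedef struct Object Object;

extern bool    needs_marking(Object *ref);           /* hypothetical */
extern void    enqueue_for_marking(Object *ref);     /* hypothetical */
extern bool    is_protected_page(Object *ref);       /* hypothetical */
extern Object *lookup_forwarded(Object *ref);        /* hypothetical */

Object *read_barrier(Object **slot) {
    Object *ref = *slot;
    if (ref == NULL) return NULL;
    if (needs_marking(ref)) {
        enqueue_for_marking(ref);          /* notify the marker threads     */
    }
    if (is_protected_page(ref)) {
        ref = lookup_forwarded(ref);       /* find the evacuated copy       */
        *slot = ref;                       /* "self healing": fix the slot,
                                              so it never traps again       */
    }
    return ref;
}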

 
C4, the Continuously Concurrent Compacting Collector, adds generations to Pauseless. Unlike most generational designs, the nursery is not managed with a stop-the-world copying collector. Rather, both the nursery and mature space are essentially separate Pauseless heaps, with a precise card marking remembered set. They added new Linux system calls to support batched and non-TLB-flushing remappings, yielding order-of-magnitude speedups versus the stock kernel. The upshot is that C4’s throughput matches ParallelGC in Java while maintaining small-in-practice pauses.
 
On the subject of Azul’s port to x86, Cliff Click said:

Azul’s GC required a read-barrier (some instructions to be executed on every pointer read) – which cost something like 5% performance on an X86. In exchange max GC pause time is something in the low microsecond range

and noted that they didn’t make use of any novel x86 features:

No change to the X86, instead user-mode TLB handler from RedHat allows ptr-swizzeling, that plus some careful planning and the read barrier fell to 2 X86 ops – with acceptable runtime costs.

 
Collie
The core idea behind Collie, which builds on Pauseless and C4, is to use hardware transactional memory to atomically and incrementally compact objects and update all incoming pointers. Obviously this cannot be done for popular objects, so popular objects are temporarily pinned, with the mutator and collector collaboratively updating popular references. There are of course many more details covered in the paper.
 
Like the other collectors from Azul, Collie is an engineering marvel. It pulls out pretty much every trick in the book: read barriers, write barriers, non-global checkpoints, partial conservativism, virtual memory and HTM support. Like all good engineering, Collie makes careful tradeoffs, in exchange for desirable results: Collie delivers an absolutely staggering 80-90% MMU (!) at 1ms resolution (!!) for heap sizes up to 30 GB (!!!) when continuously collecting. One issue I wish the paper had covered more was the effective space overhead of their per-object remembered sets; it’s unclear whether the space overhead for their MMU benchmarks was 50%, or 400%, or something completely different.
 

Footnotes

[0] Seven thousand words is still a “whirlwind,” right?
 
[1] There’s one more quirk of mutator utilization, which is that it can be “inflated” by making the mutator less efficient. This can occur for two reasons. First, different language implementations have different “baseline” levels of generated-code quality, so the same GC implementation will appear to have a better MMU for a slower language. Second, the overhead of high-resolution measurement with tools like Pin can introduce similar effects. But in practice that’s not an issue most people need to worry about.
 
[2] To better match the human intuition that the mutator’s utilization doesn’t actually get worse at larger intervals, we can instead focus on bounded mutator utilization (BMU). Numerically, BMU at timescale x is the minimum of MMU over all timescales x and larger. Graphically, BMU is the monotonically increasing graph you get from MMU with the “peaks” flattened out.
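Concretely, given MMU sampled at a list of increasing window sizes, BMU is just a suffix minimum. A tiny sketch (array-based, names mine):

#include <stddef.h>

/* bmu[i] = min(mmu[i], mmu[i+1], ..., mmu[n-1]): the worst utilization seen
 * at this window size or any larger one. */
void bmu_from_mmu(const double *mmu, double *bmu, size_t n) {
    double running_min = 1.0;   /* utilization is a fraction in [0, 1] */
    for (size_t i = n; i-- > 0; ) {
        if (mmu[i] < running_min) {
            running_min = mmu[i];
        }
        bmu[i] = running_min;
    }
}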
 
[3] It’s common to use memcached as a caching filter in tree-structured RPCs within a datacenter. This pattern of usage means that tail latency at nodes very quickly becomes median latency at the root of the tree, which is why it’s critical to bound worst-case latency at the nodes and not just settle for good median or average latency. When you want sub-millisecond level QoS, even the OS scheduler might need to be re-examined…
 
[4] Here’s a real story of a long-running server that hit a GC failure case. Interestingly, the problem wasn’t fragmentation, but rather a generational collector encountering object demographics that didn’t match the generational hypothesis.
 
[5] They also point out that there’s no single definition for how to measure fragmentation, and highlight a few potential definitions, each with various shortcomings.

Adblock History and an Unhatched Easter Egg

The first browser I ever used was Netscape Navigator, and that allegiance continued over the years with the Mozilla Suite, then Camino, then Phoenix/Firebird/Firefox. Thinking about Camino makes me both happy and sad. It was far ahead of its time in terms of user experience, but it stagnated due to lack of attention, as its developers were plucked away to work on Safari or Phoenix. In 2002, it was the browser of 2005; but in 2008, it was still the browser of 2005.

Anyways, I digress. Around 2004, when I was in high school, I was a user of Firefox nightly builds. This was before the existence of extension versioning or addons.mozilla.org. The closest equivalent to AMO was mozdev, but that’s a bit like comparing the App Store to github. The mozillazine forums were the watering hole for users, extension developers, and browser core devs alike. At one point, a nightly build broke my favorite plugin, Adblock. After a shockingly long time (two whole days!) went by without a fix by the extension’s author, I decided to poke around and see if I could hack together a fix myself. Well, it worked; after a few more such tweaks, I wound up on an IRC channel with the project’s then-current maintainer/developer, who called himself “rue.” We spent most of the next year collaborating on the Next Big Release of the extension.

Unfortunately, we got sucked into Second System Syndrome, without having had the benefit of having fully built the first version ourselves. Basically, every rule of thumb that Joel Spolsky wrote about, we violated (unintentionally). We tried to rewrite the technical infrastructure, while simultaneously adding many ambitious new features. We were aiming for an Apple-style Grand Curtain Drop on a shiny new toy. Tempting, but foolish. While we were toiling away in secret, pressure was mounting for something, anything, to improve upon the actively-bitrotting public version. Eventually, a previous developer of Adblock, Wladimir Palant, returned to the scene and forked the project to create Adblock Plus.

One interesting side effect of an extremely long development cycle is that it allows for meandering diversions from the yellow brick road. The other developer, rue, found that Firefox extensions could alter the requested URL without having the alteration appear in the browser’s address bar. This held the potential for a great deal of mischief. On the more innocuous end, an April Fools Day prank might have reversed the URLs for NBC.com and ABC.com, or NFL/NHL, etc.  On the less innocuous end, rue prototyped a module that would, randomly with small probability, replace Amazon referral codes. It would have been the Office Space penny-shaving scheme, more or less. I objected on grounds of skeeviness and/or illegality, but it was a moot point, since we never released a completed version.  Adblock Plus ended up taking a bolder, more direct approach to monetization.

Anyways, while rue was busy poking at Firefox internals and de-obfuscating Gmail’s client-side code, I amused myself by devising an easter egg, and trying to conceal its presence in a non-obvious way. When it comes to such things, Firefox extensions have a distinct advantage over traditional software: they can put UI on a web server, so that a review of the extension’s source will see, at most, an innocuous-looking URL. While normal content served over the web is sandboxed, extensions can collude with a page to do things that a normal page cannot do. So I made a little recreation of the “Wake up, Neo…” scene from the Matrix.

Here is the easter egg in action; you’ll need to view it fullscreen to see the text:

If you have the sound up high enough, you’ll hear a clicky noise when each green character is “typed.” It’s actually the same noise used in the original movie! That’s called craftsmanship. 😉  One other subtle point: Adblock’s UI is in the lower-right corner, as a statusbar item. The modern convention of using toolbar logo buttons wasn’t yet established at that time.

Had anyone used a packet sniffer or a lucky guess to figure out the URL for the easter egg and visited it directly, they would have seen this instead:

[Screenshot: the “cypher” decoy page, viewed in Internet Explorer]

The easter egg also altered Adblock’s entry in the Extensions list, changing the version from 0.6 to “The 1.0”, replacing the crosshair-style logo to an image of Agent Smith, and altering the description to include parodies of quotes from Agent Smith and Morpheus.

[Screenshots: the altered Extensions list entry, captured in an “IE7” VirtualBox VM]

Final useless random historical fact: Adblock was the 10th extension added to AMO. I have no idea how many there are now, probably in the mid thousands.

Email Spelunking

Going through some old files and emails, it’s been interesting to dip into the not-so-distant past. Unfortunately sent messages were very sparsely archived, so I have only the incoming half of most email conversations. And sadly, all my email before 2002 appears to have been lost or corrupted. Oh well.

A few bits and pieces:

  • Xbox Live beta emails in Oct 2002 – “You still could get in!”
  • iTunes Music Store launches on Apr 29 2003: “Your music. Your way.”
  • Emails about the beta for the (recently defunct) Unison client by Panic.
  • Emails from thefacebook.com, before they switched to facebook.com (I joined in June 2005 – soon to be ten years!).   As late as August 2005, your Wall was a single mutable text field, so emails said “X has changed your wall.” By September, the terminology was changed to “X has written on your wall” as the list-of-posts model was adopted. Techie bonus: mail was sent with a user agent header of “ZuckMail”. Also, Wirehog.
  • At one point, I purchased the domain ILBEN.COM… so that I could write my email address as em@ilben.com   …………… facepalm
  • Registered eschew.org on June 10, 2002.
  • I seemed to have sent a lot of email to random internet folks:
    • Adam Iser, about how much I liked Adium
    • Dan Gillmor, about using Disk Copy to make disk images from DVDs
    • Nick Zitzmann, about adding Applescript support to Battlegram (a chat client for battle.net)
    • Damien Gallop, about using RDC instead of VirtualPC…???
    • Rob Griffiths, about featuring LaunchBar as a Product of the Week on macosxhints.com
    • I distinctly remember writing to Andy Ihnatko but those emails appear to have been lost to the sands of time.
    • Norbert Heger, about his experiences as an independent developer producing LaunchBar, to which he responded: 

      Success as a shareware developer depends on so many different factors, so it’s a bit difficult to give a general answer. You need a good idea, a good implementation, good PR and marketing, good support, etc. Failing in just one of these areas can make a huge difference in terms of success. The only advice I can give is to try it yourself.

  • Correspondence with Nigel McFarlane. Extremely bittersweet for me: he bravely and generously took a chance on me to write a few pages and get my name in a physical book… and committed suicide three months after it was published. Even though I barely knew him, the sense of loss and pain hasn’t diminished in the last ten years.
  • A handful of emails from my uncle, mostly followups to spirited arguments we had over things like whether the world would be better off if every television set vanished into the ether. In retrospect, seeing what I wrote makes me cringe – I was an ass, and being fifteen isn’t much of an excuse. I didn’t notice/appreciate it at the time, but with twelve years of elevated perspective, there was a great deal of wisdom (of the life-experience variety) shining through his words. But he understood what I did not:

    There are, unfortunately, some things that take years to be put into perspective, and even being extraordinarily bright provides no shortcut to such insights nor any guarantee of happiness.

On LLVM’s GC Infrastructure

LLVM’s documentation suggests that its getelementptr instruction is one of the most commonly misunderstood aspects of the system. In my experience, GEP is only number two on the list; number one is LLVM’s GC support: what it provides, what it doesn’t provide, and how to use it effectively.

The first two are the easy ones: LLVM does support stack maps, read and write barriers, fully-precise collection, and relocating collectors. It does not require a shadow stack, though it does provide one, which is helpful for first-time GC clients to get started with. LLVM’s overall GC design supports concurrent collectors and mutators, but the implementation lacks a few key ingredients to minimize the performance impact of GC in such situations.

There are a few missing features on the implementation side which obviously don’t need major design overhauls to fix. The big three on my list are code emission at safe points, safe points on back edges, and support for register-allocation of GC roots.

The last one is sort of interesting, because I’ve seen several proposals to change the design of LLVM’s GC infrastructure in a major way to obtain it. The reasoning goes like this: we currently have llvm.gcroot, which marks a stack slot address as being a GC root. But we want to store roots in machine registers, not stack slots, so we need a new intrinsic to annotate an SSA register as being a GC root as well. With such an intrinsic, LLVM would automatically store and load SSA registers to/from stack slots “behind the scenes.”

So, for example, suppose we start off by acquiring a GC-traced pointer:

%root = alloca (gcroot)
%p = … allocate a pointer …
store %p into %root

If we had a non-copying collector, we could get away with code like this:

%r = … maybe trigger GC…
… use %p …

But with LLVM’s current GC support, a copying collector must write this less-legible IR instead:

store %p into %root
%r = … maybe trigger GC…
%p2 = load %root
… use %p2 …

When repeated over and over, it definitely makes the IR harder to read. Wouldn’t it be nice if LLVM would automatically do all that stuff for you? Like this:

… (no alloca needed!)
%p = … allocate a pointer …
call llvm.gc.reg.root(%p)
…
%r = … maybe trigger GC…
… use %p …

It’s a natural desire; the first time I encountered LLVM’s GC support, I wanted the same thing. But it’s a wolf in sheep’s clothing. Remember: the fundamental property of LLVM is that an SSA register is an immutable value. Not just “observably equivalent as far as my high-level program can see,” but equivalent under any observation possible at the IR level!

The problem is that under the hypothetical gc.reg.root scheme, LLVM is free to do constant propagation before the rewrite to the verbose version, and if it does so, then the program’s semantics will have changed! The root cause is that %p has been transmuted from a constant to a variable by virtue of being marked as a register root. But in doing so, we’ve taken the key benefit of LLVM’s SSA representation—the power it gets thanks to immutable bindings—and thrown it right out the window!

 

Take another look at the “verbose” pseudo-IR listed above. The common interpretation of this snippet is “spill %p into a stack slot, perform some operation, and then reload %p from the stack slot.”  This is reasoning about the IR according to its current implementation. But I’d argue that it’s better to think about the abstract semantics of the IR. Abstractly, we’re just encoding the fact that the value we think of as %p might change at some point in the IR; beforehand, we called it %p, and afterwards, we called it %p2. Since phi nodes aren’t really viable for most frontends, the only way LLVM allows us to encode mutability is via memory – i.e. alloca.  But the real point isn’t to set up stack slots, it’s just to reflect pointer updates that the GC might be making. As long as LLVM can verify that the “stack slot” address doesn’t escape, it is free to back that mutable storage by symbolically addressed memory: namely, machine registers! It won’t do that analysis now, but it could. That’s the point: no major design tweaks are needed to get register allocation of GC roots, just a bit of extra analysis internal to LLVM.

A separate question is how much of a difference this would make in practice. Effectively, the lack of register allocation means that GC roots are stored in caller-saved machine registers; maybe that’s a performance-killer, or maybe it’s a wash. To the best of my knowledge, nobody has ever carefully looked at the impact of naïve versus varying degrees of sophisticated treatment of GC roots in LLVM-compiled code.

 

Stepping back even more, I think the reason there’s so much confusion about LLVM’s GC support is that there’s a philosophical impedance mismatch with the rest of the project. Elsewhere, LLVM tries hard to allow unsophisticated frontends to get code quality equivalent to that of a more sophisticated frontend. The canonical example is that with mem2reg, a frontend needn’t worry about SSA conversion or dominance frontiers or anything like that. In general, LLVM’s philosophy is to move as much work from the frontend to the common LLVM infrastructure as possible. But LLVM’s GC support is different. There’s no automatic transformation from “dead simple” to “efficient” use of GC primitives. Instead, it’s up to each client to do whatever analysis is needed for the most efficient use of the existing GC-related primitives. In practice, this means that clients need to do data-flow analysis if they want to reuse dead root slots. This is in stark contrast to the negligible level of Dragon-book sophistication otherwise expected of a frontend.

 

In terms of how to use LLVM’s GC support effectively, I’ve found it clearest to think of the process as a transformation between separate intermediate languages. The goal of the transformation is to add a minimal selection of root slots, along with just enough loads and stores to preserve correctness. The crucial invariant is that at each potential GC point, the live roots must reside in (abstract) memory. So: backwards analysis to determine liveness, and a forwards rewrite to add the loads and stores. Interestingly, this combination of a backwards analysis with a forwards program rewrite is not well-supported by existing dataflow tools or their underlying theoretical frameworks. Anyways, to implement the given invariant, you’ll need to know which types require tracing, and which operations may end up invoking the GC. You can sort of encode this information via metadata and pointer address spaces, if using LLVM IR as your source IR, but I think it’s easier to think of the two IRs as being completely separate.

Date-Preserving Patch Folding with Mercurial Queues

Mercurial Queues are a powerful tool for organizing changes into logically coherent commits. My current workflow often goes something like this:

  1. Make some changes implementing feature A (as a patch).
  2. Generate a series of patches moving on to feature B.
  3. Realize, after working on feature B, that feature A needs a small tweak, or a better commit message.

The relevant mq commands will get the job done in a jiffy. However, there’s a slight downside: after all that qpushing and qpopping, the patches in the series will lose their original dates, which means that qfinish will produce a tightly clustered series of patches that don’t actually reflect the true chronology of their development.

Fortunately, Mercurial provides the tools needed to preserve a meaningful chronological order. We’ll exploit hg log’s configurable formatting to make it generate a shell script that’ll set the dates for our patch series automatically. A pretty little hack, no? Here are the relevant incantations, assuming you want to modify a patch named P with revision number N:

  1. hg log -r[[N]]:tip --template "hg qref -d '{date|date}'; hg qpush;\n" > d.sh
  2. Run hg qgoto [[P]], then make whatever changes you need.
  3. Run bash d.sh, which will restore the original dates for all patches in the queue following the modified one.

Agrajag Rediscovered

Going through the files on an old family computer, I discovered a long-lost email exchange with Douglas Adams:

Subject: Re: Agrajag
Date: Fri, 11 Jun 1999 17:52:48 -0700
From: "Douglas Adams" <askdna@TDV.COM>
To: Martin Karel <karel@BELLATLANTIC.NET>

Irene –
I’m pretty bad at planning what’s going to happen next week. Alex is exactly right.

Best,
———————————————————–
Douglas Adams | <http://www.douglasadams.com>
———————————————————–
The Digital Village | <http://www.h2g2.com>
———————————————————–

Please – help save the last 650 mountain gorillas
http://www.gorillas.org/index.html

> Hi,
>
> Please settle a debate between me and my brother. I held forth that
> you must have already had Agrajag in mind when you wrote the bit about
> the pot of flowers that, when falling, said "Oh no, not again!" But
> Alex thinks that you probably just randomly stuck that line in there,
> and came up with Agrajag later and it just happened to fit. If I’m
> right, then we are very impressed with your degree of planning ahead,
> seeing as how it was four years between the publishing dates of
> "Hitchhiker’s Guide" and "Life, the Universe, and Everything."
>
>            – Irene Karel,
>               an adoring fan

Perhaps worth noting: the saved text file was 1337 bytes.

Firefox Add-On Compatibility

It looks like Firefox is moving to a compatible-by-default model for extensions. I find this interesting, since I proposed this back in 2004 to no avail.

To be clear, I don’t mean to say that compatible-by-default would have been the right choice in 2004… Predictions about the past are almost as hard as predictions about the future, and at least two major factors have changed since then. First, the state of the world for extensions now is more stable than it was in the Firefox 1.0 timeframe. And second, Firefox’s rapid release cycle exacerbates spurious compatibility problems, making this proposal significantly more attractive.

The Lessons of Lucasfilm’s Habitat

We were initially our own worst enemies in this undertaking, victims of a way of thinking to which we engineers are dangerously susceptible. This way of thinking is characterized by the conceit that all things may be planned in advance and then directly implemented according to the plan’s detailed specification. For persons schooled in the design and construction of systems based on simple, well-defined and well-understood foundation principles, this is a natural attitude to have.

http://www.fudco.com/chip/lessons.html

I wonder what Intel is up to?

 

http://cufp.galois.com/2007/slides/AnwarGhuloum.pdf

“A proposed long term solution: a new functional language that features implicit parallelism, dependent typing, and an effects type system”

“We are also working with an ISV on a compiler for a parallel functional language
– Strict, but not call by value for implicit parallelism (maybe
with lightweight annotations)
– Arrays and array comprehensions for data parallelism
– Effects system to contain impure features
– Atomic for state updates”

 

http://cufp.org/conference/sessions/2010/functional-language-compiler-experiences-intel

“For five years Intel’s Programming Systems Lab (PSL) has been collaborating with an external partner on a new functional programming language designed for productivity on many-core processors. While the language is not yet public, this talk outlines motivations behind the language and describes our experiences in implementing it using a variety of functional languages. The reference interpreter is written in Haskell and compiled with GHC while PSL’s performance implementation is written in SML and compiled with Mlton. We have also generated Scheme code compiled with PLT Scheme as part of a prototyping effort.”