Undergrad Compilers from the Hive Mind

John Regehr recently solicited advice for what an introductory compilers class should cover in 2020. To save you the trouble of reading through the whole thread, I’ll quote/summarize a bit before throwing in my two cents.

  • Many people wanted less parsing/lexing and more about the middle end; several others (including John) noted with some chagrin that parsing is the most widely applicable knowledge conveyed in this sort of class.
  • Several people wanted to highlight the gap between theory and practice in the design & structure of compilers. Traditionally, compilers have been batch processes, whereas IDEs want compilers to be an interactive service that they can query incrementally. Also, error messages tend to be an afterthought in the classic presentation of compilers, but projects like Clang, Elm, and Rust have found big wins in going out of their way to give better error messages. The implementation impact can be big: for example, Rust does speculative parsing by snapshotting the compiler state to give better fix-it hints. There’s also been a burst of research in the last decade on improving error messages for programs with inferred types, but it’s just hard to determine what code is wrong when an inconsistency is found.
  • Sam Tobin-Hochstadt pointed out that “Your students are more likely to hack on Babel than on GCC”; Will Crichton pointed to a course focusing on DSLs rather than general-purpose languages.
  • Several people suggested covering JIT compilers, both for the different assumptions involved and also for the feats that can be accomplished with e.g. Truffle/Graal/Sulong.
  • Rohan Padhye pointed to Berkeley’s compilers class, with some really cool infrastructure for compiling a subset of Python 3 to RISC-V.
  • A few people suggested lightweight correctness/testing methods: differential testing of a compiler vs an interpreter (a minimal flavor of this is sketched just after this list); certifying the output of compiler passes; checking invariants (like type correctness) on multiple IRs.
  • A few people advocated for incremental construction of compilers to make them more approachable, and also for teaching compilers as the functional composition of passes.
  • Geoff Hill pointed to Robby Findler’s course, which does something unusual: it starts from an assembly-like IR, and builds the compiler backwards (!) towards higher-level languages.
  • Matthew Fernandez suggested covering type inference and Reflections on Trusting Trust.
  • Joe Groff wanted to highlight the deep similarity between compilers and interpreters.
  • Talia Ringer noted how much fun it was as a student to be given an open-ended programming assignment; Joe Gibbs Politz confirmed that giving such assignments is strenuous but rewarding.
  • Celeste Hollenbeck noted the appeal of focusing on the basics over esoterica. Compilers for the masses, not compilers for compiler folk. On a perhaps related note, @type_monkey recalled the pleasure of being given historical context when learning PLs.
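
To make the differential-testing suggestion concrete, here’s a minimal sketch (everything in it is invented for illustration): a toy “compiler” to a postfix stack machine is checked against a reference “interpreter” (Python’s own eval) on randomly generated arithmetic expressions.

    import random

    def random_expr(depth=3):
        # Generate a random, fully parenthesized arithmetic expression.
        if depth == 0:
            return str(random.randint(0, 9))
        op = random.choice(['+', '-', '*'])
        return f'({random_expr(depth - 1)} {op} {random_expr(depth - 1)})'

    def compile_expr(tokens):
        # "Compile" a parenthesized expression to postfix bytecode (a flat list).
        tok = tokens.pop(0)
        if tok == '(':
            code = compile_expr(tokens)      # left operand
            op = tokens.pop(0)
            code += compile_expr(tokens)     # right operand
            tokens.pop(0)                    # drop ')'
            return code + [op]
        return [int(tok)]

    def run(code):
        # Execute the postfix bytecode on a tiny stack machine.
        stack = []
        for instr in code:
            if isinstance(instr, int):
                stack.append(instr)
            else:
                b, a = stack.pop(), stack.pop()
                stack.append({'+': a + b, '-': a - b, '*': a * b}[instr])
        return stack[0]

    for _ in range(1000):
        src = random_expr()
        tokens = src.replace('(', '( ').replace(')', ' )').split()
        assert run(compile_expr(tokens)) == eval(src), src   # the two must agree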

Parsing’s definitely not the coolest part of compilation but it is important. I’d want students to understand the following about parsing:

  • Various approaches exist: hand-rolled, generated, combinators, PEGs. Focus on parser generators, and give coverage to the others with an eye towards the tradeoffs involved. Err towards broad coverage: faced with a parsing task, students should understand both what they can do and what others might do.
  • The approaches come with tradeoffs about up-front work versus automated checking/guarantees (much like type systems). Parser generators sometimes complain for silly-seeming reasons (do I care if my grammar is k=1 vs k=2?) but can also catch actual unintended ambiguity in grammars. PEGs and combinators eliminate annoying complaints about grammar structure, but sometimes by silently choosing the “wrong” parse tree.
  • Having a specification distinct from implementation is appealing for parsers just as with any other domain.
  • The choice between LL/LALR/LR is (in practice) mostly an implementation detail that affects how the input grammar needs to be structured.
  • LL parsers correspond directly to the sort of recursive descent parser they might write by hand. I sometimes find it handy to reason about grammar conflicts by thinking about what “goes wrong” in the corresponding recursive descent code. (“Oh, when the parser sees symbol X, it doesn’t know if it should start parsing production Y or production Z”). LR parsers are recursive ascent, which I (still) find to be a trickier mental model when debugging grammar conflicts. I’d want students to get some exposure to shift/reduce vs reduce/reduce errors but I’m not sure quite how much.
  • Parsers for programming languages don’t need to be 100% strict in what language they accept; it’s often better to shift work to the rest of the compiler. (Most(ly) useful for non-hand-rolled approaches). For example, instead of encoding operator precedence in the structure of the grammar itself, the generated parser can build flat lists of operator expressions which then get structured with a separate operator precedence pass (sketched after this list). Another example is that by unifying certain lexical categories, such as number and identifier, errors for incorrect programs come from the compiler rather than the parser generator. A third example is the mini-syntax for strings, with escapes and interpolations and such.
  • I think it’s illustrative to show how e.g. ANTLR’s EBNF-style syntax sugar for grammars, which I find quite natural, can be desugared. Definitely a “yo dawg” moment.
  • Parsing a single well-defined language is (mostly) a solved problem with strong (and very pretty!) underlying theories, but parsing as a whole is not completely solved. Example: embedding/composing languages is common, especially on the web, but support in existing tools is not great so it’s mostly done by hand.
  • Unparsing (both serialization in general, and also faithful pretty-printing) has a much larger gap between theory and practice. Automated formatters are common for Go and Rust and C++, but there’s not much research/theory or tooling support for doing it well.
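
To make that separate-precedence-pass idea concrete, here’s a minimal sketch (the flat-list encoding and tuple AST nodes are invented for illustration): the parser hands over an alternating list of operands and operator symbols, and a precedence-climbing pass builds the tree.

    PRECEDENCE = {'+': 1, '-': 1, '*': 2, '/': 2}

    def build_tree(items, min_prec=1):
        # Precedence-climb over a flat [operand, op, operand, op, ...] list,
        # consuming it destructively. Each node is just a tuple (op, left, right).
        lhs = items.pop(0)
        while items and PRECEDENCE[items[0]] >= min_prec:
            op = items.pop(0)
            # Anything binding strictly tighter than `op` belongs to its right operand.
            rhs = build_tree(items, PRECEDENCE[op] + 1)
            lhs = (op, lhs, rhs)
        return lhs

    print(build_tree([1, '+', 2, '*', 3, '-', 4]))
    # ('-', ('+', 1, ('*', 2, 3)), 4)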

For the rest of the compiler, the pass-based architecture with a few judiciously chosen IRs seems like an obvious choice. A few more specific thoughts:

  • The details of abstract interpretation and Futamura projections are for sure advanced material, but I think intro students should get some intuition of the connection between compilation & interpretation, and how static analysis approximates dynamic semantics. It doesn’t have to be much: take a concrete snippet of code and show how the same operations are happening just at different phases/times (see the sketch after this list).
  • Show how changing (properties of) the input language can make the compiler’s job significantly easier or harder. Potential examples: the difficulty of parallelization and vectorization for Fortran vs C vs NESL, or modularity and dependency management in C vs Go, or error messages for a language with local vs global type inference.
  • One concern that looms larger than in the past is latency, especially in the distributed systems context. The most obvious connection is with the runtime and GC, but the compiler can also play a role; many languages now feature generators, coroutines, or async/await, with compiler support.
  • Optimizations are fun, especially “heroic” optimizations, but they’re also a nice confluence of several camps/viewpoints: “let’s do it cuz it’s cool” versus “is the complexity worth it” versus “is the compile time hit worth it” versus “does this make the performance model more fragile”. And, of course, they’re a great starting point for exploring the ways in which modern architectures differ from historical assumptions.
  • I think it’s interesting that optimizations can have negative runtime cost, by reducing the workload on the remainder of the compiler (similar story with GC: the overhead of GC barriers can be more than offset by reducing the GC’s workload). Chez enforced this “pay their own way” rule.
  • Ultimately, a big question is what not to cover in an intro class. I’d go so far as to defer most “backend” issues, including data flow analysis (!), to an advanced class. Likewise, ignore pretty much all details related to concurrency and parallelism. It’s a tough balance: give enough information to be enticing without going too deep or spreading too thin. I’d also avoid CYK/Earley/GLR/derivative-based parsers and most details of LR parsing’s internals. Coverage of lexing probably depends on what they’ve seen in theory of computation.
  • In my undergrad software engineering class, we ended up using ANTLR to parse a very ad-hoc DSL. In retrospect, the experience was valuable precisely because it was less clear-cut how to separate the input into lexical vs syntactic categories. Not super actionable, I suppose, but to the extent that students will go on to do parsing, they probably won’t be given a clean language design to start from.
  • An intro compilers class shouldn’t get too into the weeds on “adjacent” topics like type checking, but it would be nice to convey some of the range of what can be done with static analysis: substructural typing in Rust, definite assignment in Java/Swift, Extended Static Checking, maybe a whiff of SMT-based refinement type checking (Refined TypeScript?). Another nice example of “things you can do with static analysis but not so much with runtime checks” could be static enforcement of constant-time execution (e.g. for cryptographic code). Students should also understand that it’s become more common to separate type checking from compilation. For example, JavaScript has tools that do compilation without typechecking (Babel, webpack), checking without compilation (ESLint), and both together (TypeScript).
  • Likewise, this is too far afield to spend much time on, but it might be worth taking a few minutes from a lecture to give a high level overview of how compilers integrate with modern software engineering practice and infrastructure: CI/CD, coverage-guided fuzzing, hermetic builds, massively parallel builds, distributed tracing, etc.
  • Bootstrapping is a fun puzzle, and meshes well with Reflections on Trusting Trust.
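
To make the compilation-vs-interpretation bullet above concrete, here’s about the smallest snippet I can think of (the tuple ASTs are invented for illustration): an interpreter performs an addition at run time, while a constant-folding pass performs the very same addition at compile time, leaving a smaller residual program.

    def interpret(expr, env):
        # Run-time evaluation of ('lit', n) / ('var', name) / ('add', l, r).
        tag = expr[0]
        if tag == 'lit':
            return expr[1]
        if tag == 'var':
            return env[expr[1]]
        if tag == 'add':
            return interpret(expr[1], env) + interpret(expr[2], env)

    def fold(expr):
        # Compile-time constant folding: do whatever doesn't need `env` now.
        if expr[0] == 'add':
            l, r = fold(expr[1]), fold(expr[2])
            if l[0] == 'lit' and r[0] == 'lit':
                return ('lit', l[1] + r[1])     # the same add, just done earlier
            return ('add', l, r)
        return expr

    e = ('add', ('add', ('lit', 1), ('lit', 2)), ('var', 'x'))
    print(interpret(e, {'x': 4}))   # 7: all the adds happen at run time
    print(fold(e))                  # ('add', ('lit', 3), ('var', 'x')): 1+2 already done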

4 thoughts on “Undergrad Compilers from the Hive Mind”

  1. Nice summary! Two things I thought of while reading your thoughts:
    “does this make the performance model more fragile” doesn’t get as much consideration as it ought to. I’ve seen countless developers complain that their code wound up being slow because it went over some internal limit in an optimizer’s heuristic. This sort of thing is really difficult to track down! Similarly, when SpiderMonkey implemented its tracing JIT, web developers would find they had accidentally fallen off the fast path and didn’t have useful tools to figure out why.

    “optimizations can have negative runtime cost, by reducing the workload on the remainder of the compiler” I’ve heard it recounted by many folks who have worked on the Rust compiler that even simple optimizations on MIR (Rust’s intermediate representation) can have a huge impact on compile times, because they reduce the amount of IR that gets sent to LLVM, and LLVM is not fast when presented with huge amounts of IR.

  2. I think that teaching a compilers course in reverse would be nice. The key idea is that at every stage (i.e. assignment), the complete compiler is runnable.

    You start with having students make an assembler. The assembler takes as input either a text file, or perhaps a different serialization format, and outputs a binary for some VM which you provide. The compiler front-, middle-, and back-end are provided as a black box that emits stuff to be assembled.
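
    To make that first assignment concrete, here’s a minimal sketch (the three-instruction VM and its encoding are entirely made up): assemble a text listing into a flat bytecode list.

        OPCODES = {'PUSH': 0, 'ADD': 1, 'PRINT': 2}

        def assemble(source):
            # Turn a text listing into flat bytecode for a made-up stack VM.
            code = []
            for line in source.splitlines():
                line = line.split('#')[0].strip()    # drop comments and blank lines
                if not line:
                    continue
                mnemonic, *args = line.split()
                code.append(OPCODES[mnemonic])
                code.extend(int(a) for a in args)    # immediate operands, if any
            return code

        print(assemble("PUSH 2\nPUSH 3\nADD\nPRINT"))
        # [0, 2, 0, 3, 1, 2]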

    Next up is converting an internal compiler IR to the assembly format. In the best case, the compiler IR is provided as a serializable format (this would let students work in multiple languages); in the worst but forgivable case, as a C or C++ header file. The compiler front- and middle-ends are provided as a black box, and then the students plug in their backend and assembler.

    Next up is converting an AST to the middle-end IR. This is similar to the first stage. Finally, there is converting source code to the AST.

    This approach is heavy-handed, but I think it would let you teach the essentials, and it also means that if for some reason the full extent of the course cannot be completed (e.g. running out of time, a strike, weather, etc.), then students can still show that they’ve developed some lego bricks that are part of a working compiler.

    A way to introduce bonus points is to add a competition to the course: students compete to generate the “most optimal” code. If you provide a set of benchmarks ahead of time, as opposed to a black-box set of benchmarks, then students will discover creative ways to tailor their compilers to those benchmarks, which is a great learning outcome.

  3. “Several people wanted to highlight the gap between theory and practice in the design & structure of compilers. Traditionally, compilers have been batch processes, whereas IDEs want compilers to be an interactive service that they can query incrementally.”
    But why is it considered theory vs. practice? O_O
