I recently watched two videos from POPL and wanted to help make their ideas more widely known. Ten minutes of text to help you decide whether to watch eighty minutes of video, what a deal!
The core idea: create a program logic for finding bugs rather than preventing them.
The technical details of this work are beautiful in their own right, but I wanted to explain the broader picture of what (I think) it’s all about, and why it’s exciting. I suppose the target audience for the first half of this blog post is people who get why their compiler produces error messages, but aren’t so keen on inference rules.
One thing I really like about this work is that it’s the output of a virtuous cycle between theory and practice. O’Hearn (and others) worked on some promising theories in academia, then went to Facebook to build tools for working programmers. Seeing ways the tools might be improved, they went back in search of new theory to make the tools better. This sort of back-and-forth seems to produce great results when it happens, but unfortunately it’s rarer than one might hope.
A type system splits the universe of all possible programs, accepting some and rejecting others. We’d like to reject programs that do bad things and accept programs that don’t do bad things. Most type systems for general purpose languages have an extra category: programs that don’t do bad things but are rejected anyway. This is an inevitable consequence of trying to analyze Turing-complete programs with the guarantee that the type system will rule out every possible form of the particular badness it cares about.
A more sophisticated type system might rule out more bad programs, or accept more good ones, or both. One big cost of this power is that the programmer’s understanding of the system must also grow. When a program is rejected, the programmer first faces the burden of translating from the language of the type error they’ve been given into their own mental model of what the program might do, and then determining whether the error is spurious or not. Spurious errors—those due to limitations in the type system itself—can be avoided by rewriting code in some equivalent form that the type system can do a better job of reasoning about. For “real” errors, the behavior of the code itself has to change.
This disconnect, and the burden it places on the programmer, effectively limits how sophisticated a widely used type system can be.
I mentioned three categories before, but there’s really a fourth as well: programs which do bad things but are accepted because the purview of the type system is (intentionally) limited by the above dynamic. Some of these cases, like array bounds checking, will be caught by the language runtime; others, like SQL injection attacks, probably won’t be. Distributed systems are common, but type systems to make them robust… not so much.
We can try to eliminate this fourth category. Doing so leads to the realm of proof systems and dependent types and so on. At the end of this path, types become formal specifications of what it means for our program to be correct, and our type system is a proof checker. But for most code, it’s not clear what a complete formal specification would even look like (what’s the correctness condition for Slack, or Facebook’s news feed, or a video game?). The overwhelmingly common case is partial correctness, and therefore analyses that catch some bugs and not others.
O’Hearn’s Incorrectness Logic
Peter O’Hearn’s paper on Incorrectness Logic recognizes this situation, and turns it on its head. Most prior work (on type systems, program analysis, verification, etc) focuses on proving the absence of bugs, and is thereby shackled by the burden of false reports. O’Hearn instead examines the theoretical underpinnings of a system focused on proving the presence of bugs.
This rearrangement turns out to have several pleasing symmetries with traditional approaches to program analysis. It also results in a number of useful properties. Instead of a system limited to conveying messages of the form “I wasn’t able to prove that your code is safe in all circumstances” we can be much more concrete, telling the programmer “I was able to prove that your code is unsafe in the following circumstance.” No more false reports; no more struggling to reverse-engineer a bug from a type error.
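The central symmetry can be written down directly. Following the paper’s definitions (modulo notation), a Hoare triple over-approximates the reachable states, while an incorrectness triple under-approximates them:

```latex
% Hoare triple: every state reachable from P satisfies Q (over-approximation)
\{P\}\; C\; \{Q\} \quad\Longleftrightarrow\quad \mathrm{post}(C)(P) \subseteq Q

% Incorrectness triple: every state satisfying Q is genuinely reachable
% from some state satisfying P (under-approximation)
[P]\; C\; [Q] \quad\Longleftrightarrow\quad Q \subseteq \mathrm{post}(C)(P)
```

Flipping the subset relation is exactly what buys the “no false reports” property: anything the logic describes in the result really can happen.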
There have, of course, been decades of tools and techniques focused on finding bugs. Those tools and techniques were by necessity concrete in their analysis, in order to find specific categories of bugs. Different tools used different techniques to find different bugs. The appeal of O’Hearn’s work is that it’s a (more) generic framework for doing the sort of reasoning that results in evidence of bugs.
There are three exciting consequences here (that I can see, and certainly more that I can’t!). First, this new logic can serve as an engine driving a pluggable definition of badness, which means that powerful bug finding tools should become cheaper and easier to build. As O’Hearn puts it,
“Incorrectness is an abstraction that a programmer or tool engineer decides upon to help in engineering concerns for program construction. Logic provides a means to specify these assumptions, and then to perform sound reasoning based on them, but it does not set the assumptions.”
The paper gives an example of an analysis for detecting flaky tests and characterizing why they aren’t reproducible. Such a tool would be widely useful in industry!
Second, the shift in perspective opens the door to considering much more sophisticated forms of reasoning. Rather than the hapless programmer being forced to intuit or reverse-engineer a problematic scenario from a failure to prove correctness, incorrectness logic can directly describe program states that are guaranteed to be problematic. Hiding the machinery needed to find those states lets us choose tools on the basis of their power, rather than restricting ourselves to tools with simple mental models.
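A cartoon of the under-approximate flavor (my own illustrative sketch, not from the paper): a bug finder that only reports concrete inputs it has actually witnessed failing can never raise a false alarm, though it may miss bugs entirely.

```python
# Toy under-approximate "bug finder": it reports only failures it has
# actually witnessed, so every report is a true positive (no false
# alarms), at the cost of possibly missing bugs.

def average(xs):
    # Buggy: crashes on an empty list with ZeroDivisionError.
    return sum(xs) / len(xs)

def find_bug(fn, candidate_inputs):
    """Try concrete inputs; return a witness for the first crash found."""
    for inp in candidate_inputs:
        try:
            fn(inp)
        except Exception as exc:
            return inp, exc  # a concrete proof of incorrectness
    return None  # no bug *found* -- not the same as "no bug exists"

witness = find_bug(average, [[1, 2, 3], [10], []])
# witness[0] is [], the concrete circumstance in which the code is unsafe
```

The report is in the programmer’s own terms (“here is an input that crashes”), not in the analysis’s terms (“I couldn’t prove this division safe”).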
Third, soundness is a binary property, but unsoundness is a spectrum. O’Hearn describes a tool used at Facebook, and how by dialing back a simple parameter, the tool ran ~2.75x faster while still finding 97% of the bugs from the slower run. “Variable effort for variable results” is often an appealing property.
There’s still a considerable amount of work left to do, of course, but incorrectness logic strikes me as a plausible foundation for a new generation of powerful bug finding tools. Fingers crossed!
Cifuentes’ Keynote

Cristina Cifuentes presented one of the keynotes at POPL this year. Her talk provides a great introductory overview of some of the challenges facing those who want to address (some of) the industry’s security woes with programming language technology. If you’re not familiar with this area, you’ll probably get a lot out of the talk! It does a nice job of providing enough detail—including code snippets—to make things concrete without going too deep into the weeds on any one area.
The talk’s categorization of bug classes is a bit fuzzy; it considers buffer overflows, injection attacks (such as XSS and SQL injection), and information leakage, such as sensitive information making its way into a log file. Other classes—like use-after-free errors, path traversal bugs, cryptographic issues, and access control errors—are not considered, because it’s unclear to the authors how those categories can be addressed with linguistic abstractions. The authors give brief coverage of the larger universe of security; a slide nods towards practical and theoretical work on multi-language system building, including compartmentalization, multi-language runtimes, and linking types.
The talk ends with an exhortation to improve the state of affairs for the 18.5 million extant programmers dealing with imperfect languages: “It’s time to introduce security abstractions into our language design.”
The POPL audience then put forth a few big-picture questions, but time limits precluded any real back-and-forth. Paraphrased:
- The talk gives a list of different security vulnerability categories to worry about. Does it leave out anything important?
- The talk suggests solving security flaws with new language abstractions. Do we need to build separate abstractions/mechanisms to address each individual problem? What about other approaches, such as formal verification? Or capability-based security?
- What happens if we don’t improve things, if we keep on using C?
I hope the talk (and this writeup) might catalyze further discussion. Here are some of my thoughts:
- Consider one specific sub-category of flaw: SQL injections. Languages like Python and PHP don’t prevent SQL injection attacks, but the real story is a bit subtler, in an interesting way. These languages provide library support for avoiding SQL injections, but have no way of forcing the use of safe versus unsafe functions. Do we understand why programmers choose insecure approaches when secure ones are right there?
- It’s tempting to say “Oh, the key is that we lack linguistic abstractions; library-based conventions are too easy to bypass.” But I don’t think that captures the whole story, especially for things like information flow. Linguistic abstractions for information flow security—whether we’re talking static labels, tagged values, faceted values, or whatever else—provide mechanisms to enforce some policy. But the choice of policy to enforce is left free.
Compare and contrast with something like GC. Language support for automatic memory management forcibly eliminates (most of) the errors of non-automatic memory management. But language support for information flow does not forcibly eliminate information flow errors; it provides infrastructure that can be used to eliminate such errors. Or phrased more starkly: infrastructure that, when used properly, can eliminate such errors. A subtle but important difference!
Enforcing security impacts program behavior. If you take a C program, delete all the `free()` calls, and run it with the BDW GC, you will remove several failure modes (use after free, double free, etc.) and add no new linguistic failure modes. In contrast, applying information flow security to a program absolutely risks adding new failure modes. If the logging policy is accidentally too restrictive, your logs will be missing crucial data. If you don’t declassify the output bit of the password checker, you can’t log in! Making the IFC policy too loose will allow bad behavior; making it too tight will prevent good behavior. Empirically, the world has a strong revealed preference towards not preventing good behavior.
It’s worth considering not only whether secure abstractions exist, but what properties they need to have to make adoption feasible.
- Bugs in the real world often have multiple contributing factors, which aren’t always accurately captured by the CWE classifications.
For example, CVE-2014-0574 was labeled with CWE-94 (Code Injection), but its root cause was a double-free error; no memory-safety CWE was attached to the vulnerability, thereby under-counting the prevalence of temporal memory safety issues. Likewise, CVE-2014-1739 was labeled CWE-200 (Information Exposure) although the root cause was C’s lack of required initialization, i.e. type unsafety. Other CWEs are highly correlated: Code Injection tends to have a lot of overlap with Improper Input Validation, but usually each bug gets put in just one of those two buckets.
As with many classification efforts, it doesn’t take much digging to uncover fractal complexity.
The CVE/CWE data incorporates a certain amount of inherent subjectivity, and the labels it uses may not be the labels you would use. To be clear, I don’t think these labeling issues affect the paper’s argument. After all, the paper explicitly ignores half the dataset! But I do think it’s generally worth being aware of whatever gap, big or small, exists between expectations and reality.
- Sometimes researchers get the luxury of using precise internal jargon. Terms like “minimum-weight spanning tree” or “pullback functor” leave little room for misinterpretation. Other words risk a collision of terminology that already has meaning in the audience’s mind. “Secure” is one such word. Cifuentes & Bierman give a precise definition for what they mean when they say “secure programming language” but it’s a bit quirky. For example, because it doesn’t cover use-after-free bugs, and because return-oriented programming attacks don’t actually inject new code, their definition wouldn’t eliminate unintended arbitrary remote code execution. I don’t think that state of affairs matches what most people mean when they hear “secure programming language.”
What do people mean by “secure”? If you ask ten people to define what they want, expect eleven different answers.
- Okay, sure, defining “secure” is hard. Let’s forget about that and ask a different question: what properties should a definition have?
Suppose we’d settled on a definition, built a system around it, and even formally verified that the implementation satisfies it. Then Spectre happens, and oops, untrusted JS can read the kernel’s heap.
Now… what actually changed? Not the code, for sure. Not the proofs. The definition of “secure”? Not in any glaringly obvious way: before Spectre, everyone agreed that being unable to read kernel memory was an important part of being secure, and believed that the implementation satisfied that property. And in some sense, they were right one day and wrong the next.
Maybe “secure” has more to do with observers/attackers of a codebase rather than the codebase itself. Or perhaps “secure” cannot be a property of code, but only of systems: it must bring together properties of language, compiler, CPU, etc. Put another way, security can only be defined relative to a threat model, and what Spectre changed is that the CPU became part of the threat model. One unintuitive consequence of this view is that a trusted and formally verified compiler would have to be considered part of the threat model. The systems view says that “secure” cannot be modularized, and there’s no such thing as a trusted computing base.
(As an aside, I don’t believe that these ideas about security are novel, but I’m unsure of the proper citations. Maybe Fred Schneider? Pointers from those more familiar with the security literature would be greatly appreciated!)
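To make the earlier SQL-injection point concrete, here’s a minimal sketch using Python’s sqlite3 (the table and data are made up for illustration): the safe, parameterized API sits right next to the unsafe string-formatting approach, and nothing in the language forces you to pick the former.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, is_admin INTEGER)")
conn.execute("INSERT INTO users VALUES ('alice', 0)")

attacker_input = "nobody' OR '1'='1"

# Unsafe: string interpolation splices attacker-controlled text into the
# query itself, so the OR '1'='1' clause matches every row.
unsafe = conn.execute(
    "SELECT name FROM users WHERE name = '%s'" % attacker_input
).fetchall()

# Safe: the ? placeholder treats the input purely as data, never as SQL.
# No user is literally named "nobody' OR '1'='1", so nothing matches.
safe = conn.execute(
    "SELECT name FROM users WHERE name = ?", (attacker_input,)
).fetchall()
```

Both snippets typecheck, both run, and both are one line long; the difference between them is invisible to the language.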
The talk’s authors are aware of most or all of these concerns. After writing this up, I saw that an earlier iteration covered a broader range of topics, including capabilities, cryptography, access control, the pros and cons of formal verification, and the library-versus-language question. Streamlining allowed deeper coverage of the most critical pieces. In the end, security is a big hairy problem, and even an hour-long talk can only capture a small piece of the overall picture.
In summary: security matters, and data-driven decision-making is a powerful technique. Maybe the combination can help crack a tough nut.