Overview of Mozilla’s C++ Static Analysis Tools

Manually making pervasive changes in a four million line codebase is all but impossible. Thus, in a bid to make far-reaching changes to the internals of their code, Mozilla has been actively pursuing better static analysis tools.

It turns out that analyzing C++ is remarkably difficult; the stuff of patents and at least one doctoral dissertation. The root cause is the painfulness of parsing C++. First, C++ is simply a large language, syntactically speaking, and due to vendor extensions, there is no single point of reference for the syntax found in the real world. Second, C++ is difficult to parse with automated tools. The C++ grammar, such that it is, does not fit into any of the usual formal language classes like LL(k) or LR(k). This parsing difficulty is partly due to its C heritage. To give one simple example of a problematic construction: return (a)(b); could involve a cast of b to type a, or a function call a(b). Which one applies depends on the type of a, but in a classical compiler, type information is only available after parsing completes! Finally, preprocessor macros create a disconnect between what the user sees in a source file and what the compiler sees in a translation unit. Any analysis tools must either bridge that divide, or risk being rejected for not properly handling macros.

Mozilla has several static analysis projects it runs, which build upon several more in turn. The tools are split into two related but separate categories: static analysis and rewriting. The following is my understanding of the current tool space:

Dehydra is a lightweight static analysis tool for C++. It is implemented as a GCC plugin that passes C++ AST information to JavaScript code, which can examine the provided AST nodes. Among other things, Dehydra is used as the smarts behind a very cool semantic Mozilla code browser called DXR (Dehydra Cross Reference).

Treehydra is Dehydra’s heavy duty companion. It deals with GCC’s internal GIMPLE intermediate representation. This allows Treehydra analyses to be run after any optimization pass in GCC. Treehydra scripts also get access to GIMPLE control flow graphs (CFGs). Access to CFGs enables more advanced path-based analyses. In effect, anything GCC “knows” about a piece of C++ is available to JavaScript code through Treehydra.

Right now, Dehydra and Treehydra require a custom, patched version of GCC. However, in the future, starting with GCC 4.5, plugin support (thanks largely to Mozilla’s Taras Glek!) will obviate the need for a patched GCC. That’s very exciting, especially because it will mean that static analysis of C++ code will become available to orders of magnitude more programmers!

Unfortunately, the *hydras are not sufficient for the task of automated rewriting, because GCC does not preserve all the necessary syntactic information about token positions and macro expansion and such. Thus, more tools are needed.

The primary C++ rewriting tool from Mozilla is called Pork. Pork is a tool for rewriting C++. Unlike GCC, it retains exact syntactic information about the code it parses, as well as the changes made while preprocessing. Pork is built on several other tools: Oink, Elsa, and MCPP. MCPP is a preprocessor that includes annotations in preprocessed source to help reconstruct the original source file. Elsa is a C++ parser, based on the GLR parser generator Elkhound. Finally, Oink is a tool for analyzing the parse trees constructed by Elsa.

Whew! I hope that information is accurate, +/- 10%.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s