Manually making pervasive changes in a four-million-line codebase is all but impossible. Thus, in a bid to make far-reaching changes to the internals of their code, Mozilla has been actively pursuing better static analysis tools.
It turns out that analyzing C++ is remarkably difficult; the stuff of patents and at least one doctoral dissertation. The root cause is the painfulness of parsing C++. First, C++ is simply a large language, syntactically speaking, and due to vendor extensions, there is no single point of reference for the syntax found in the real world. Second, C++ is difficult to parse with automated tools. The C++ grammar, such as it is, does not fit into any of the usual formal language classes like LL(k) or LR(k). This parsing difficulty is partly due to its C heritage. To give one simple example of a problematic construction:
return (a)(b); could involve a cast of b to type a, or a function call a(b). Which one applies depends on the type of a, but in a classical compiler, type information is only available after parsing completes! Finally, preprocessor macros create a disconnect between what the user sees in a source file and what the compiler sees in a translation unit. Any analysis tools must either bridge that divide, or risk being rejected for not properly handling macros.
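To make the two problems above concrete, here is a minimal sketch of my own. The namespaces, the DEFINE_DOUBLER macro, and the function names are all made up for illustration and come from no Mozilla code. The same tokens, (a)(b), are a cast in one scope and a function call in the other, and a single macro line in the source turns into a whole function definition by the time the compiler sees it.

```cpp
#include <cassert>

// Case 1: `a` names a type, so (a)(b) is a C-style cast of b to a.
namespace as_cast {
    typedef long a;
    int b = 7;
    long result() { return (a)(b); }
}

// Case 2: `a` names a function, so the very same tokens are the call a(b).
namespace as_call {
    int a(int x) { return x + 1; }
    int b = 7;
    int result() { return (a)(b); }
}

// Macro disconnect: the source file shows one short line, but the
// translation unit contains a complete function definition.
// (DEFINE_DOUBLER and twice are hypothetical names for this sketch.)
#define DEFINE_DOUBLER(name) int name(int x) { return 2 * x; }
DEFINE_DOUBLER(twice)

// Checks that each construction behaves as the comments claim.
void check_parses() {
    assert(as_cast::result() == 7);   // cast leaves the value unchanged
    assert(as_call::result() == 8);   // call adds one
    assert(twice(21) == 42);          // function generated by the macro
}
```

Calling check_parses() from main exercises all three cases. A parser has to know what a is declared as before it can even build the right tree for result(), and a rewriting tool has to know that twice only exists after preprocessing.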
Mozilla runs several static analysis projects, which in turn build upon several more. The tools are split into two related but separate categories: static analysis and rewriting. The following is my understanding of the current tool space:
Right now, Dehydra and Treehydra require a custom, patched version of GCC. However, in the future, starting with GCC 4.5, plugin support (thanks largely to Mozilla’s Taras Glek!) will obviate the need for a patched GCC. That’s very exciting, especially because it will mean that static analysis of C++ code will become available to orders of magnitude more programmers!
Unfortunately, the *hydras are not sufficient for the task of automated rewriting, because GCC does not preserve all the necessary syntactic information about token positions and macro expansion and such. Thus, more tools are needed.
The primary C++ rewriting tool from Mozilla is called Pork. Unlike GCC, it retains exact syntactic information about the code it parses, as well as the changes made while preprocessing. Pork is built on several other tools: Oink, Elsa, and MCPP. MCPP is a preprocessor that includes annotations in preprocessed source to help reconstruct the original source file. Elsa is a C++ parser, based on the GLR parser generator Elkhound. Finally, Oink is a tool for analyzing the parse trees constructed by Elsa.
Whew! I hope that information is accurate, +/- 10%.