A Small Example Illustrating Why Unification-Based Type Inference Is Not Always The Most User-Friendly Choice

let base = 10
let bitsPerDigit = logBase 2.0 (fromIntegral base)
let activeBits = fromIntegral $
  ceiling (bitsPerDigit * (Prelude.length "123"))

There's a missing fromIntegral above, which causes GHC to underline logBase and complain that it has no instance for (Floating Int). This is, on the face of things, completely mystifying. Both parameters are “obviously” floating! A second error points to ceiling, saying no instance for (RealFrac Int).

The second error is the more suggestive one, since the problem is that the type of * causes GHC to equate the type of bitsPerDigit with that of (Prelude.length "123"), i.e. Int. But it’s more than a little strange to force the programmer to reason about their code in the same order that the type checker does!


Comment on Comments


if (vars.empty()) { 
  // Store null env pointer if environment is empty 
      /* isVolatile= */ false);




Citation Chains

Here’s a fun game to play on Google Scholar:

  1. Search for a relatively old paper in your field of choice. It will, presumably, be the top result.
  2. Note the number N in the “Cited by N” link under the top result’s abstract. Also note the year of publication.
  3. Click the “Cited by N” link.
  4. GOTO 2.

For example, here’s what results from searching for Luca Cardelli’s paper A polymorphic lambda-calculus with Type : Type:

Paper Name | Published | Cited by <N> | Citations per year
A polymorphic lambda-calculus with Type: Type 1986 72 4
The calculus of constructions 1986 1053 44
A framework for defining logics 1993 1155 68
Proof-carrying code 1997 1873 145
Xen and the art of virtualization 2003 2375 340
Globus toolkit version 4: Software for service-oriented systems 2006 1179 295
A taxonomy of data grids for distributed data sharing, management, and processing 2006 175 88
A toolkit for modelling and simulating data Grids: an extension to GridSim 2008 38 20
Service and utility oriented distributed computing systems […] 2008 11 4


It’s sort of nifty to see the “arch” of citations, and citations/year, over time.

Other interesting searches:

  • program slicing
  • featherweight

Not very interesting:

  • grain elevators

Coroutine Performance

I was curious this afternoon about how fast (or slow) coroutines can be. So I wrote a small microbenchmark in C, using libcoro’s handwritten-assembly implementation of stack switching for symmetric coroutines.

The benchmark builds a complete binary tree of configurable size, then visits it twice. First, a conventional recursively-written function takes the tree root and a function pointer, and calls the function pointer to process each value stored in the tree.
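The conventional half of that benchmark might look something like the following sketch (the names and structure here are my guesses, not the benchmark’s actual source):

```c
#include <assert.h>
#include <stdlib.h>

/* A complete binary tree node holding one value. */
typedef struct node { int value; struct node *left, *right; } node;

/* Build a complete tree of the given depth, numbering nodes in
   preorder via *counter. */
static node *build(int depth, int *counter) {
    if (depth == 0) return NULL;
    node *n = malloc(sizeof *n);
    n->value = (*counter)++;
    n->left  = build(depth - 1, counter);
    n->right = build(depth - 1, counter);
    return n;
}

/* Conventional recursive visit: invoke the function pointer once per
   node, threading a caller-supplied environment through. */
static void visit(node *n, void (*f)(int value, void *env), void *env) {
    if (!n) return;
    f(n->value, env);
    visit(n->left, f, env);
    visit(n->right, f, env);
}

/* An example visitor: accumulate the sum of all node values. */
static void sum_values(int v, void *env) { *(long *)env += v; }
```

The coroutine version inverts this control flow: instead of pushing values into a callback, the visitor yield()s each value back to a pull-style consumer.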

The benchmark next re-visits the tree with the iteration encapsulated by a Lua-style asymmetric coroutine class, which yield()s tree node values back to the main routine.

One interesting operational consequence of doing the recursive visiting inside the coroutine is that the main program stack remains at constant depth for every processed node. This constant-stack-space property is also an effect of CPS transformations – not surprising, since both continuations and coroutines are general control abstractions.

Anyways, I found that when compiled with -O2 (gcc 4.4.3) and run on a Core 2, the overhead of using resume()/yield() compared to call/return was about 110 cycles per yield(). That cost is on par with loading a single word from RAM. Stack switching is an impressively efficient way of implementing a feature that lends significant flexibility and power to a language!

ANTLR Grammar Tip: LL(*) and Left Factoring

Suppose you wish to have ANTLR recognize non-degenerate tuples of expressions, like (x, y) or (f(g), a, b) but not (f) or (). Trailing commas are not allowed. The following formulation of such a rule will likely elicit a complaint (“rule tuple has non-LL(*) decision due to recursive rule invocations reachable from alts 1,2; …”):

tuple : '(' (expr ',')+ expr ')' ;

ANTLR’s error message seems unhelpful at first glance, but it’s trying to point you in the right direction. What ANTLR is saying is that after it reaches a comma, it doesn’t know whether it should expect expr ')' or expr ',', and parsing expr could require looking at an arbitrary number of tokens. ANTLR recommends left-factoring, but the structure of the rule as written doesn’t make it clear what needs to be factored. Re-writing the rule to make it closer to ANTLR’s view makes it clearer what the problem is:

tuple_start : '(' tuple_continue;
tuple_continue
    :    ( expr ',' tuple_continue
    |      expr ')');

Now it’s more obvious where the left-factoring needs to be applied, and how to do it by writing expr once, leaving only the choice of seeing a comma or a close-paren:

tuple_start : '(' tuple_continue;
tuple_continue
    :    expr ( ',' tuple_continue
              | ')'  );

The rewritten expansion corresponds to an alternative structuring of the tuple rule that will satisfy ANTLR:

tuple : '(' expr (',' expr)+ ')';

The ANTLR wiki has a page on removing global backtracking from ANTLR grammars, which has a nice presentation of how to do left-factoring when the opportunity is obvious.


I was reviewing Ras Bodik’s slides on parallel browsers and noticed in the section on parallel parsing that chart parsers like CYK and Earley are theoretically O(n^3) in time, but in practice it’s more like O(n^1.2).

I wondered: wait, what does O(n^1.2) mean for practical input sizes? Obviously for large enough n, any polynomial with an exponent greater than 1 will be worse than, say, O(n log n), but what’s the cutoff? So, I turned to the handy Online Function Grapher and saw this:

n^1.2 vs n lg n

That is, for n less than 5 million, n^1.2 < n lg n, and they remain close up to about n = 9 million. So if Bodik’s estimate holds, chart parsing could indeed be a viable parsing strategy for documents of up to several megabytes. Happily, this limit size matches the current and expected sizes of web pages to a T.

The blue line at the bottom shows linear growth. It’s all too easy to think of linearithmic growth as being “almost as good” as linear growth. To quote Jason Evans discussing how replacing his memory allocator’s log-time red-black tree operations with constant-time replacements yielded appreciable speedups:

“In essence, my initial failure was to disregard the difference between a O(1) algorithm and a O(lg n) algorithm. Intuitively, I think of logarithmic-time algorithms as fast, but constant factors and large n can conspire to make logarithmic time not nearly good enough.”

Thirty Inches a Year

Next week is the one-year anniversary (if you will) of my purchase of a Dell 3007WFP-HC monitor. I snagged a refurbished model for about 60% of the going retail price.

While I’m overall happy with the purchase, and do not regret it, the pragmatics of such a large monitor have turned out to be not quite as rosy as I had initially expected.

The good:

  • With Chrome’s minimal UI, I can fit two vertically-maximized PDFs on the screen at once, or three if I squeeze them down to “merely” life-size. (The screen is about 25% taller and 3 times wider than a sheet of 8.5 x 11” paper). Making it easier to read research papers was one of my reasons for getting the screen, and this aspect has worked out nicely.
  • Eclipse and Visual Studio will eat as much screen space as you can give them. Having a 30” monitor means getting two pages of code side-by-side instead of just one. A portrait-mode 24” monitor at 1200 x 1920 can show one page with 100 vertical lines of code; 2560 x 1600 shows two pages with 80 vertical lines of code.
  • Two displays are easier to drive from one graphics card than three. This means that a thirty-inch monitor plus one smaller display is easier to set up than three smaller displays.

The bad:

  • Managing four million pixels turns out to be much more difficult than managing two million pixels. Because I have so much space, I tend to be less disciplined about keeping fewer windows open. I have about 150 Chrome tabs spread across 15 top level windows, plus six PuTTY sessions and six Explorer windows. I sometimes feel like I need a machete to hack my way through the jungle I created. There are apps that try to help keep windows aligned and orderly, but I haven’t managed to use them successfully.
  • It’s hard to find wallpaper for 30” displays!
  • The sweet spot for monitor prices on a pixels-per-dollar scale is (still) at 24” monitors. For less than the price of a heavily discounted refurbished 30” monitor, you can get two 24” monitors, and put one or both in portrait mode. For the price of a new 30” display, you can get a smaller monitor or two and a desktop or laptop to go along with it…

Would I recommend a 30” monitor to fellow programmers? I’m not sure. Two 24” monitors, or a 27” and a 24”, might make more sense. It depends on your personal opportunity cost for the extra premium over a nice 24” monitor. That extra money could be a few extra cores and double the RAM, or a dedicated box for running tests, or a nice suit and wool pea coat.

All in all, the upgrade from 20” to 24” made a much bigger difference than the upgrade from 24” to 30”. Make of that what you will.

Bootstrapping Compilers and T-diagrams

I came across a very nice notation in the book Basics of Compiler Design that greatly clarified the various choices for bootstrapping a compiler. The notation was originally created by Harvey Bratman in 1961 (!), but every other reference I’ve found on the subject is just terrible for conveying understanding. Mogensen’s presentation, on the other hand, is refreshingly clear.

The rules for T-diagrams are very simple. A compiler written in some language “C” (could be anything from machine code on up) that translates programs in language A to language B looks like this (these diagrams are from Torben Mogensen’s freely-available book on compilers):
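Since the figures don’t reproduce here, a rough ASCII stand-in for the basic T shape: the source and target languages sit on the crossbar, and the implementation language sits in the stem.

```
+-----------------+
|   A   -->   B   |
+-----+     +-----+
      |  C  |
      +-----+
```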

Now suppose you have a machine that can directly run HP machine code, and a compiler from ML to HP machine code, and you want to get a ML compiler running on different machine code P. You can start by writing an ML-to-P compiler in ML, and compile that to get an ML-to-P compiler in HP:

From there, feed the new ML-to-P compiler to itself, running on the HP machine, and you end up with an ML-to-P compiler that runs on a P machine!

The T-diagram notation can also easily be extended with interpreters, which are simply vertical boxes that can be inserted between any two languages. For example, here’s a diagram depicting a machine D running an interpreter for language C that is, in turn, running a compiler in language C to translate a program from language A to language B:

After thinking about bootstrapping in an ad-hoc way, I just love the structure that these diagrams provide! Unfortunately, while the book itself is freely available online, there’s no HTML version of the book, only a PDF version. The easiest way to see the full presentation is probably to use Scribd and search for “t-diagram”.

A Snippet of Functional C

MurmurHash2 is a pretty nifty bit of code: excellent performance in both the statistical and chronological senses, and under the ultra-liberal MIT license.

The kernel of the hash function comprises these 5 lines of C code. k is the next four bytes to be hashed, m and r are experimentally-found constants, and h is the accumulated hash value:

word mrmr_step(word k,
        word h, word m, word r) {
/*1*/ k *= m;
/*2*/ k ^= k >> r;
/*3*/ k *= m;
/*4*/ h *= m;
/*5*/ h ^= k;
      return h;
}

Like much low-level systems code in C, this takes advantage of mutability: the value of k in line 3 is not the same as k in line 5. But what if you wanted to do this computation without mutability, or even assignment? Even in the land of C, the rules, they are a-changin’ [pdf].

At first blush, it seems strange to do first-order functional programming in a language like C, which doesn’t provide much support for it. But it’s precisely because there’s so little magic to C that it’s such a good vehicle for seeing how things work under the covers. The real point of this post is not how to write functional code, but rather to explore how to compile functional code.

Anyways, back to the code. Mutability and immutability are sometimes interchangeable; for example, a += 3; a *= 5; a -= 1; does the same thing as a = ((a + 3) * 5) - 1. The key is the linear dependency chain: each mutated a value is used in precisely one place.

Examining the MurmurHash2 kernel above, most of the dependencies are linear. But there’s one branching dependency, from line 1 to line 2. There are two C-like ways of dealing with this. First, we could "copy" line 1 into line 2:

k = (k*m) ^ ((k*m) >> r);

The second way would be to introduce a new variable:

word km = k * m; km ^= km >> r;

But option 1 breaks down if line 1 has side effects, and option 2 introduces a new assignment, which we wanted to avoid!

The trick to see is that functions provide variable bindings. Thus, we can translate the C code to functional pseudocode like so:

/*1*/ bind km  to k * m in
/*2*/ bind kx  to km ^ (km >> r) in
/*3*/ bind kxm to kx * m in
/*4*/ bind hm  to h * m in
/*5*/ hm ^ kxm;

Now, in this static single assignment form, it’s super easy to see what can be safely inlined: any binding that is used only once, such as hm in line 4. So we might optimize this to:

/*1*/ bind km to k * m in
/*2*/ bind kx to km ^ (km >> r) in
/*5*/ (h * m) ^ (kx * m);

I’ve left line 2 alone for clarity, but it’s easy to see that it could be trivially substituted into line 5. Now, how to translate back to C? In a language with anonymous functions and closures, it’s easy. Just like Scheme translates let into lambda, we could translate bind varname to arg in expr to JavaScript like (function (varname) { return expr; })(arg). So let’s do that to start:

(function          line2(km) {
  return (function line5(kx) {
/*5*/  return (h*m) ^ (kx*m);
/*2*/ })(km ^ (km >> r));
/*1*/ })(k * m);

Note that the code now reads, in some sense, bottom-to-top. Notice also that we’ve created closures, because the line marked 5 references h, which is defined in a parent scope (not shown). So this code isn’t valid C, because C does not have first-class functions. Instead, all C functions must be declared at the top level. Luckily for us, translating the above code to C is easy. We move the function definitions to the top level, and add parameters for their free variables. (This process is called closure conversion, or sometimes lambda lifting.) We’re left with this purely functional C code:

word line2(word km, word r) {
  return km ^ (km >> r);
}
word line5(word h, word kx, word m) {
  return (h * m) ^ (kx * m);
}
word mrmr_step_f(word k, word h,
                 word m, word r) {
  return line5(h, line2(k*m, r), m);
}

This is the code that would be generated by a compiler, given input like the 3-line pseudocode from above. Now, the interesting bit is: how efficient is this version, compared to the original? Since this is just C, we can use gcc‘s -S flag to compile both versions to optimized assembly and compare:

        imull   %edx, %edi
        movl    %edi, %eax
        sarl    %cl, %eax
        xorl    %edi, %eax
        imull   %edx, %eax
        imull   %esi, %edx
        xorl    %edx, %eax

        imull   %edx, %edi
        imull   %edx, %esi
        movl    %edi, %eax
        sarl    %cl, %eax
        xorl    %edi, %eax
        imull   %edx, %eax
        xorl    %esi, %eax

Pretty cool! The assembly from the purely functional version is essentially identical to the low-level imperative code. And a modern out-of-order CPU will ensure that the dynamic execution is, in fact, identical for both versions.
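We can also sanity-check the two versions dynamically. Here’s a minimal harness, assuming word is a 32-bit unsigned type (0x5bd1e995 and 24 are MurmurHash2’s usual m and r constants):

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t word;

/* Imperative kernel, as above. */
static word mrmr_step(word k, word h, word m, word r) {
    k *= m; k ^= k >> r; k *= m;
    h *= m; h ^= k;
    return h;
}

/* Purely functional version, after closure conversion. */
static word fn_line2(word km, word r) { return km ^ (km >> r); }
static word fn_line5(word h, word kx, word m) { return (h * m) ^ (kx * m); }
static word mrmr_step_fn(word k, word h, word m, word r) {
    return fn_line5(h, fn_line2(k * m, r), m);
}
```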

Update 2010/02/11

There are two ways of handling the “extra” arguments added by closure conversion. The way above, where free variables become parameters, is easy to do manually, and clearer for small numbers of parameters. The alternative is to create a separate structure holding the free variables, and pass a pointer to the structure. GCC is smart enough to compile this example to the same assembly as well. Not bad!

typedef struct { word r; } mrmr_step2_env;
static inline word
step2(mrmr_step2_env* env, word km) {
  return km ^ (km >> env->r);
}

typedef struct { word h; word m; } mrmr_step5_env;
static inline word
step5(mrmr_step5_env* env, word kx) {
  return (env->h * env->m) ^ (kx * env->m);
}

word mrmr_step_f(word k, word h,
                 word m, word r) {
  mrmr_step2_env e2; e2.r = r;
  mrmr_step5_env e5; e5.h = h; e5.m = m;
  return step5(&e5, step2(&e2, k*m));
}

C++ was a victim of its own success

I came across an interesting paragraph when reading Bjarne Stroustrup’s 15-year retrospective, Evolving a language in and for the real world: C++ 1991-2006. This notion hadn’t occurred to me before, but it rings quite plausible.

There was also a curious problem with performance: C++ was too efficient for any really significant gains to come easily from research. This led many researchers to migrate to languages with glaring inefficiencies for them to eliminate. Elimination of virtual function calls is an example: You can gain much better improvements for just about any object-oriented language than for C++. The reason is that C++ virtual function calls are very fast and that colloquial C++ already uses non-virtual functions for time-critical operations. Another example is garbage collection. Here the problem was that colloquial C++ programs don’t generate much garbage and that the basic operations are fast. That makes the fixed overhead of a garbage collector look far less impressive when expressed as a percentage of run time than it does for a language with less efficient basic operations and more garbage. Again, the net effect was to leave C++ poorer in terms of research and tools.

— Bjarne Stroustrup

For a peek at an example of “glaring inefficiencies,” check out what two IBM researchers have found about memory use in Java, and prepare to throw up in your mouth a little.

Building protobuf with MinGW GCC 4.4.0

So, after tearing my hair out for an hour this morning, I finally managed to build protobuf using MinGW and GCC 4.4.0, using the msys install from mozilla-build 1.4.

The two major sticking points are

  1. mozilla-build includes autoconf 2.59, whereas protobuf requires autoconf 2.61 at minimum. Solution: download the msys autoconf 2.63 package, and install to /usr/local/bin/autoconf-2.63/, and temporarily export a prefixed $PATH for building protobuf.
  2. protobuf attempts to link against /c/mozilla-build/msys/mingw/lib/gcc/mingw32/4.4.0/libstdc++.dll.a, which fails because that file doesn’t exist. Solution: edit the file libstdc++.la in that directory, replacing libstdc++.dll.a with libstdc++.a in the definition for library_names
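For step 1, the PATH dance is just a temporary prefix (the bin/ subdirectory here is my assumption about the package layout; adjust to your install):

```shell
# Put the newer autoconf first on $PATH, for this shell session only.
export PATH=/usr/local/bin/autoconf-2.63/bin:"$PATH"
```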

Introducing Canvas Cacheviz

Eric Dramstad and I have been working on a new canvas-based tool. This time, it’s a cache visualizer.

Here’s what a naive matrix transposition looks like (image: naive-transpose):

On the left is the sequence of memory touches; black pixels are touched first, and white pixels are touched last. The left side shows that each row of the upper triangle is accessed in tandem with the corresponding column of the lower triangle. On the right are the cache hits and misses. The upper triangle has a cache-friendly sequential access pattern, while the lower triangle has a cache-unfriendly strided access pattern. 56% of the memory accesses result in a cache miss.

In contrast, here’s a cache-oblivious matrix transposition (image: cache-oblivious-transpose):

The left side shows the divide-and-conquer approach. Small blocks are transposed rather than whole rows and columns, which helps with cache locality, as reflected on the right. In this instance, only 28% of the memory touches were cache misses, a significant improvement over the naive algorithm.


There are several improvements I’d eventually like to make. One major shortcoming is that the cache model is only approximately like the cache system found in your average x86 chip. We currently only model a single-level fully-associative cache. We should eventually refine that model to a two-level set-associative cache, of configurable level sizes. I’d also like to show the cache effects of various sorting algorithms, like one-pivot quicksort versus insertion sort versus two-pivot quicksort.
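For the curious, a single-level fully-associative model boils down to an LRU list of line addresses. A sketch (the sizes here are illustrative, not the tool’s actual parameters):

```c
#include <assert.h>

#define NUM_LINES 4    /* cache capacity, in lines */
#define LINE_SIZE 16   /* bytes per line */

typedef struct {
    long lines[NUM_LINES];  /* resident line addresses, LRU first */
    int used;
    long accesses, misses;
} cache;

static void cache_touch(cache *c, long addr) {
    long line = addr / LINE_SIZE;
    int i, hit = -1;
    c->accesses++;
    for (i = 0; i < c->used; i++)
        if (c->lines[i] == line) { hit = i; break; }
    if (hit < 0) {                       /* miss: fetch the line */
        c->misses++;
        if (c->used == NUM_LINES) {      /* full: evict LRU (slot 0) */
            for (i = 0; i < c->used - 1; i++) c->lines[i] = c->lines[i + 1];
            c->lines[c->used - 1] = line;
            return;
        }
        c->lines[c->used++] = line;
        return;
    }
    /* hit: move the line to the most-recently-used end */
    for (i = hit; i < c->used - 1; i++) c->lines[i] = c->lines[i + 1];
    c->lines[c->used - 1] = line;
}
```

Sequential touches miss once per line, while a stride wider than the cache capacity misses on every touch, which is exactly the contrast between the upper and lower triangles of the naive transpose.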

The source code for canvas-cacheviz is hosted on GitHub. So far my experience with git has been negative in comparison to Mercurial. Git may be more powerful, but it has been all too easy to do strange and unwanted things with the repository DAG. And even simple questions like “what’s the equivalent of svn revert?” were difficult to answer using the tool’s help listing.

Summer 2009: Time and Books

When I started working this summer, I discovered three things, with varying levels of surprise:

  1. I had significantly less free time than I had envisioned. Buying groceries and cooking takes a lot more time than I thought!
  2. I was significantly more productive than I had envisioned. Having less time necessitated making it more structured and more efficient.
  3. I ended up getting more done even though I had less time to do it.

I would cook, read books, go to the gym, and do a little side programming during the week, and on the weekend I’d visit my parents (and usually mow their lawn for them). At school, I had three or four times as much free time (at least!), but I didn’t get as much done in a concrete sense. I spent much more time browsing proggit and Google Reader. Not time wasted, per se, but on the whole I was not as productive as I was under greater restrictions.

Books I read this summer:

  • Confessions of an Economic Hit Man
  • Life in Double Time
  • All seven Harry Potter books, including the sixth consumed in one glorious day at the beach
  • Types and Programming Languages (casually, at least)
  • A Thousand Splendid Suns
  • One Hundred Years of Solitude
  • and I’ve started reading more:
    • The Gold Bug Variations
    • Advanced Topics in Types and Programming Languages
    • Masterminds of Programming
    • Design Concepts in Programming Languages
    • Refactoring

Even conservatively counting the first three books of Harry Potter as one, that’s still ten books finished and five more started in the last three months. Much better than the, um, one or two that I probably read during the last semester of my undergrad program.

Canvas Spline Toys

I am pretty busy this week. For the past few days, I’ve been doing 8 or more hours of research, as well as packing and organizing my physical and virtual possessions. I hope to have several more posts over the next few days detailing some of the projects, medium and small, that I have worked on in the last few months.

The latest project is a simple interactive spline visualizer. I had a random itch at work to have a way of seeing splines constructed one-piece-at-a-time, so I created an initial mockup using the HTML canvas object. Eric Dramstad extended it with mouse support for rearranging control points, as well as cleaner code.

The blue line is a conventional second-order linear interpolation. From an OpenGL standpoint, it has the nice feature that the interpolated line lies entirely within the triangle defined by the three control points. The corresponding disadvantage is that the line does not pass through the second control point. The green line does the reverse. Instead of interpolating between L1(t) and L2(t), it interpolates between L1(t) and L2(-1 + 2*t). The green spline passes through all three control points, but is thus not nicely contained like the blue spline.
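In code, the two constructions amount to one repeated lerp each. A sketch (not the tool’s actual source; I’m only transcribing the formulas described above):

```c
#include <assert.h>
#include <math.h>

typedef struct { double x, y; } pt;

/* Linear interpolation between two control points. */
static pt lerp(pt a, pt b, double t) {
    pt r = { a.x + (b.x - a.x) * t, a.y + (b.y - a.y) * t };
    return r;
}

/* Blue: interpolate between L1(t) and L2(t) -- the standard
   repeated-lerp quadratic curve. */
static pt blue(pt p0, pt p1, pt p2, double t) {
    return lerp(lerp(p0, p1, t), lerp(p1, p2, t), t);
}

/* Green: the same, but the second leg runs on -1 + 2*t. */
static pt green(pt p0, pt p1, pt p2, double t) {
    return lerp(lerp(p0, p1, t), lerp(p1, p2, -1.0 + 2.0 * t), t);
}
```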

An interested programmer could extend this without too much difficulty. It would be particularly cool to be able to (A) add additional control points, and (B) interactively define new interpolation routines. Even without such features, Eric Dramstad’s interactive version makes for a surprisingly entertaining minute or two!