Firefox Add-On Compatibility

It looks like Firefox is moving to a compatible-by-default model for extensions. I find this interesting, since I proposed this back in 2004 to no avail.

To be clear, I don’t mean to say that compatible-by-default would have been the right choice in 2004… Predictions about the past are almost as hard as predictions about the future, and at least two major factors have changed since then. First, the state of the world for extensions now is more stable than it was in the Firefox 1.0 timeframe. And second, Firefox’s rapid release cycle exacerbates spurious compatibility problems, making this proposal significantly more attractive.

The Lessons of Lucasfilm’s Habitat

We were initially our own worst enemies in this undertaking, victims of a way of thinking to which we engineers are dangerously susceptible. This way of thinking is characterized by the conceit that all things may be planned in advance and then directly implemented according to the plan’s detailed specification. For persons schooled in the design and construction of systems based on simple, well-defined and well-understood foundation principles, this is a natural attitude to have.

http://www.fudco.com/chip/lessons.html

I wonder what Intel is up to?

 

http://cufp.galois.com/2007/slides/AnwarGhuloum.pdf

“A proposed long term solution: a new functional language that features implicit parallelism, dependent typing, and an effects type system”

“We are also working with an ISV on a compiler for a parallel functional language
- Strict, but not call by value for implicit parallelism (maybe with lightweight annotations)
- Arrays and array comprehensions for data parallelism
- Effects system to contain impure features
- Atomic for state updates”

 

http://cufp.org/conference/sessions/2010/functional-language-compiler-experiences-intel

“For five years Intel’s Programming Systems Lab (PSL) has been collaborating with an external partner on a new functional programming language designed for productivity on many-core processors. While the language is not yet public, this talk outlines motivations behind the language and describes our experiences in implementing it using a variety of functional languages. The reference interpreter is written in Haskell and compiled with GHC while PSL’s performance implementation is written in SML and compiled with Mlton. We have also generated Scheme code compiled with PLT Scheme as part of a prototyping effort.”

A Small Example Illustrating Why Unification-Based Type Inference Is Not Always The Most User-Friendly Choice

let base = 10
let bitsPerDigit = logBase 2.0 (fromIntegral base)
let activeBits = fromIntegral $
  ceiling (bitsPerDigit * (Prelude.length "123"))

There's a missing fromIntegral above, which causes GHC to underline logBase and complain that it has no instance for (Floating Int). This is, on the face of things, completely mystifying. Both parameters are “obviously” floating! A second error points to ceiling, saying no instance for (RealFrac Int).

The second error is the more suggestive one: the type of * requires both operands to have the same type, so GHC equates the type of bitsPerDigit with that of (Prelude.length "123"), i.e. Int, and then complains wherever bitsPerDigit needs to be floating. The fix is to write fromIntegral (Prelude.length "123"). But it’s more than a little strange to force the programmer to reason about their code in the same order that the type checker does!

Comment on Comments

OK:

if (vars.empty()) {
  // Store null env pointer if environment is empty
  builder.CreateStore(
      llvm::ConstantPointerNull::getNullValue(
          clo_env_slot->getType()
             ->getContainedType(0)),
      clo_env_slot,
      /* isVolatile= */ false);
}

 

Better:

if (vars.empty()) {
  storeNullPointerToSlot(clo_env_slot);
}

Citation Chains

Here’s a fun game to play on Google Scholar:

  1. Search for a relatively old paper in your field of choice. It will, presumably, be the top result.
  2. Note the number N in the “Cited by N” link under the top result’s abstract. Also note the year of publication.
  3. Click the “Cited by N” link.
  4. GOTO 2.

For example, here’s what results from searching for Luca Cardelli’s paper A polymorphic lambda-calculus with Type : Type:

| Paper name | Pub. date | Cited by N | Citations per year |
|---|---|---|---|
| A polymorphic lambda-calculus with Type: Type | 1986 | 72 | 4 |
| The calculus of constructions | 1986 | 1053 | 44 |
| A framework for defining logics | 1993 | 1155 | 68 |
| Proof-carrying code | 1997 | 1873 | 145 |
| Xen and the art of virtualization | 2003 | 2375 | 340 |
| Globus toolkit version 4: Software for service-oriented systems | 2006 | 1179 | 295 |
| A taxonomy of data grids for distributed data sharing, management, and processing | 2006 | 175 | 88 |
| A toolkit for modelling and simulating data Grids: an extension to GridSim | 2008 | 38 | 20 |
| Service and utility oriented distributed computing systems […] | 2008 | 11 | 4 |

 

It’s sort of nifty to see the “arch” of citations, and citations/year, over time.

Other interesting searches:

  • program slicing
  • featherweight

Not very interesting:

  • grain elevators

Coroutine Performance

I was curious this afternoon about how fast (or slow) coroutines can be. So I wrote a small microbenchmark in C, using libcoro’s  handwritten-assembly implementation of stack switching for symmetric coroutines.

The benchmark builds a complete binary tree of configurable size, then visits it twice. First, a conventional recursively-written function takes the tree root and a function pointer, and calls the function pointer to process each value stored in the tree.

The benchmark next re-visits the tree with the iteration encapsulated by a Lua-style asymmetric coroutine class, which yield()s tree node values back to the main routine.

One interesting operational consequence of doing the recursive visiting inside the coroutine is that the main program stack remains at constant depth for every processed node. This constant-stack-space property is also an effect of CPS transformations – not surprising, since both continuations and coroutines are general control abstractions.

Anyways, I found that when compiled with -O2 (gcc 4.4.3) and run on a Core 2, the overhead of using resume()/yield() compared to call/return was about 110 cycles per yield(). That cost is on par with loading a single word from RAM. Stack switching is an impressively efficient way of implementing a feature that lends significant flexibility and power to a language!
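For reference, the conventional half of the benchmark might look roughly like the sketch below. This is not the original benchmark code: the node layout, the names (build_tree, visit, sum_tree) and the extra ctx argument passed to the callback are all assumptions made for illustration. The coroutine half is left out because it depends on libcoro's API.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical node layout; the original benchmark's types are not shown. */
typedef struct node {
    int value;
    struct node *left, *right;
} node;

/* Build a complete binary tree with `depth` levels, numbering nodes 1, 2, 3, ... */
node *build_tree(int depth, int *next_value) {
    if (depth <= 0) return NULL;
    node *n = malloc(sizeof *n);
    if (n == NULL) abort();
    n->value = (*next_value)++;
    n->left = build_tree(depth - 1, next_value);
    n->right = build_tree(depth - 1, next_value);
    return n;
}

void free_tree(node *n) {
    if (n == NULL) return;
    free_tree(n->left);
    free_tree(n->right);
    free(n);
}

/* Callback type: receives each stored value plus a caller-supplied context. */
typedef void (*visit_fn)(int value, void *ctx);

/* Conventional recursive in-order traversal: the C stack grows with tree depth. */
void visit(const node *n, visit_fn fn, void *ctx) {
    assert(fn != NULL);
    if (n == NULL) return;
    visit(n->left, fn, ctx);
    fn(n->value, ctx);
    visit(n->right, fn, ctx);
}

/* Example callback: accumulate the values into a long. */
void add_to_sum(int value, void *ctx) {
    *(long *)ctx += value;
}

/* Build a tree, sum its values through the callback, and clean up. */
long sum_tree(int depth) {
    int next = 1;
    long sum = 0;
    node *root = build_tree(depth, &next);
    visit(root, add_to_sum, &sum);
    free_tree(root);
    return sum;
}
```

A tree with 3 levels has 7 nodes holding the values 1 through 7, so sum_tree(3) yields 28.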

ANTLR Grammar Tip: LL(*) and Left Factoring

Suppose you wish to have ANTLR recognize non-degenerate tuples of expressions, like (x, y) or (f(g), a, b) but not (f) or (). Trailing commas are not allowed. The following formulation of such a rule will likely elicit a complaint (“rule tuple has non-LL(*) decision due to recursive rule invocations reachable from alts 1,2; …”):

tuple : '(' (expr ',')+ expr ')' ;

ANTLR’s error message seems unhelpful at first glance, but it’s trying to point you in the right direction. What ANTLR is saying is that after it reaches a comma, it doesn’t know whether it should expect expr ')' or expr ',', and parsing expr could require looking at an arbitrary number of tokens. ANTLR recommends left-factoring, but the structure of the rule as written doesn’t make it clear what needs to be factored. Re-writing the rule to make it closer to ANTLR’s view makes it clearer what the problem is:

tuple_start : '(' tuple_continue;
tuple_continue
    :    ( expr ',' tuple_continue
    |      expr ')');

Now it’s more obvious where the left-factoring needs to be applied, and how to do it by writing expr once, leaving only the choice of seeing a comma or a close-paren:

tuple_start : '(' tuple_continue;
tuple_continue
    :    expr ( ',' tuple_continue
              | ')'  );

The rewritten expansion corresponds to an alternative structuring of the tuple rule that will satisfy ANTLR:

tuple : '(' expr (',' expr)+ ')';

The ANTLR wiki has a page on removing global backtracking from ANTLR grammars, which has a nice presentation of how to do left-factoring when the opportunity is obvious.
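To see why the left-factored rule needs only one token of lookahead, here is a hand-written recursive-descent recognizer for the same shape. It is only a sketch under simplifying assumptions: expr is reduced to a run of letters, whitespace is not skipped, and the function names are made up for this example, so it accepts (x,y) but does not handle (f(g), a, b).

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the grammar's expr rule: one or more letters. */
bool parse_expr(const char **p) {
    if (!isalpha((unsigned char)**p)) return false;
    while (isalpha((unsigned char)**p)) (*p)++;
    return true;
}

/* Consume the character c if it is next in the input. */
bool expect(const char **p, char c) {
    if (**p != c) return false;
    (*p)++;
    return true;
}

/* tuple : '(' expr (',' expr)+ ')' ;
 * After each expr, the single next character (',' or ')') decides what to do. */
bool parse_tuple(const char **p) {
    assert(p != NULL && *p != NULL);
    if (!expect(p, '(')) return false;
    if (!parse_expr(p)) return false;
    int commas = 0;
    while (expect(p, ',')) {
        if (!parse_expr(p)) return false;  /* rejects trailing commas */
        commas++;
    }
    return commas >= 1 && expect(p, ')');  /* rejects (f) */
}

/* True if the whole string is exactly one tuple. */
bool is_tuple(const char *s) {
    const char *p = s;
    return parse_tuple(&p) && *p == '\0';
}
```

With this simplified expr, is_tuple accepts "(x,y)" and "(a,b,c)" and rejects "(f)", "()" and "(a,b,)".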

O(n^1.2)

I was reviewing Ras Bodik’s slides on parallel browsers and noticed in the section on parallel parsing that chart parsers like CYK and Earley are theoretically O(n^3) in time, but in practice it’s more like O(n^1.2).

I wondered: wait, what does O(n^1.2) mean for practical input sizes? Obviously for large enough n, any polynomial with an exponent greater than 1 will be worse than, say, O(n log n), but what’s the cutoff? So, I turned to the handy Online Function Grapher and saw this:

[Figure: plot of n^1.2 vs n lg n]

That is, for n less than 5 million, n^1.2 < n lg n, and they remain close up to about n = 9 million. So if Bodik’s estimate holds, chart parsing could indeed be a viable parsing strategy for documents of up to several megabytes. Happily, this limit size matches the current and expected sizes of web pages to a T.

The blue line at the bottom shows linear growth. It’s all too easy to think of linearithmic growth as being “almost as good” as linear growth. To quote Jason Evans discussing how replacing his memory allocator’s log-time red-black tree operations with  constant-time replacements yielded appreciable speedups:

“In essence, my initial failure was to disregard the difference between a O(1) algorithm and a O(lg n) algorithm. Intuitively, I think of logarithmic-time algorithms as fast, but constant factors and large n can conspire to make logarthmic time not nearly good enough.”
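The crossover between n^1.2 and n lg n can also be bracketed without a grapher. The sketch below relies on the observation that n^1.2 > n lg n is equivalent to n > (lg n)^5, which for n = 2^k becomes the integer comparison 2^k > k^5. The function names are made up for this example, and restricting n to powers of two means it only brackets the crossover rather than pinpointing it.

```c
#include <assert.h>

/* For n = 2^k: n^1.2 > n lg n  <=>  n^0.2 > lg n  <=>  2^k > k^5. */
int n12_exceeds_nlgn(unsigned k) {
    assert(k < 64);  /* keep the shift well-defined */
    unsigned long long pow2 = 1ULL << k;
    unsigned long long k5 = (unsigned long long)k * k * k * k * k;
    return pow2 > k5;
}

/* Smallest k >= 2 such that n = 2^k has n^1.2 > n lg n.
 * k = 0 and k = 1 are skipped: for n = 1 and n = 2, lg n is still so
 * small that n^1.2 is already the larger of the two. */
unsigned crossover_exponent(void) {
    unsigned k = 2;
    while (k < 63 && !n12_exceeds_nlgn(k)) k++;
    return k;
}
```

The loop stops at k = 23, since 2^22 = 4,194,304 is below 22^5 = 5,153,632 while 2^23 = 8,388,608 is above 23^5 = 6,436,343. So the crossover lies between roughly 4.2 million and 8.4 million, which agrees with the reading from the graph.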

Thirty Inches a Year

Next week is the one-year anniversary (if you will) of my purchase of a Dell 3007WFP-HC monitor. I snagged a refurbished model for about 60% of the going retail price.

While I’m overall happy with the purchase, and do not regret it, the pragmatics of such a large monitor have turned out to be not quite as rosy as I had initially expected.

The good:

  • With Chrome’s minimal UI, I can fit two vertically-maximized PDFs on the screen at once, or three if I squeeze them down to “merely” life-size. (The screen is about 25% taller and 3 times wider than a sheet of 8.5 x 11” paper.) Making it easier to read research papers was one of my reasons for getting the screen, and this aspect has worked out nicely.
  • Eclipse and Visual Studio will eat as much screen space as you can give them. Having a 30” monitor means getting two pages of code side-by-side instead of just one. A portrait-mode 24” monitor at 1200 x 1920 can show one page with 100 vertical lines of code; 2560 x 1600 shows two pages with 80 vertical lines of code.
  • Two displays are easier to drive from one graphics card than three.  This means that a thirty inch monitor plus one is easier to set up than three smaller displays.

The bad:

  • Managing four million pixels turns out to be much more difficult than managing two million pixels. Because I have so much space, I tend to be less disciplined about keeping fewer windows open. I have about 150 Chrome tabs spread across 15 top-level windows, plus six PuTTY sessions and six Explorer windows. I sometimes feel like I need a machete to hack my way through the jungle I created. There are apps that try to help keep windows aligned and orderly, but I haven’t managed to use them successfully.
  • It’s hard to find wallpaper for 30” displays!
  • The sweet spot for monitor prices on a pixels-per-dollar scale is (still) at 24” monitors. For less than the price of a heavily discounted refurbished 30” monitor, you can get two 24” monitors, and put one or both in portrait mode. For the price of a new 30” display, you can get a smaller monitor or two and a desktop or laptop to go along with it…

Would I recommend a 30” monitor to fellow programmers? I’m not sure. Two 24” monitors, or a 27” and a 24”, might make more sense. It depends on your personal opportunity cost for the extra premium over a nice 24” monitor. That extra money could be a few extra cores and double the RAM, or a dedicated box for running tests, or a nice suit and wool pea coat.

All in all, the upgrade from 20” to 24” made a much bigger difference than the upgrade from 24” to 30”. Make of that what you will.

Bootstrapping Compilers and T-diagrams

I came across a very nice notation in the book Basics of Compiler Design that greatly clarified the various choices for bootstrapping a compiler. The notation was originally created by Harvey Bratman in 1961 (!), but every other reference I’ve found on the subject is just terrible for conveying understanding. Mogensen’s presentation, on the other hand, is refreshingly clear.

The rules for T-diagrams are very simple. A compiler written in some language “C” (could be anything from machine code on up) that translates programs in language A to language B looks like this (these diagrams are from Torben Mogensen’s freely-available book on compilers):

Now suppose you have a machine that can directly run HP machine code, and a compiler from ML to HP machine code, and you want to get an ML compiler running on different machine code P. You can start by writing an ML-to-P compiler in ML, and compile that to get an ML-to-P compiler in HP:

From there, feed the ML source of the ML-to-P compiler to its newly compiled self, running on the HP machine, and you end up with an ML-to-P compiler that runs on a P machine!
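The two bootstrapping steps can be checked mechanically by treating each T-diagram as a triple of languages. The sketch below is my own modelling, not something taken from the book: the type and function names are hypothetical, and a machine is represented simply by the language it executes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { LANG_ML, LANG_HP, LANG_P } lang;

/* One T-diagram: translates `from` into `to`, and is itself written in `impl`. */
typedef struct { lang from, to, impl; } compiler;

/* Run compiler `c` on `machine`, feeding it the source of compiler `prog`.
 * This works only if the machine can execute c, and c accepts the language
 * that prog is written in. The output is prog, re-expressed in c's target. */
bool compile(lang machine, compiler c, compiler prog, compiler *out) {
    assert(out != NULL);
    if (c.impl != machine) return false;
    if (prog.impl != c.from) return false;
    out->from = prog.from;
    out->to = prog.to;
    out->impl = c.to;
    return true;
}

/* The two steps described in the text, both executed on the HP machine. */
bool bootstrap_ml_to_p(compiler *result) {
    compiler ml_to_hp_in_hp = { LANG_ML, LANG_HP, LANG_HP };  /* existing compiler */
    compiler ml_to_p_in_ml  = { LANG_ML, LANG_P,  LANG_ML };  /* newly written */
    compiler ml_to_p_in_hp;

    /* Step 1: compile the new compiler with the existing one. */
    if (!compile(LANG_HP, ml_to_hp_in_hp, ml_to_p_in_ml, &ml_to_p_in_hp))
        return false;
    /* Step 2: feed the same ML source to the result of step 1. */
    return compile(LANG_HP, ml_to_p_in_hp, ml_to_p_in_ml, result);
}
```

After both steps, the result translates ML to P and is itself implemented in P, which is the compiler we wanted.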

The T-diagram notation can also easily be extended with interpreters, which are simply vertical boxes that can be inserted between any two languages. For example, here’s a diagram depicting a machine D running an interpreter for language C that is, in turn, running a compiler in language C to translate a program from language A to language B:

After thinking about bootstrapping in an ad-hoc way, I just love the structure that these diagrams provide! Unfortunately, while the book itself is freely available online, there’s no HTML version of the book, only a PDF version. The easiest way to see the full presentation is probably to use Scribd and search for “t-diagram”.

A Snippet of Functional C

MurmurHash2 is a pretty nifty bit of code: excellent performance in both the statistical and chronological senses, and under the ultra-liberal MIT license.

The kernel of the hash function comprises these 5 lines of C code. k is the next four bytes to be hashed, m and r are experimentally-found constants, and h is the accumulated hash value:

word mrmr_step(word k,
        word h, word m, word r) {
/*1*/ k *= m;
/*2*/ k ^= k >> r;
/*3*/ k *= m;
/*4*/ h *= m;
/*5*/ h ^= k;
      return h;
}

Like much low-level systems code in C, this takes advantage of mutability: the value of k in line 3 is not the same as k in line 5. But what if you wanted to do this computation without mutability, or even assignment? Even in the land of C, the rules, they are a-changin’ [pdf].

At first blush, it seems strange to do first-order functional programming in a language like C, which doesn’t provide much support for it. But it’s precisely because there’s so little magic to C that it’s such a good vehicle for seeing how things work under the covers. The real point of this post is not so much how to write functional code as to explore how to compile functional code.

Anyways, back to the code. Mutability and immutability are sometimes interchangeable; for example, a += 3; a *= 5; a -= 1; does the same thing as a = ((a + 3) * 5) - 1. The key is the linear dependency chain: each mutated a value is used in precisely one place.

Examining the MurmurHash2 kernel above, most of the dependencies are linear. But there’s one branching dependency, from line 1 to line 2. There are two C-like ways of dealing with this. First, we could "copy" line 1 into line 2:

k = (k*m) ^ ((k*m) >> r);

The second way would be to introduce a new variable:

word km = k * m; km ^= km >> r;

But option 1 breaks down if line 1 has side-effects, and option 2 introduces a new assignment, which we wanted to avoid!

The trick to see is that functions provide variable bindings. Thus, we can translate the C code to functional pseudocode like so:

/*1*/ bind km  to k * m in
/*2*/ bind kx  to km ^ (km >> r) in
/*3*/ bind kxm to kx * m in
/*4*/ bind hm  to h * m in
/*5*/ hm ^ kxm;

Now, in this static single assignment form, it’s super easy to see what can be safely inlined: the bindings that are used only once, such as kxm in line 3 and hm in line 4. So we might optimize this to:

/*1*/ bind km to k * m in
/*2*/ bind kx to km ^ (km >> r) in
/*5*/ (h * m) ^ (kx * m);

I’ve left line 2 alone for clarity, but it’s easy to see that it could be trivially substituted into line 5. Now, how to translate back to C? In a language with anonymous functions and closures, it’s easy. Just like Scheme translates let into lambda, we could translate bind varname to arg in expr to JavaScript like (function (varname) { return expr; })(arg). So let’s do that to start:

(function          line2(km) {
  return (function line5(kx) {
/*5*/  return (h*m) ^ (kx*m);
/*2*/ })(km ^ (km >> r));
/*1*/ })(k * m);

Note that the code now reads, in some sense, bottom-to-top. Notice also that we’ve created closures, because the line marked 5 references h, which is defined in a parent scope (not shown). So this code isn’t valid C, because C does not have first-class functions. Instead, all C functions must be declared at the top level. Luckily for us, translating the above code to C is easy. We move the function definitions to the top level, and add parameters for their free variables. (This process is called closure conversion, or sometimes lambda lifting.) We’re left with this purely functional C code:

word line2(word km, word r) {
  return km ^ (km >> r);
}
word line5(word h, word kx, word m) {
  return (h * m) ^ (kx * m);
}
word mrmr_step_f(word k, word h,
                 word m, word r) {
  return line5(h, line2(k*m, r), m);
}

This is the code that would be generated by a compiler, given input like the 3-line pseudocode from above. Now, the interesting bit is: how efficient is this version, compared to the original? Since this is just C, we can use gcc’s -S flag to compile both versions to optimized assembly and compare:


_Z11mrmr_step_fiiii:
.LFB970:
        imull   %edx, %edi
        movl    %edi, %eax
        sarl    %cl, %eax
        xorl    %edi, %eax
        imull   %edx, %eax
        imull   %esi, %edx
        xorl    %edx, %eax
        ret

_Z9mrmr_stepiiii:
.LFB971:
        imull   %edx, %edi
        imull   %edx, %esi
        movl    %edi, %eax
        sarl    %cl, %eax
        xorl    %edi, %eax
        imull   %edx, %eax
        xorl    %esi, %eax
        ret

Pretty cool! The assembly from the purely functional version is essentially identical to the low-level imperative code: the two listings differ only in where the independent h * m multiply is scheduled and which register holds its result. A modern out-of-order CPU should execute both versions in essentially the same way.
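Matching assembly is strong evidence, but it is also easy to check directly that the two versions compute the same value. The sketch below makes one assumption that differs from the listings above: word is taken to be an unsigned 32-bit integer, so that multiplication overflow wraps around instead of being undefined behaviour. The function names and the test-input generator are made up for this example; m and r are set to MurmurHash2's constants.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: unsigned 32-bit words, so overflow is well-defined. */
typedef uint32_t word;

/* The original imperative kernel. */
word step_imperative(word k, word h, word m, word r) {
    k *= m;
    k ^= k >> r;
    k *= m;
    h *= m;
    h ^= k;
    return h;
}

/* The closure-converted pieces: free variables became parameters. */
word apply_line2(word km, word r) {
    assert(r < 32);  /* shift amount must be smaller than the word size */
    return km ^ (km >> r);
}

word apply_line5(word h, word kx, word m) {
    return (h * m) ^ (kx * m);
}

word step_functional(word k, word h, word m, word r) {
    return apply_line5(h, apply_line2(k * m, r), m);
}

/* Returns 1 if both versions agree on `count` varied inputs, 0 otherwise. */
int steps_agree(unsigned count) {
    const word m = 0x5bd1e995u, r = 24;  /* MurmurHash2's m and r */
    word k = 1, h = 0;                   /* arbitrary starting values */
    for (unsigned i = 0; i < count; i++) {
        word expected = step_imperative(k, h, m, r);
        if (step_functional(k, h, m, r) != expected) return 0;
        h = expected;                    /* chain the hash value */
        k = k * 1664525u + 1013904223u;  /* simple LCG to vary k */
    }
    return 1;
}
```

Calling steps_agree(1000) exercises both versions on a thousand chained inputs.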

Update 2010/02/11

There are two ways of handling the “extra” arguments added by closure conversion. The way above, where free variables become parameters, is easy to do manually, and clearer for small numbers of parameters. The alternative is to create a separate structure holding the free variables, and pass a pointer to the structure. GCC is strong enough to compile this example to the same assembly as well. Not bad!

typedef struct { word r; } mrmr_step2_env;
static inline word
step2(mrmr_step2_env* env, word km) {
  return km ^ (km >> env->r);
}

typedef struct { word h; word m; } mrmr_step5_env;
static inline word
step5(mrmr_step5_env* env, word kx) {
  return (env->h * env->m) ^ (kx * env->m);
}

word mrmr_step_f(word k, word h,
                 word m, word r) {
  mrmr_step2_env e2; e2.r = r;
  mrmr_step5_env e5; e5.h = h; e5.m = m;
  return step5(&e5, step2(&e2, k*m));
}

watch.py – run command when file changes

I decided that Ian Piumarta’s tiny watch utility looked nifty, and that it would be nice to have a cross-platform, no-compile-required version. Granted, watch.c compiles using default make rules, so “no compile” is a bit arbitrary.

Anyways, the result is watch.py.

Oddly enough, the two implementations have the exact same physical line counts!

C++ was a victim of its own success

I came across an interesting paragraph when reading Bjarne Stroustrup’s 15-year retrospective, Evolving a language in and for the real world: C++ 1991-2006. This notion hadn’t occurred to me before, but it seems quite plausible.

There was also a curious problem with performance: C++ was too efficient for any really significant gains to come easily from research. This led many researchers to migrate to languages with glaring inefficiencies for them to eliminate. Elimination of virtual function calls is an example: You can gain much better improvements for just about any object-oriented language than for C++. The reason is that C++ virtual function calls are very fast and that colloquial C++ already uses non-virtual functions for time-critical operations. Another example is garbage collection. Here the problem was that colloquial C++ programs don’t generate much garbage and that the basic operations are fast. That makes the fixed overhead of a garbage collector looks far less impressive when expressed as a percentage of run time than it does for a language with less efficient basic operations and more garbage. Again, the net effect was to leave C++ poorer in terms of research and tools.

– Bjarne Stroustrup

For a peek at an example of “glaring inefficiencies,” check out what two IBM researchers have found about memory use in Java, and prepare to throw up in your mouth a little.

Building protobuf with MinGW GCC 4.4.0

So, after tearing my hair out for an hour this morning, I finally managed to build protobuf using MinGW and GCC 4.4.0, using the msys install from mozilla-build 1.4.

The two major sticking points are:

  1. mozilla-build includes autoconf 2.59, whereas protobuf requires autoconf 2.61 at minimum. Solution: download the msys autoconf 2.63 package, install it to /usr/local/bin/autoconf-2.63/, and temporarily export a prefixed $PATH for building protobuf.
  2. protobuf attempts to link against /c/mozilla-build/msys/mingw/lib/gcc/mingw32/4.4.0/libstdc++.dll.a, which fails because that file doesn’t exist. Solution: edit the file libstdc++.la in that directory, replacing libstdc++.dll.a with libstdc++.a in the definition for library_names.