eschew it all

Thirty Inches a Year

March 4, 2010 · 1 Comment

Next week is the one-year anniversary (if you will) of my purchase of a Dell 3007WFP-HC monitor. I snagged a refurbished model for about 60% of the going retail price.

While I’m overall happy with the purchase, and do not regret it, the pragmatics of such a large monitor have turned out to be not quite as rosy as I had initially expected.

The good:

  • With Chrome’s minimal UI, I can fit two vertically-maximized PDFs on the screen at once, or three if I squeeze them down to “merely” life-size. (The screen is about 25% taller and 3 times wider than a sheet of 8.5 x 11” paper). Making it easier to read research papers was one of my reasons for getting the screen, and this aspect has worked out nicely.
  • Eclipse and Visual Studio will eat as much screen space as you can give them. Having a 30” monitor means getting two pages of code side-by-side instead of just one. A portrait-mode 24” monitor at 1200 x 1920 can show one page with 100 vertical lines of code; 2560 x 1600 shows two pages with 80 vertical lines of code.
  • Two displays are easier to drive from one graphics card than three.  This means that a thirty inch monitor plus one is easier to set up than three smaller displays.

The bad:

  • Managing four million pixels turns out to be much more difficult than managing two million pixels. Because I have so much space, I tend to be less disciplined about keeping fewer windows open. I have about 150 Chrome tabs spread across 15 top level windows, plus six PuTTY sessions and six Explorer windows. I sometimes feel like I need a machete to hack my way through the jungle I created. There are apps that try to help keep windows aligned and orderly, but I haven’t managed to use them successfully.
  • It’s hard to find wallpaper for 30” displays!
  • The sweet spot for monitor prices on a pixels-per-dollar scale is (still) at 24” monitors. For less than the price of a heavily discounted refurbished 30” monitor, you can get two 24” monitors, and put one or both in portrait mode. For the price of a new 30” display, you can get a smaller monitor or two and a desktop or laptop to go along with it…

Would I recommend a 30” monitor to fellow programmers? I’m not sure. Two 24” monitors, or a 27” and a 24”, might make more sense. It depends on your personal opportunity cost for the extra premium over a nice 24” monitor. That extra money could be a few extra cores and double the RAM, or a dedicated box for running tests, or a nice suit and wool pea coat.

All in all, the upgrade from 20” to 24” made a much bigger difference than the upgrade from 24” to 30”. Make of that what you will.

→ 1 CommentCategories: Uncategorized

Bootstrapping Compilers and T-diagrams

February 28, 2010 · Leave a Comment

I came across a very nice notation in the book Basics of Compiler Design that greatly clarified the various choices for bootstrapping a compiler. The notation was originally created by Harvey Bratman in 1961 (!) , but every other reference I’ve found on the subject is just terrible for conveying understanding. Mogensen’s presentation, on the other hand, is refreshingly clear.

The rules for T-diagrams are very simple. A compiler written in some language “C” (could be anything from machine code on up) that translates programs in language A to language B looks like this (these diagrams are from Torben Mogensen’s freely-available book on compilers):

Now suppose you have a machine that can directly run HP machine code, and a compiler from ML to HP machine code, and you want to get a ML compiler running on different machine code P. You can start by writing an ML-to-P compiler in ML, and compile that to get an ML-to-P compiler in HP:

From there, feed the new ML-to-P compiler to itself, running on the HP machine, and you end up with an ML-to-P compiler that runs on a P machine!

The T-diagram notation can also easily be extended with interpreters, which are simply vertical boxes that can be inserted between any two languages. For example, here’s a diagram depicting a machine D running an interpreter for language C that is, in turn, running a compiler in language C to translate a program from language A to language B:

After thinking about bootstrapping in an ad-hoc way, I just love the structure that these diagrams provide! Unfortunately, while the book itself is freely available online, there’s no HTML version of the book, only a PDF version. The easiest way to see the full presentation is probably to use Scribd and search for “t-diagram”.

→ Leave a CommentCategories: Uncategorized

A Snippet of Functional C

February 10, 2010 · Leave a Comment

MurmurHash2 is a pretty nifty bit of code: excellent performance in both the statistical and chronological senses, and under the ultra-liberal MIT license.

The kernel of the hash function comprises these 5 lines of C code. k is the next four bytes to be hashed, m and r are experimentally-found constants, and h is the accumulated hash value:

word mrmr_step(word k,
        word h, word m, word r) {
/*1*/ k *= m;
/*2*/ k ^= k >> r;
/*3*/ k *= m;
/*4*/ h *= m;
/*5*/ h ^= k;
      return h;
}

Like much low-level systems code in C, this takes advantage of mutability: the value of k in line 3 is not the same as k in line 5. But what if you wanted to do this computation without mutability, or even assignment? Even in the land of C, the rules, they are a-changin’ [pdf].

At first blush, it seems strange to do first-order functional programming in a language like C, which doesn’t provide much support for it. But it’s precisely because there’s so little magic to C that it’s such a good vehicle for seeing how things work under the covers. The real point of this post is not how to write functional code, as to explore how to compile functional code.

Anyways, back to the code. Mutability and immutability are sometimes interchangeable; for example, a += 3; a *= 5; a -= 1; does the same thing as a = ((a + 3) * 5) - 1. The key is the linear dependency chain: each mutated a value is used in precisely one place.

Examining the MurmurHash2 kernel above, most of the dependencies are linear. But there’s one branching dependency, from line 1 to line 2. There are two C-like ways of dealing with this. First, we could "copy" line 1 into line 2:

k = (k*m) ^ ((k*m) >> r);

The second way would be to introduce a new variable:

int km = k * m; km ^= km >> r;

But option 1 breaks down if line 1 has side-effects, and line 2 introduces a new assignment, which we wanted to avoid!

The trick to see is that functions provide variable bindings. Thus, we can translate the C code to functional pseudocode like so:

/*1*/ bind km  to k * m in
/*2*/ bind kx  to km ^ (km >> r) in
/*3*/ bind kxm to kx * m in
/*4*/ bind hm  to h * m in
/*5*/ hm ^ kxm;

Now, in this static single assignment form, it’s super easy to see what can be safely inlined; they are bindings that are only used once, such as hm in line 4. So we might optimize this to:

/*1*/ bind km to k * m in
/*2*/ bind kx to km ^ (km >> r) in
/*5*/ (h * m) ^ (kx * m);

I’ve left line 2 alone for clarity, but it’s easy to see that it could be trivially substituted into line 5. Now, how to translate back to C? In a language with anonymous functions and closures, it’s easy. Just like Scheme translates let into lambda, we could translate bind varname to arg in expr to JavaScript like (function (varname) { return expr; })(arg). So let’s do that to start:

(function          line2(km) {
  return (function line5(kx) {
/*5*/  return (h*m) ^ (kx*m);
/*2*/ })(km ^ (km >> r));
/*1*/ })(k * m);

Note that the code now reads, in some sense, bottom-to-top. Notice also that we’ve created closures, because the line marked 5 references h, which is defined in a parent scope (not shown). So this code isn’t valid C, because C does not have first-class functions. Instead, all C functions must be declared at the top level. Luckily for us, translating the above code to C is easy. We move the function definitions to the top level, and add parameters for their free variables. (This process is called closure conversion, or sometimes lambda lifting.) We’re left with this purely functional C code:

word line1(word km, word r) {
  return km ^ (km >> r);
}
word line5(word h, word kx, word m) {
  return (h * m) ^ (kx * m);
}
word mrmr_step_f(word k, word h,
                 word m, word r) {
  return line5(h, line2(k*m, r), m);
}

This is the code that would be generated by a compiler, given input like the 3-line pseudocode from above. Now, the interesting bit is: how efficient is this version, compared to the original? Since this is just C, we can use gcc’s -S flag to compile both versions to optimized assembly and compare:


_Z11mrmr_step_fiiii:
.LFB970:
        imull   %edx, %edi
        movl    %edi, %eax
        sarl    %cl, %eax
        xorl    %edi, %eax
        imull   %edx, %eax
        imull   %esi, %edx
        xorl    %edx, %eax
        ret

_Z9mrmr_stepiiii:
.LFB971:
        imull   %edx, %edi
        imull   %edx, %esi
        movl    %edi, %eax
        sarl    %cl, %eax
        xorl    %edi, %eax
        imull   %edx, %eax
        xorl    %esi, %eax
        ret

Pretty cool! The assembly from the purely functional version is essentially identical to the low-level imperative code. And a modern out-of-order CPU will ensure that the dynamic execution is, in fact, identical for both versions.

Update 2010/02/11

There are two ways of handling the “extra” arguments added closure conversion. The way above, where free variables become parameters, is easy to do manually, and clearer for small numbers of parameters. The alternative is to create a separate structure holding the free variables, and pass a pointer to the structure. GCC is strong enough to compile this example to the same assembly as well. Not bad!

struct mrmr_step2_env { word r; };
static inline word
step2(mrmr_step2_env* env, word km) {
  return km ^ (km >> env->r);
}

struct mrmr_step5_env { word h; word m; };
static inline word
step5(mrmr_step5_env* env, word kx) {
  return (env->h * env->m) ^ (kx * env->m);
}

word mrmr_step_f(word k, word h,
                 word m, word r) {
  mrmr_step2_env e2; e2.r = r;
  mrmr_step5_env e5; e5.h = h; e5.m = m;
  return step5(&e5, step2(&e2, k*m));
}

→ Leave a CommentCategories: Uncategorized

watch.py – run command when file changes

September 27, 2009 · Leave a Comment

I decided that Ian Piumarta’s tiny watch utility looked nifty, and that it would be nice to have a cross-platform, no-compile-required version. Granted, watch.c compiles using default make rules, so “no compile” is a bit arbitrary.

Anyways, the result is watch.py.

Oddly enough, the two implementations have the exact same physical line counts!

→ Leave a CommentCategories: Uncategorized

C++ was a victim of its own success

September 26, 2009 · 1 Comment

I came across an interesting paragraph when reading Bjarne Stroustrup’s 15-year retrospective, Evolving a language in and for the real world: C++ 1991-2006. This notion hadn’t occurred to me before, but it rings quite plausible.

There was also a curious problem with performance: C++ was too efficient for any really significant gains to come easily from research. This led many researchers to migrate to languages with glaring inefficiencies for them to eliminate. Elimination of virtual function calls is an example: You can gain much better improvements for just about any object-oriented language than for C++. The reason is that C++ virtual function calls are very fast and that colloquial C++ already uses non-virtual functions for time-critical operations. Another example is garbage collection. Here the problem was that colloquial C++ programs don’t generate much garbage and that the basic operations are fast. That makes the fixed overhead of a garbage collector looks far less impressive when expressed as a percentage of run time than it does for a language with less efficient basic operations and more garbage. Again, the net effect was to leave C++ poorer in terms of research and tools.

– Bjarne Stroustrup

For a peek at an example of “glaring inefficiencies,” check out what two IBM researchers have found about memory use in Java, and prepare to throw up in your mouth a little.

→ 1 CommentCategories: Uncategorized

Building protobuf with MinGW GCC 4.4.0

September 20, 2009 · 1 Comment

So, after tearing my hair out for an hour this morning, I finally managed to build protobuf using MinGW and GCC 4.4.0, using the msys install from mozilla-build 1.4.

The two major sticking points are

  1. mozilla-build includes autoconf 2.59, whereas protobuf requires autoconf 2.61 at minimum. Solution: download the msys autoconf 2.63 package, and install to /usr/local/bin/autoconf-2.63/, and temporarily export a prefixed $PATH for building protobuf.
  2. protobuf attempts to link against /c/mozilla-build/msys/mingw/lib/gcc/mingw32/4.4.0/libstdc++.dll.a, which fails because that file doesn’t exist. Solution: edit the file libstdc++.la in that directory, replacing libstdc++.dll.a with libstdc++.a in the definition for library_names

→ 1 CommentCategories: Uncategorized

Introducing Canvas Cacheviz

September 13, 2009 · Leave a Comment

Eric Dramstad and I have been working on a new canvas-based tool. This time, it’s a cache visualizer.

Here’s what a naive matrix transposition looks like:naive-transpose

On the left is the sequence of memory touches; black pixels are touched first, and white pixels are touched last. The left side shows that each row of the upper triangle is accessed in tandem with the corresponding column of the lower triangle. On the right are the cache hits and misses. The upper triangle has a cache-friendly sequential access pattern, while the lower triangle has a cache-unfriendly strided access pattern. 56% of the memory accesses result in a cache miss.

In contrast, here’s a cache-oblivious matrix transposition:cache-oblivious-transpose

The left side shows the divide-and-conquer approach. Small blocks are transposed rather than whole rows and columns, which helps with cache locality, as reflected on the right. In this instance, only 28% of the memory touches were cache misses, a significant improvement over the naive algorithm.

 

There are several improvements I’d eventually like to make. One major shortcoming is that the cache model is only approximately like the cache system found in your average x86 chip. We currently only model a single-level fully-associative cache. We should eventually refine that model to a two-level set-associative cache, of configurable level sizes. I’d also like to show the cache effects of various sorting algorithms, like one-pivot quicksort versus insertion sort versus two-pivot quicksort.

The source code for canvas-cacheviz is hosted on GitHub. So far my experience with git has been negative in comparison to Mercurial. Git may be more powerful, but it has been all too easy to do strange and unwanted things with the repository DAG. And even simple questions like “what’s the equivalent of svn revert?” were difficult to answer using the tool’s help listing.

→ Leave a CommentCategories: Uncategorized

Summer 2009: Time and Books

September 6, 2009 · 2 Comments

When I started working this summer, I discovered three things, with varying levels of surprise:

  1. I had significantly less free time than I had envisioned. Buying groceries and cooking takes a lot more time than I thought!
  2. I was significantly more productive than I had envisioned. Having less time necessitated making it more structured and more efficient.
  3. I ended up getting more done even though I had less time to do it.

I would cook, read books, go to the gym, and do a little side programming during the week, and on the weekend I’d visit my parents (and usually mow their lawn for them). At school, I had three or four times as much free time (at least!), but I didn’t get as much done in a concrete sense. I spent much more time browsing proggit and Google Reader. Not time wasted, per se, but on the whole I was not as productive as I was under greater restrictions.

Books I read this summer:

  • Confessions of an Economic Hit Man
  • Life in Double Time
  • All seven Harry Potter books, including the sixth consumed in one glorious day at the beach
  • Types and Programming Languages (casually, at least)
  • A Thousand Splendid Suns
  • One Hundred Years of Solitude
  • and I’ve started reading more:
    • The Gold Bug Variations
    • Advanced Types and Programming Languages
    • Masterminds of Programming
    • Design Concepts in Programming Languages
    • Refactoring

Even conservatively counting the first three books of Harry Potter as one, that’s still ten books finished and five more started in the last three months. Much better than the, um, one or two that I probably read during the last semester of my undergrad program.

→ 2 CommentsCategories: Uncategorized

Canvas Spline Toys

September 2, 2009 · Leave a Comment

I am pretty busy this week. For the past few days, I’ve been doing 8 or more hours of research, as well as packing and organizing my physical and virtual possessions. I hope to have several more posts over the next few days detailing some of the projects, medium and small, that I have worked on in the last few months.

The latest project is a simple interactive spline visualizer. I had a random itch at work to have a way of seeing splines constructed one-piece-at-a-time, so I created an initial mockup using the HTML canvas object. Eric Dramstad extended it with mouse support for rearranging control points, as well as cleaner code.

The blue line is a conventional second-order linear interpolation. From an OpenGL standpoint, it has the nice feature that the interpolated line lies entirely within the triangle defined by the three control points. The corresponding disadvantage is that the line does not pass through the second control point. The green line does the reverse. Instead of interpolating between L1(t) and L2(t), it interpolates between L1(t) and L2(-1 + 2*t). The green spline passes through all three control points, but is thus not nicely contained like the blue spline.

An interested programmer could extend this without too much difficulty. It would be particularly cool to be able to (A) add additional control points, and (B) interactively define new interpolation routines. Even without such features, Eric Dramstad’s interactive version makes for a surprisingly entertaining minute or two!

→ Leave a CommentCategories: Uncategorized

Sound and Complete

August 31, 2009 · Leave a Comment

A jury reaches a verdict: guilty! But was she, in fact, innocent after all?

Throughout life, we must make decisions: evaluate, then accept or reject. Life being what it is, there are more ways to screw up than not. The field of formal logic gives us two properties we can use to describe any evaluation procedure, such as the jury above: soundness and completeness. A sound jury will never send a guilty person free, while a complete jury will never send an innocent person to jail. (As an aside, I think the concrete example of a jury and its verdicts is much easier to grok than Wikipedia’s treatment, which says “Informally, a soundness theorem for a deductive system expresses that all provable sentences are true. Completeness states that all true sentences are provable.”) Intuitively, soundness and completeness are orthogonal: you can have any combination of (non-) soundness and (non-) completeness.

The ideal, of course, is sound and complete: the perfect justice system that will never exist. What we have is neither sound nor complete: innocent people are sent to jail, and guilty people go free.

Anyways, soundness and completeness are also pervasively used in static analysis and programming language type theory, each of which are close siblings to formal logic. Well, type theory is more like a mirrored reflection of formal logic, but that’s neither here nor there. A sound type system (or analysis) is one that rejects all erroneous programs, and a complete system accepts all correct programs.

Now, the interesting question is: which is more important? Is it better to have a complete but unsound static analysis, which will report every error that exists, and then maybe some errors that don’t exist? Or better to have the opposite, sound and incomplete, where every error that it reports is an actual error, but it might not find all the issues in the program? I happen to think sound and incomplete is better: first, because sound analyses compose, and second, because many of the properties you really want to have verified (such as “does this rendering look right?”) don’t fall in the scope of the verification system. Complete and unsound certainly has its appeal: after working through all the reported errors (and dealing with the wild goose chases), you’ll know that your program is free of all the bugs that the analysis is capable of finding. Of course, completeness is unachievable for many of the most interesting and critical problems, whereas sound-but-incomplete is more within reach.

What about for type systems? A sound type system is one that is not lenient, in that it rejects all invalid programs plus some number of valid programs. A complete type system accepts all valid programs, and some invalid ones to boot. Which to pick?

The situation isn’t quite as obvious as with a static analysis, for two reasons. The first difference is that a programming language with a complete and unsound (static) type system has a fallback in the form of dynamic typing. In fact, one of the most powerful type system features, dependent typing, is not statically verifiable unless carefully restricted. Dependent type systems therefore blur the distinction between compile-time type checking and runtime type checking, and thus between static and dynamic typing. The second complication is that the arithmetic mean programmer tends to value expressiveness rather more than compile-time verification. Programmers dislike having the computer reject a program that would have run fine, simply because the computer couldn’t make sure it would run fine without actually running it. In short, restrictive type systems drive programmers to more flexible environments. Consider how Mozilla Firefox is designed: a core of statically-typed C++ to do heavy lifting, with dynamically-typed JavaScript running the user interface. Not coincidentally, this static core/dynamic shell is shared by many large video games (like World of WarCraft: C++ and Lua).

The target domain matters, of course. One hopes that the language for a program controlling an airliner’s fuel supply preferred soundness over completeness. And type inference raises another issue: there exist type systems that are sound and complete, but are not amenable to type inference.

Decisions, decisions…

p.s. Since I’m a visual learner, I created a reference chart:

Chart of soundness and completeness properties

Chart of soundness and completeness properties

→ Leave a CommentCategories: Uncategorized

Overview of Mozilla’s C++ Static Analysis Tools

May 16, 2009 · Leave a Comment

Manually making pervasive changes in a four million line codebase is all but impossible. Thus, in a bid to make far-reaching changes to the internals of their code, Mozilla has been actively pursuing better static analysis tools.

It turns out that analyzing C++ is remarkably difficult; the stuff of patents and at least one doctoral dissertation. The root cause is the painfulness of parsing C++. First, C++ is simply a large language, syntactically speaking, and due to vendor extensions, there is no single point of reference for the syntax found in the real world. Second, C++ is difficult to parse with automated tools. The C++ grammar, such that it is, does not fit into any of the usual formal language classes like LL(k) or LR(k). This parsing difficulty is partly due to its C heritage. To give one simple example of a problematic construction: return (a)(b); could involve a cast of b to type a, or a function call a(b). Which one applies depends on the type of a, but in a classical compiler, type information is only available after parsing completes! Finally, preprocessor macros create a disconnect between what the user sees in a source file and what the compiler sees in a translation unit. Any analysis tools must either bridge that divide, or risk being rejected for not properly handling macros.

Mozilla has several static analysis projects it runs, which build upon several more in turn. The tools are split into two related but separate categories: static analysis and rewriting. The following is my understanding of the current tool space:

Dehydra is a lightweight static analysis tool for C++. It is implemented as a GCC plugin that passes C++ AST information to JavaScript code, which can examine the provided AST nodes. Among other things, Dehydra is used as the smarts behind a very cool semantic Mozilla code browser called DXR (Dehydra Cross Reference).

Treehydra is Dehydra’s heavy duty companion. It deals with GCC’s internal GIMPLE intermediate representation. This allows Treehydra analyses to be run after any optimization pass in GCC. Treehydra scripts also get access to GIMPLE control flow graphs (CFGs). Access to CFGs enables more advanced path-based analyses. In effect, anything GCC “knows” about a piece of C++ is available to JavaScript code through Treehydra.

Right now, Dehydra and Treehydra require a custom, patched version of GCC. However, in the future, starting with GCC 4.5, plugin support (thanks largely to Mozilla’s Taras Glek!) will obviate the need for a patched GCC. That’s very exciting, especially because it will mean that static analysis of C++ code will become available to orders of magnitude more programmers!

Unfortunately, the *hydras are not sufficient for the task of automated rewriting, because GCC does not preserve all the necessary syntactic information about token positions and macro expansion and such. Thus, more tools are needed.

The primary C++ rewriting tool from Mozilla is called Pork. Pork is a tool for rewriting C++. Unlike GCC, it retains exact syntactic information about the code it parses, as well as the changes made while preprocessing. Pork is built on several other tools: Oink, Elsa, and MCPP. MCPP is a preprocessor that includes annotations in preprocessed source to help reconstruct the original source file. Elsa is a C++ parser, based on the GLR parser generator Elkhound. Finally, Oink is a tool for analyzing the parse trees constructed by Elsa.

Whew! I hope that information is accurate, +/- 10%.

→ Leave a CommentCategories: Uncategorized

Programs I Currently Have In My Tray

May 6, 2009 · Leave a Comment

  • Skype
  • Last.fm
  • iTunes
  • Pageant
  • Twhirl
  • Yahoo! Widgets
  • NetMeter
  • ManicTime
  • ZoneAlarm
  • Logitech Mouse
  • AutoHotkey
  • Mozy Backup
  • Dropbox
  • UltraMon
  • WinSplit Revolution
  • Threeter
  • KeePass
  • Ad-Aware

That’s eighteen(!). If I had to trim the list down to seven, I’d probably choose these as the most useful in day-to-day tasks:

  • iTunes
  • Pageant
  • ManicTime
  • AutoHotkey
  • Mozy Backup
  • UltraMon
  • KeePass

→ Leave a CommentCategories: Uncategorized

Eight Reasons why VirtualBox is Awesome

April 27, 2009 · Leave a Comment

  • Free and Open Source software
  • Cross Platform squared: Run Windows/Linux/Solaris guests on OS X/Windows/Linux/Solaris
  • OpenGL acceleration
  • Support for disk images created with VMWare and Virtual PC
  • Software and hardware virtualization acceleration support
  • Seamless windows for Linux, Solaris, and Windows guests
  • USB 2.0 support
  • Shared folders

→ Leave a CommentCategories: Uncategorized

Rube Goldberg Speech Macros

April 26, 2009 · Leave a Comment

It’s been about a year since Microsoft released Speech Macros for Vista.

One thing I found very interesting was how similar their system seems to what I ended up creating for myself two and a half years ago: XML syntax for defining grammars, and AutoHotkey-like syntax for executable commands! Perhaps there simply aren’t that many ways to do flexible speech macros.

Anyways, the backstory: Sophomore year of college, I had more reading than usual to do for classes. I like reading and listening to music, but my bed was across the room from my desk, and I was growing tired of getting up every fifteen minutes to skip songs that weren’t conducive to reading. “I know!,” I thought. “I’ve got an omnidirectional microphone with an extra-long cord. If I string it under the rug, I can reach my bed, and control iTunes with my voice!” Great vision, but how would I go about doing such a thing?

The first step was scripting iTunes. It turns out that Apple publishes a Windows iTunes COM SDK that allow JScript (Microsoft’s version of JavaScript) to manipulate iTunes programmatically. Awesome! In fact, I had already been using this to automate an action I found myself repeating several times a day. When I heard a song I really didn’t like, I would add delete to the comments field in iTunes, and skip to the next track. Once a week or so, I would delete all the songs with delete in the comments field. It’s a simple hack that worked remarkably well. Anyways, I configured IntelliType to execute the following simple JScript when I pressed the “media” button on the keyboard. This greatly reduced the friction involved with marking songs for deletion.

var iTunesApp = WScript.CreateObject("iTunes.Application");
var currentTrack = iTunesApp.CurrentTrack;
if (currentTrack) {
	iTunesApp.NextTrack();
	currentTrack.Comment = currentTrack.Comment + ' delete';
} // else nothing to do...

The other half of the problem was to invoke that script. Specifically, how could I invoke it with a spoken command? I considered writing a C# program, but I wanted something a little more lightweight. Once again, JScript came to the rescue! This time, it was in the form of Microsoft’s free Speech API SDK 5.1. This SDK provided a very simple speech recognition engine, as well as documentation on how to use the engine from JScript. It was, admittedly, some of the most unfriendly documentation I’d ever seen, but I managed to get up and running after a day or two. If it hadn’t been for the sample code they provided, it probably would have taken closer to a week or two. The SAPI interfaces look like this; by now, I’ve forgotten exactly what each piece does:

var RContext = new ActiveXObject("Sapi.SpSharedRecoContext");
RContext.EventInterests = interests;
WScript.ConnectObject(RContext, "RContext_"); // Hook up event listeners on a by-name basis...
var Grammar = RContext.CreateGrammar();
var Recognizer = RContext.Recognizer;
Recognizer.State = SRSAlwaysActive;

In any case, I soon had a simple JScript-driven speech-to-text echoer. It would listen for English sentences, and print to the screen whatever the engine thought it heard. JScript’s native support for regular expression made it dead easy to search for key words and phrases in the output before it was printed. This would have been enough for my purposes, except for one major problem. Much like people seeing faces in toast, because the speech engine was expecting spoken English, it heard spoken English — even when all that the microphone was picking up was random noise from the room. Since I was using this while listening to music, song lyrics proved to be very confusing to the system as well. I ended up with a bunch of “recognized” phrases like these, or even stranger:

amnesty
the DN
and newington sea and then moves
a ton
of
top
and with Ababa and being

Luckily, there’s a common, easy solution to this problem in the speech recognition world: do less. Specifically, have the engine listen for a limited set of phrases, rather than arbitrary speech. This allows the engine to be much more discriminating when deciding whether or not what it hears in the microphone is one of the few spoken phrases you asked it to listen for.

The Speech API in Windows XP (Vista had yet to appear in retail stores, and even then, I didn’t try Vista for nearly a year) includes a way of creating textual grammars for the speech engine. The main syntax looks like this, with optional phrases <O>, list-of-alternatives <L> and required phrases <P>:

<RULE NAME="bloglines" TOPLEVEL="ACTIVE">
 	<O>please</O>
	<L>
  		<P>go to</P>
		<p>open</p>
	</L>
	<P>bloglines</P>
	<O>please</O>
</RULE>

You can speak any one of eight phrases to trigger the above rule: open bloglines, open bloglines please, please open bloglines, and please open bloglines please, plus the four phrases obtained by substituting go to for open. As the grammar writer, you can also require words that must be enunciated extra-clearly, embed “free form” sections in rules (for things like asking for addresses or names), and have rules reference other rules. It is a pretty powerful system, once you wrap your head around it.

So, the speech engine can now look for a small selection of phrases like iTunes pause and skip track please. The next question is, how does the system determine what action to carry out when a rule is recognized?

The original scheme, of searching the recognized text for keywords, could work. However, it’s possible to define rules that have many different possible phrases with no keywords in common. Plus, text-based searching would be a pain to keep up-to-date as the grammar evolves. Any changes would be have to be made twice: once in the grammar, and once in the search code. That’s a gross violation of the Don’t Repeat Yourself principle.

Luckily, speech grammar rules can be given a name that is passed to the JScript code when a rule is recognized. The snippet above defines a rule named bloglines. No matter which of the eight phrases you speak, the same rule name will be passed in, so the code can simply look for the rule name, instead of the spoken text.

Then the question becomes: what code should be executed for a given rule? This is, I think, the cleverest part of the system: rule names map to filenames. I used only AuotHotkey programs, but the system would execute the first file in the macros directory that started with the rule name. Specifically, my code ran dir rulename* /b to find a filename starting with the given rule name. For example, the rule bloglines would cause a script called bloglines.ahk, if it existed, to be executed. This kept the core of the JScript program very small:

function RContext_Recognition(num, pos, type, result) {
    var text = result.PhraseInfo ? result.PhraseInfo.GetText() : "not recognized";
    var name = result.PhraseInfo.Rule.Name;
    log(text + "::" + name);
    var toRun = getFileName(name);
    if( toRun != "" ) Shell.Run(toRun);
    if (RegExp("exit|good ?bye").test(text)) WScript.Quit();
}

By using AutoHotkey, I could essentially have the executed rules do anything: play sounds, open programs, move the mouse, type text, run other programs, whatever.

But back to the task that was at hand, or, perhaps, at lips: remote voice control of iTunes. At this point it was a simple matter to hook up the pieces I’d put together. I set up a rule to recognize “eye tunes delete this track” and hooked it up to a one-line AHK script that would emit the same keycode generated when I pressed the media button. And behold, it worked!

This is the part that makes me call it a Rube Goldberg system, though. Let’s trace the chain of events that would happen when I, reclining in bed with a book, said the magic words:

  1. The “zeroth” step, which happened even before I started speaking, was that my JScript would run and give the Speech API a grammar file to parse
  2. As I started speaking, the speech recognizer engine would examine the microphone-in streaming audio data. Once I’d said enough, it determined that I’d spoken a phrase that triggered a rule my JScript was interested in.
  3. The speech engine invoked a callback function from my JScript code, passing it information about the invoked rule, including rule name and recognized text.
  4. My callback logged the rule name and recognized phrase text to a dated log file.
  5. My callback searched the current working directory for a something to execute: any file that started with the rule name.
  6. Having found a file, my JScript callback passed it to the shell to execute. Since AHK files are text, not binary, this worked because AHK integrates with Explorer to interpret .ahk files with the real AutoHotkey.exe when double-clicked.
  7. The AHK file emitted a virtual scan key ({vkFFsc16D}) corresponding to the media key on my keyboard.
  8. The IntelliType software would recognize that the media key had been pressed, and dutifully invoked the iTunes-controlling JScript I had originally attached to that key.
  9. This second JScript file used iTunes’ COM automation interface to programmatically mark the playing track with delete and advance to the next track.

Complicated, eh? Using IntelliType instead of directly executing the iTunes script was just an extra dash of unnecessary intricacy to an already cobbled-together system.

And how did it perform? Well, when it worked, it worked great. I could implement new ideas in a minute or two, which was very positive reinforcement. But, sadly, even with a very restricted grammar, there were still too many false positives caused by ambient noise and song lyrics. After spending a week writing the framework, I only used it for a few days before becoming frustrated and shelving the whole project. Even without music playing, it sometimes took several attempts to have the system recognize what I was saying. And don’t get me wrong — with an omnidirectional microphone several feet from my mouth, I was basically setting up the system to fail. Still, I was somewhat disappointed that speech recognition didn’t do a better job of discriminating between my voice and music. No that it was surprising, since I was using the free and not-cutting-edge recognizer that came with the Speech SDK. Vista has a more advanced recognizer, but in later experiments, I found that even it wasn’t up to the task of dealing with a noisy room.

Oh well. The system was a lot of fun to write and see working. I’ve always been fascinated by systems with many moving parts. Having those “moving parts” be completely virtual doesn’t dampen the excitement one bit!

→ Leave a CommentCategories: Uncategorized
Tagged: ,

Google Summer of Code 2009 projects

April 21, 2009 · Leave a Comment

Boost:
Proposal for Generic Trie, Radix Tree, and Suffix Array Data Structures

Some nice Direct3D proposals for Wine.

I think Python continues to be one of the most popular projects, with 30 accepted proposals. Several proposals filed under other projects are in fact Python-related as well.

Mozilla gets some nice proposals as well: TraceMonkey register allocation, web-rsync, ECMAScript 3.1 in Rhino, and dupe detection in Bugzilla.

Mercurial gets inotify for OS X and a client for git repositories.

→ Leave a CommentCategories: Uncategorized