Sound and Complete

A jury reaches a verdict: guilty! But was she, in fact, innocent after all?

Throughout life, we must make decisions: evaluate, then accept or reject. Life being what it is, there are more ways to screw up than not. The field of formal logic gives us two properties we can use to describe any evaluation procedure, such as the jury above: soundness and completeness. A sound jury will never let a guilty person go free, while a complete jury will never send an innocent person to jail. (As an aside, I think the concrete example of a jury and its verdicts is much easier to grok than Wikipedia’s treatment, which says “Informally, a soundness theorem for a deductive system expresses that all provable sentences are true. Completeness states that all true sentences are provable.”) Intuitively, soundness and completeness are orthogonal: you can have any combination of (non-)soundness and (non-)completeness.

The ideal, of course, is sound and complete: the perfect justice system that will never exist. What we have is neither sound nor complete: innocent people are sent to jail, and guilty people go free.

Anyways, soundness and completeness are also pervasively used in static analysis and programming language type theory, each of which is a close sibling of formal logic. Well, type theory is more like a mirrored reflection of formal logic, but that’s neither here nor there. A sound type system (or analysis) is one that rejects all erroneous programs, and a complete system is one that accepts all correct programs.

Now, the interesting question is: which is more important? Is it better to have a sound but incomplete static analysis, which will report every error that exists, plus perhaps some errors that don’t? Or the opposite, complete but unsound, where every error it reports is an actual error, but some issues in the program may go unreported? I happen to think sound and incomplete is better: first, because sound analyses compose, and second, because many of the properties you really want to have verified (such as “does this rendering look right?”) don’t fall in the scope of the verification system anyway. Soundness does demand patience: you must work through all the reported errors (and deal with the wild goose chases), but once you have, you know that your program is free of all the bugs that the analysis is capable of finding. Of course, completeness is unachievable for many of the most interesting and critical problems, whereas sound-but-incomplete is well within reach.

What about for type systems? A sound type system is one that is not lenient, in that it rejects all invalid programs plus some number of valid programs. A complete type system accepts all valid programs, and some invalid ones to boot. Which to pick?
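
To make the trade-off concrete, here is a toy sketch in Python. The checked property (division by zero) and both checker functions are invented for illustration; over the tiny integer expressions shown, one checker behaves soundly and the other completely:

```python
def sound_checker(expr):
    """Sound but incomplete: reject anything containing a division.
    It never misses a division by zero, but it raises false alarms."""
    return '/' in expr  # True means "rejected"

def complete_checker(expr):
    """Complete but unsound (for these integer expressions): reject only
    a literal '/ 0'. It never rejects a correct program, but it misses
    divisions that only become zero at runtime."""
    return '/ 0' in expr

# A real bug that only appears after evaluation:
print(sound_checker('1 / (1 - 1)'))     # True: caught
print(complete_checker('1 / (1 - 1)'))  # False: missed

# A perfectly fine program:
print(sound_checker('4 / 2'))           # True: wild goose chase
print(complete_checker('4 / 2'))        # False: correctly accepted
```

Clearing every report from the sound checker guarantees the program is free of the bugs it can find; the complete checker’s silence guarantees nothing.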

The situation isn’t quite as obvious as with a static analysis, for two reasons. The first difference is that a programming language with a complete and unsound (static) type system has a fallback in the form of dynamic typing. In fact, one of the most powerful type system features, dependent typing, is not statically verifiable unless carefully restricted. Dependent type systems therefore blur the distinction between compile-time type checking and runtime type checking, and thus between static and dynamic typing. The second complication is that the arithmetic-mean programmer tends to value expressiveness rather more than compile-time verification. Programmers dislike having the computer reject a program that would have run fine, simply because the computer couldn’t make sure it would run fine without actually running it. In short, restrictive type systems drive programmers to more flexible environments. Consider how Mozilla Firefox is designed: a core of statically-typed C++ to do the heavy lifting, with dynamically-typed JavaScript running the user interface. Not coincidentally, this static-core/dynamic-shell design is shared by many large video games (World of Warcraft, for example: C++ and Lua).

The target domain matters, of course. One hopes that the language for a program controlling an airliner’s fuel supply prefers soundness over completeness. And type inference raises another issue: there exist type systems that are sound and complete, but not amenable to type inference.

Decisions, decisions…

p.s. Since I’m a visual learner, I created a reference chart:

Chart of soundness and completeness properties


Overview of Mozilla’s C++ Static Analysis Tools

Manually making pervasive changes in a four million line codebase is all but impossible. Thus, in a bid to make far-reaching changes to the internals of their code, Mozilla has been actively pursuing better static analysis tools.

It turns out that analyzing C++ is remarkably difficult; the stuff of patents and at least one doctoral dissertation. The root cause is the painfulness of parsing C++. First, C++ is simply a large language, syntactically speaking, and due to vendor extensions, there is no single point of reference for the syntax found in the real world. Second, C++ is difficult to parse with automated tools. The C++ grammar, such as it is, does not fit into any of the usual formal language classes like LL(k) or LR(k). This parsing difficulty is partly due to its C heritage. To give one simple example of a problematic construction: return (a)(b); could involve a cast of b to type a, or a function call a(b). Which one applies depends on the type of a, but in a classical compiler, type information is only available after parsing completes! Finally, preprocessor macros create a disconnect between what the user sees in a source file and what the compiler sees in a translation unit. Any analysis tool must either bridge that divide, or risk being rejected for not properly handling macros.
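
One common workaround for that ambiguity is the so-called “lexer hack”: let the parser consult the symbol table mid-parse. A minimal Python sketch of the idea (the classify function and its inputs are hypothetical, not taken from any real compiler):

```python
import re

def classify(stmt, known_types):
    """Disambiguate `return (a)(b);` using the set of type names
    declared so far: the information a classical parser lacks."""
    m = re.match(r'return \((\w+)\)\((\w+)\);', stmt)
    if not m:
        return 'unknown'
    a, b = m.groups()
    if a in known_types:
        return 'cast of %s to type %s' % (b, a)
    return 'function call %s(%s)' % (a, b)

print(classify('return (a)(b);', {'a'}))   # cast of b to type a
print(classify('return (a)(b);', set()))   # function call a(b)
```

The same statement parses two different ways depending on what has been declared, which is exactly why C++ resists classical context-free parsing.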

Mozilla runs several static analysis projects, which in turn build upon several others. The tools fall into two related but separate categories: static analysis and rewriting. The following is my understanding of the current tool space:

Dehydra is a lightweight static analysis tool for C++. It is implemented as a GCC plugin that passes C++ AST information to JavaScript code, which can examine the provided AST nodes. Among other things, Dehydra is used as the smarts behind a very cool semantic Mozilla code browser called DXR (Dehydra Cross Reference).

Treehydra is Dehydra’s heavy duty companion. It deals with GCC’s internal GIMPLE intermediate representation. This allows Treehydra analyses to be run after any optimization pass in GCC. Treehydra scripts also get access to GIMPLE control flow graphs (CFGs). Access to CFGs enables more advanced path-based analyses. In effect, anything GCC “knows” about a piece of C++ is available to JavaScript code through Treehydra.

Right now, Dehydra and Treehydra require a custom, patched version of GCC. Starting with GCC 4.5, however, plugin support (thanks largely to Mozilla’s Taras Glek!) will obviate the need for a patched build. That’s very exciting, especially because it will mean that static analysis of C++ code becomes available to orders of magnitude more programmers!

Unfortunately, the *hydras are not sufficient for the task of automated rewriting, because GCC does not preserve all the necessary syntactic information about token positions and macro expansion and such. Thus, more tools are needed.

The primary C++ rewriting tool from Mozilla is called Pork. Unlike GCC, Pork retains exact syntactic information about the code it parses, as well as the changes made during preprocessing. Pork is built on several other tools: Oink, Elsa, and MCPP. MCPP is a preprocessor that embeds annotations in its output to help reconstruct the original source file. Elsa is a C++ parser based on the GLR parser generator Elkhound. Finally, Oink is a tool for analyzing the parse trees constructed by Elsa.

Whew! I hope that information is accurate, +/- 10%.

Programs I Currently Have In My Tray

  • Skype
  • iTunes
  • Pageant
  • Twhirl
  • Yahoo! Widgets
  • NetMeter
  • ManicTime
  • ZoneAlarm
  • Logitech Mouse
  • AutoHotkey
  • Mozy Backup
  • Dropbox
  • UltraMon
  • WinSplit Revolution
  • Threeter
  • KeePass
  • Ad-Aware

That’s seventeen(!). If I had to trim the list down to seven, I’d probably choose these as the most useful in day-to-day tasks:

  • iTunes
  • Pageant
  • ManicTime
  • AutoHotkey
  • Mozy Backup
  • UltraMon
  • KeePass

Eight Reasons why VirtualBox is Awesome

  • Free and Open Source software
  • Cross Platform squared: Run Windows/Linux/Solaris guests on OS X/Windows/Linux/Solaris
  • OpenGL acceleration
  • Support for disk images created with VMWare and Virtual PC
  • Software and hardware virtualization acceleration support
  • Seamless windows for Linux, Solaris, and Windows guests
  • USB 2.0 support
  • Shared folders

Rube Goldberg Speech Macros

It’s been about a year since Microsoft released Speech Macros for Vista.

One thing I found very interesting was how similar their system seems to what I ended up creating for myself two and a half years ago: XML syntax for defining grammars, and AutoHotkey-like syntax for executable commands! Perhaps there simply aren’t that many ways to do flexible speech macros.

Anyways, the backstory: Sophomore year of college, I had more reading than usual to do for classes. I like reading and listening to music, but my bed was across the room from my desk, and I was growing tired of getting up every fifteen minutes to skip songs that weren’t conducive to reading. “I know!,” I thought. “I’ve got an omnidirectional microphone with an extra-long cord. If I string it under the rug, I can reach my bed, and control iTunes with my voice!” Great vision, but how would I go about doing such a thing?

The first step was scripting iTunes. It turns out that Apple publishes a Windows iTunes COM SDK that allows JScript (Microsoft’s version of JavaScript) to manipulate iTunes programmatically. Awesome! In fact, I had already been using this to automate an action I found myself repeating several times a day. When I heard a song I really didn’t like, I would add delete to the comments field in iTunes, and skip to the next track. Once a week or so, I would delete all the songs with delete in the comments field. It’s a simple hack that worked remarkably well. Anyways, I configured IntelliType to execute the following simple JScript when I pressed the “media” button on the keyboard. This greatly reduced the friction involved with marking songs for deletion.

var iTunesApp = WScript.CreateObject("iTunes.Application");
var currentTrack = iTunesApp.CurrentTrack;
if (currentTrack) {
	currentTrack.Comment = currentTrack.Comment + ' delete';
} // else nothing to do...

The other half of the problem was to invoke that script. Specifically, how could I invoke it with a spoken command? I considered writing a C# program, but I wanted something a little more lightweight. Once again, JScript came to the rescue! This time, it was in the form of Microsoft’s free Speech API SDK 5.1. This SDK provided a very simple speech recognition engine, as well as documentation on how to use the engine from JScript. It was, admittedly, some of the most unfriendly documentation I’d ever seen, but I managed to get up and running after a day or two. If it hadn’t been for the sample code they provided, it probably would have taken closer to a week or two. The SAPI interfaces look like this; by now, I’ve forgotten exactly what each piece does:
var RContext = new ActiveXObject("Sapi.SpSharedRecoContext"); // shared recognition context
RContext.EventInterests = interests;           // bitmask of SAPI events to deliver (built elsewhere)
WScript.ConnectObject(RContext, "RContext_");  // hook up event listeners on a by-name basis
var Grammar = RContext.CreateGrammar();        // grammar object the XML rules get loaded into
var Recognizer = RContext.Recognizer;
Recognizer.State = SRSAlwaysActive;            // keep the engine listening (SAPI state constant)

In any case, I soon had a simple JScript-driven speech-to-text echoer. It would listen for English sentences, and print to the screen whatever the engine thought it heard. JScript’s native support for regular expressions made it dead easy to search for key words and phrases in the output before it was printed. This would have been enough for my purposes, except for one major problem. Much like people seeing faces in toast, because the speech engine was expecting spoken English, it heard spoken English — even when all that the microphone was picking up was random noise from the room. Since I was using this while listening to music, song lyrics proved to be very confusing to the system as well. I ended up with a bunch of “recognized” phrases like these, or even stranger:

the DN
and newington sea and then moves
a ton
and with Ababa and being

Luckily, there’s a common, easy solution to this problem in the speech recognition world: do less. Specifically, have the engine listen for a limited set of phrases, rather than arbitrary speech. This allows the engine to be much more discriminating when deciding whether or not what it hears in the microphone is one of the few spoken phrases you asked it to listen for.

The Speech API in Windows XP (Vista had yet to appear in retail stores, and even then, I didn’t try Vista for nearly a year) includes a way of creating textual grammars for the speech engine. The main syntax looks like this, with optional phrases <O>, list-of-alternatives <L> and required phrases <P>:

  <RULE NAME="bloglines" TOPLEVEL="ACTIVE">
      <O>please</O>
      <L>
          <P>open</P>
          <P>go to</P>
      </L>
      <P>bloglines</P>
      <O>please</O>
  </RULE>

You can speak any one of eight phrases to trigger the above rule: open bloglines, open bloglines please, please open bloglines, and please open bloglines please, plus the four phrases obtained by substituting go to for open. As the grammar writer, you can also require words that must be enunciated extra-clearly, embed “free form” sections in rules (for things like asking for addresses or names), and have rules reference other rules. It is a pretty powerful system, once you wrap your head around it.
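
The eight-phrase count is easy to check mechanically. A quick Python sketch, with the rule structure transcribed from the description above:

```python
from itertools import product

leading = ['', 'please ']      # optional leading "please"
verbs = ['open ', 'go to ']    # the list of alternatives
trailing = ['', ' please']     # optional trailing "please"

# Every combination of the optional and alternative pieces:
phrases = [pre + verb + 'bloglines' + suf
           for pre, verb, suf in product(leading, verbs, trailing)]
print(len(phrases))  # 8
```

Two choices for each optional “please” times two verbs gives 2 × 2 × 2 = 8 phrases.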

So, the speech engine can now look for a small selection of phrases like iTunes pause and skip track please. The next question is, how does the system determine what action to carry out when a rule is recognized?

The original scheme, of searching the recognized text for keywords, could work. However, it’s possible to define rules that have many different possible phrases with no keywords in common. Plus, text-based searching would be a pain to keep up-to-date as the grammar evolves. Any changes would have to be made twice: once in the grammar, and once in the search code. That’s a gross violation of the Don’t Repeat Yourself principle.

Luckily, speech grammar rules can be given a name that is passed to the JScript code when a rule is recognized. The snippet above defines a rule named bloglines. No matter which of the eight phrases you speak, the same rule name will be passed in, so the code can simply look for the rule name, instead of the spoken text.

Then the question becomes: what code should be executed for a given rule? This is, I think, the cleverest part of the system: rule names map to filenames. I used only AutoHotkey programs, but the system would execute the first file in the macros directory that started with the rule name. Specifically, my code ran dir rulename* /b to find a filename starting with the given rule name. For example, the rule bloglines would cause a script called bloglines.ahk, if it existed, to be executed. This kept the core of the JScript program very small:

function RContext_Recognition(num, pos, type, result) {
    var recognized = result.PhraseInfo != null;
    var text = recognized ? result.PhraseInfo.GetText() : "not recognized";
    var name = recognized ? result.PhraseInfo.Rule.Name : "";
    log(text + "::" + name);          // log() and getFileName() defined elsewhere
    var toRun = getFileName(name);    // first file starting with the rule name
    if (toRun != "") Shell.Run(toRun);
    if (new RegExp("exit|good ?bye").test(text)) WScript.Quit();
}

By using AutoHotkey, I could essentially have the executed rules do anything: play sounds, open programs, move the mouse, type text, run other programs, whatever.
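
The getFileName helper isn’t shown above, but it amounts to a prefix match in the macros directory (the JScript shelled out to dir rulename* /b). A hypothetical Python equivalent of that lookup:

```python
import glob
import os

def get_file_name(rule_name, macro_dir='.'):
    """Return the first file whose name starts with the rule name,
    or '' if none exists (mimics `dir rulename* /b`)."""
    matches = sorted(glob.glob(os.path.join(macro_dir, rule_name + '*')))
    return matches[0] if matches else ''
```

So the rule bloglines finds bloglines.ahk, which the shell then hands off to AutoHotkey.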

But back to the task that was at hand, or, perhaps, at lips: remote voice control of iTunes. At this point it was a simple matter to hook up the pieces I’d put together. I set up a rule to recognize “eye tunes delete this track” and hooked it up to a one-line AHK script that would emit the same keycode generated when I pressed the media button. And behold, it worked!

This is the part that makes me call it a Rube Goldberg system, though. Let’s trace the chain of events that would happen when I, reclining in bed with a book, said the magic words:

  1. The “zeroth” step, which happened even before I started speaking, was that my JScript would run and hand the Speech API a grammar file to parse.
  2. As I started speaking, the speech recognition engine would examine the audio streaming in from the microphone. Once I’d said enough, it determined that I’d spoken a phrase that triggered a rule my JScript was interested in.
  3. The speech engine invoked a callback function from my JScript code, passing it information about the invoked rule, including rule name and recognized text.
  4. My callback logged the rule name and recognized phrase text to a dated log file.
  5. My callback searched the current working directory for a something to execute: any file that started with the rule name.
  6. Having found a file, my JScript callback passed it to the shell to execute. Since AHK files are text, not binary, this worked because AHK integrates with Explorer to interpret .ahk files with the real AutoHotkey.exe when double-clicked.
  7. The AHK file emitted a virtual scan key ({vkFFsc16D}) corresponding to the media key on my keyboard.
  8. The IntelliType software would recognize that the media key had been pressed, and dutifully invoked the iTunes-controlling JScript I had originally attached to that key.
  9. This second JScript file used iTunes’ COM automation interface to programmatically mark the playing track with delete and advance to the next track.

Complicated, eh? Using IntelliType instead of directly executing the iTunes script was just an extra dash of unnecessary intricacy to an already cobbled-together system.

And how did it perform? Well, when it worked, it worked great. I could implement new ideas in a minute or two, which was very positive reinforcement. But, sadly, even with a very restricted grammar, there were still too many false positives caused by ambient noise and song lyrics. After spending a week writing the framework, I only used it for a few days before becoming frustrated and shelving the whole project. Even without music playing, it sometimes took several attempts to have the system recognize what I was saying. And don’t get me wrong — with an omnidirectional microphone several feet from my mouth, I was basically setting up the system to fail. Still, I was somewhat disappointed that speech recognition didn’t do a better job of discriminating between my voice and music. Not that it was surprising, since I was using the free and not-cutting-edge recognizer that came with the Speech SDK. Vista has a more advanced recognizer, but in later experiments, I found that even it wasn’t up to the task of dealing with a noisy room.

Oh well. The system was a lot of fun to write and see working. I’ve always been fascinated by systems with many moving parts. Having those “moving parts” be completely virtual doesn’t dampen the excitement one bit!

Google Summer of Code 2009 projects

Proposal for Generic Trie, Radix Tree, and Suffix Array Data Structures

Some nice Direct3D proposals for Wine.

I think Python continues to be one of the most popular projects, with 30 accepted proposals. Several proposals filed under other projects are in fact Python-related as well.

Mozilla gets some nice proposals as well: TraceMonkey register allocation, web-rsync, ECMAScript 3.1 in Rhino, and dupe detection in Bugzilla.

Mercurial gets inotify for OS X and a client for git repositories.

Netcraft Confirms: I Live In Browsers

In October of last year, about five months ago to the day, I installed a new app on my computer. A post from LifeHacker had alerted me to a free service called JournalLive. The premise of JournalLive is that you install their client, which monitors user activity on your computer — when you’re on, when you’re off, what apps you use, and for how long.

The service has some downsides, especially for someone like me who essentially uses only one computer. In particular, JournalLive stores all user history on their web servers. This has obvious security/privacy issues. It also means that examining the data tends to be slow and unpleasant, and I don’t see any way to export data for doing custom analyses. JournalLive also seems not to record stats if it cannot connect to the JournalLive servers, which has resulted in several hours of missed log time. Thus I’ve ditched JournalLive in favor of the much prettier, more reliable, and purely-local app ManicTime.

A few interesting points on my usage over the five-month period from 2008/10/21 to 2009/03/20:

  • Total time in Firefox: 412 hours
  • Total time in Chrome: 101 hours
  • Total time in PuTTY: 87.5 hours
  • Total time in VMWare/VirtualBox/VPC: 76 hours
  • Total time in Prism: 18 hours
  • Total time in WinShell: 13.5 hours
  • Total time in Safari: 8 hours
  • Total time active: 950 hours

More than half the time I spend at my computer is in Firefox. Cool! JournalLive also claims I spent 23 hours each on reddit and Google (Reader); those both seem low, especially the Google Reader figure.

Promoting a Subversion Path to Standalone Repository

One of my jobs today required creating a Subversion repository out of a subset of a larger repository. There are many reasons why this might be useful. In my case, it was to prepare a clean repository for conversion to another version control system. The original repository contained useful checkins in one particular directory, and useless checkins elsewhere. Since a quick Googling doesn’t bring up anything useful, I figured I’d document what I did. Most of this information can be found scattered through The Subversion Book, provided you know where to look.

We’ll suppose that ORIG is a directory containing the original repository, and
NEW is the smaller repository containing only the directory DESIRED/DIR from the original repository. The goal is to have the repository NEW contain only the portion of the repository originally under (not including) DESIRED/DIR/.

Step 1 is to dump the original repository, preserving only the desired commits. This results in a dump file called NEW.svnrepo:

svnadmin dump ORIG | svndumpfilter include DESIRED/DIR --drop-empty-revs --renumber-revs > NEW.svnrepo

Step 2 is to back up the dump file in case the editing in step 3 goes wrong:

cp NEW.svnrepo NEW.svnrepo.copy

Step 3 is to edit the NEW.svnrepo file. This is the tricky step!
You should see something like this at the start of the dumpfile:


Node-path: calc
Node-action: add
Node-kind: dir
Content-length: 0


You want to delete this entire node record: the one that adds the top-level directory itself (shown above with the Subversion Book’s example path, calc), since after the prefix-stripping below it would otherwise try to re-create the repository root.
See also the section from the Subversion Book about repository filtering.

But you’re not done yet! You also want to remove the DESIRED/DIR/ prefix when it appears after Node-path: or Node-copyfrom-path:. What you do NOT want to do is a simple search and replace, because this will corrupt any files that contain that path, and corrupted files cause the repository import script to abort.

A sequence of vim substitution lines like the following should do the trick. The first two handle references to children of the desired directory. The last one handles property changes made on the desired directory itself.

:%s/Node-path: DESIRED\/DIR\//Node-path: /
:%s/Node-copyfrom-path: DESIRED\/DIR\//Node-copyfrom-path: /
:%s/Node-path: DESIRED\/DIR/Node-path: /
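
The same edit can be sketched in Python. This is a hypothetical line filter for illustration only (real dump files contain binary file content, so don’t actually process them line-by-line); the point is that the substitution must anchor on the header fields rather than blindly replacing the path everywhere:

```python
def strip_prefix(line, prefix='DESIRED/DIR/'):
    """Strip the path prefix, but only on Node-path / Node-copyfrom-path
    header lines, never inside versioned file content."""
    for field in ('Node-path: ', 'Node-copyfrom-path: '):
        # Children of the desired directory: drop the prefix.
        if line.startswith(field + prefix):
            return field + line[len(field) + len(prefix):]
        # Property changes on the desired directory itself.
        if line.rstrip('\n') == field + prefix.rstrip('/'):
            return field + '\n'
    return line
```

File content mentioning DESIRED/DIR/ passes through untouched, so the dump stays loadable.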

Step 4 is to create a new, empty repository:

svnadmin create NEW

Step 5 is to load the new repository from your edited dump file:

svnadmin load NEW < NEW.svnrepo

Step 6 is to import the repository from svn to bzr, using one of:

  • bzr svn-import NEW bzr-NEW
  • bzr2svn NEW.svnrepo bzr-NEW

Step 7 is to remove the dump files:

rm NEW.svnrepo*

In my particular case, getting the new repository was a means to an end. Since I was moving to Bazaar, using specialized conversion tools was an option. My findings:

  • svnadmin verify on the new repository from above works.
  • bzr svn-import on the new repository fails on Windows with shiny new bzr (1.12) and bzr-svn (0.5.2)
  • bzr svn-import on the new repository works on Ubuntu with old and rusty bzr (1.6.1) and bzr-svn (0.4.13)
  • bzr2svn on the new repository fails on Fedora with even older bzr (1.3.1)
  • bzr2svn on the ORIGINAL repository works on Fedora with even older bzr (1.3.1)

There’s not much rhyme or reason to what works and what doesn’t, so far as I can tell. With 20/20 hindsight, the easiest thing to do would have been to use bzr2svn on Linux from the get-go, and avoid svn-import on Windows. bzr2svn is especially nice because it has a --prefix option to import only a portion of a repository. That eliminates the need to muck around with svnadmin and svndumpfilter in the first place.

But what’s done is done, and my pain is your gain!

An Unexpected Side Effect of Combining C++ Features

So, a pop quiz. Suppose you write a C++ program using class B provided in a library, like so:

#include "some-header.h"
int main() {
  B* p = new B();
  return 0;
}

The snippet above compiles and links just fine. So the snippet below should build too, right? Mind you, the base class is not doing anything sneaky. In particular, it does not have a private destructor, or anything else designed to interfere with derivation.

#include "some-header.h"
class D : public B {};
int main() {
  D* p = new D();
  return 0;
}

The answer to the quiz is that this may or may not build. It will compile, but you could get unresolved symbol errors during linking. How and when, you ask? It happens when you’re linking against a dynamically linked library (at least on Windows) and B has a protected virtual method not exported in the DLL. The explanation (as far as I can reason) hinges on the fact that the .lib for a DLL contains only the symbols marked for export. When you’re using the class directly, your compilation unit can’t access the protected virtual symbol, so the linker doesn’t look for it. But when you use the derived class you just defined, well, D can now access that protected virtual method in B, so the linker needs to find the method’s symbol. Since the symbol is not there, boom.

This situation involves no less than four distinct features coming together for the express purpose of saddening you: separate compilation, method access, inheritance, and custom symbol visibility.

C/C++ Preprocessor Macro Idioms

I’ve been tormented the past few days about an article I read several years ago. I barely even remember what it was about, only that the author played very clever tricks with the preprocessor’s token pasting functionality to push the limits of what the preprocessor can do. Maybe it was a parameterized repetition macro? Something like that. I do remember it took advantage of the asymmetry in expansion between function-like and object-like macros.

Anyways, I’ve been poking around the (very nice!) Chromium source code. One thing I took a close look at is how Google organized their logging macros. In the process, I noticed two tricks they used that I think are effectively preprocessor idioms, and since Google doesn’t turn up much for that search phrase, I figured I’d document what I saw.

The external interface of the Google logging system is pretty simple: the LOG macro takes a severity parameter, such as INFO or ERROR, and expands into a temporary object that derives from std::ostream. The tricks have to do with the severities: the tokens used as severities aren’t #defined to the preprocessor! Instead, whatever is passed in is pasted onto a longer token that expands to an object declaration with the desired parameters. This lack of actual #define statements effectively gives the severity tokens a specific local scope, even though the preprocessor deals only with the global scope. And the token-pasting trick effectively forms a preprocessor-style switch() construct, where the macro parameter can affect the expansion of the rest of the macro.

Incidentally, now that I look, the Boost preprocessor library uses the same switch() trick to implement BOOST_PP_IF. Nifty.

Installing Mercurial

Mercurial’s pretty easy to install, actually. I keep per-system installations under ~/sw/local instead of ~/.

$ # grab latest release from
$ # add PATH=$PATH:~/sw/local/bin to .bashrc (or equiv)
$ # add PYTHONPATH=$PYTHONPATH:~/sw/local/lib/python to .bashrc (or equiv)
$ wget
$ tar xzvf mercurial-1.2.tar.gz
$ cd mercurial-1.2
$ make install-home HOME=`echo ~/sw/local`
$ cd ..
$ rm -rf mercurial-*

The J Programming Language, also known as Oh God My Eyes Are Bleeding

So, back towards the beginning of the semester, we had a simple assignment in Data Compression: compute pixel differences for an image file. The differences will be defined “going backwards” — the i’th difference will be the i’th value, minus the (i-1)th value (as opposed to the (i+1)th). So, including a “virtual” pixel value of 128 at the beginning, a list of numbers such as

0 1 4 4 3 10

would be transformed into the list

-128 1 3 0 -1 7

Because these are really 8-bit unsigned char values, this is equivalent to the list 128 1 3 0 255 7.

In a mainstream, Algol-derived language like C or Python, this problem has a simple, obvious solution that is intelligible even to programmers who don’t know the language. Something like this:

import sys

if len(sys.argv) != 2:
  print "Error! Must give an input filename."
  print "\tUsage: python in_file"
  sys.exit(1)

infile_name = sys.argv[1]
outfile_name = infile_name + ".diff"

infile = open(infile_name, "rb")
data = infile.read() # Read all bytes from file

data = chr(128) + data # Prepend pseudo-pixel for diff. calculation

# Calculate the differences between each pixel in the data
diff_at = lambda i: ord(data[i]) - ord(data[i-1])
diff = [chr( diff_at(i) % 256 ) for i in range(1, len(data))]

outfile = open(outfile_name, "wb")
outfile.writelines(diff) # Write the difference values to the output file

This Python program is short and simple. It has six blocks, half of which are just one line. For the non-programmers who might be interested in commentary (hi Joe!): The first “big” block ensures that we have an input file to read values from and a file to write to; the second block just reads the values in. The line data = chr(128) + data adds the value 128, our virtual pixel, to the beginning of the list. The next line is a direct translation of our definition of a difference pixel, given above, into Python syntax. The trickiest line is the one defining diff, because it makes use of a list comprehension. From the inside out, chr( diff_at(i) % 256 ) computes the i’th pixel difference, modulo 256 (since a single byte can only hold the values 0 through 255). That computation is repeated for every pixel in the file, and the resulting list is assigned to diff. Finally, the last line writes the differences to the output file.

So that’s nice: simple, clean, easy, boring. I wanted something more… esoteric.

Usually, when a computer science class gets an assignment, the instructor picks the language the students use, but for this assignment, we were given free rein to pick any language we wanted to get the job done. Since this particular professor puts a lot of emphasis on program readability, I thought it would be amusing to see the expression on his face when presented with a valid solution in a completely unintelligible language. But which language to use?

Perhaps the most amusing might have been a language called Whitespace (in which a valid solution would be a blank sheet of paper… or maybe two blank sheets of paper), but that would have been pushing it, even for me. But then I remembered reading something about J, somewhere. I thought it had been cited in an exhortation from Steve Yegge for programmers to learn a wider variety of languages, but I can’t seem to find it now.

So, J.

Here’s the equivalent program in J:


Fun, eh?

Here’s an easier-to-understand version:

infile  =: 2}ARGV            NB. grab command line args
outfile =: 3}ARGV
chars   =: 1!:1 infile       NB. 1!:1 means read contents of file
nums    =: a. i. chars       NB. convert chars to ascii indices
diff    =: 2 -~/\ ]          NB. uhh...
diffs   =: 256|diff 128,nums
(diffs {a.) 1!:3 outfile

Okay, so that’s not THAT bad for the first few lines. Sure, 1!:1 is a pretty atrocious syntax for reading in file contents, but we can look beyond that for now. a. i. chars is actually sort of cool. a. is a table of the ASCII characters, chars is a list of the characters from the file, and i. is (in this context) the index-of operator. It works like this:

'abcde' i. 'bed'
1 4 3
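As a sanity check, the same index-of lookup is easy to mimic in Python, with str.index standing in for J’s i.:

```python
# For each character of 'bed', find its position in 'abcde'
print(['abcde'.index(c) for c in 'bed'])  # → [1, 4, 3]
```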

In essence, a. i. is the J equivalent of Python’s ord function, but built out of a more fundamental operator. Cool enough to forgive the, er, terse syntax. But what the heck is the next line?!? diff =: 2 -~/\ ] — are you kidding me?

Actually, it’s straight out of the J vocabulary reference. Sentences in J read right-to-left. The ] is an identity operator; it simply selects whatever comes to its right. In essence, here it stands in for the thing being diffed, much like the word “it” itself. The \ character means infix, where x is 2, u is -~/, and y is ]. I’ll cover u in a second, but first, a quick illustration of \. Suppose you wanted to select every successive pair of elements from the list 1 2 3 4. Then u would simply be the identity function ], like this:

2 ] \ 1 2 3 4
1 2
2 3
3 4

So what does -~/ mean? - is the subtraction operator, simple enough. / is the insert operator, so +/ 1 2 3 is 1+2+3, and -/ 1 2 is 1 - 2. But note that we don’t want 1 - 2, we want 2 - 1. That’s what ~ does: it swaps arguments, so 1 -~ 2 is the same as 2 - 1. Phew!
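Putting the pieces together: 2 -~/\ slides a width-2 window over the list and applies swapped subtraction to each pair. A rough Python rendering (the name pairwise_diff is my own) might look like:

```python
def pairwise_diff(xs):
    # J's 2 -~/\ : for each adjacent pair (a, b), compute b - a
    return [b - a for a, b in zip(xs, xs[1:])]

print(pairwise_diff([128, 130, 127]))  # → [2, -3]
```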

256| means compute values mod 256. { a. is the equivalent of the chr function in Python, converting integers back to characters. And, finally, 1!:3 prints.

Simple and intuitive, eh?

Here’s another example. The task is to take a binary file and figure out the Huffman codewords encoded therein. The file format is 256 words of 4 bytes, followed by 256 sets of 1-byte lengths. Each length gives how many of the low-order bits from the corresponding word are part of the Huffman codeword.

Python, I was pleased to see, has a module called struct that is built for doing exactly this kind of bit-level interpretation. Given a string of 4 chars, the value of those chars as an integer can be had with struct.unpack("l", chars). Cool! Unfortunately, Python doesn’t have built-in libraries for converting integers to binary strings. The end program ended up being about 25 lines, not including trivial things like file input.
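For concreteness, here is a sketch of that parsing step; the "<256I" format string (little-endian unsigned words) and the helper name read_codewords are my assumptions, since the post doesn’t show its code, and modern Python’s format(w, "b") now does the integer-to-binary conversion that had to be written by hand back then:

```python
import struct

def read_codewords(data):
    # The format described above: 256 four-byte words,
    # followed by 256 one-byte lengths.
    words = struct.unpack("<256I", data[:1024])
    lengths = data[1024:1280]
    # The low-order n bits of each word form the Huffman codeword.
    return [format(w, "b").zfill(n)[-n:] if n else ""
            for w, n in zip(words, lengths)]

# Toy input: word i holds the value i, every codeword is 8 bits long.
blob = struct.pack("<256I", *range(256)) + bytes([8] * 256)
print(read_codewords(blob)[9])  # → 00001001
```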

J fares rather better. Negative numbers passed to the infix operator give non-overlapping infixes, perfect for splitting our list of bytes into chunks of 4.

_3 ]\ 'abcdefghi'
abc
def
ghi
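The Python counterpart is a bit wordier, slicing the string into fixed-size chunks by hand:

```python
s = 'abcdefghi'
print([s[i:i + 3] for i in range(0, len(s), 3)])  # → ['abc', 'def', 'ghi']
```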

And J has built-in operators for converting to and from binary representation of integers. So, given a list of four integers, we can select the low, oh, four bits like this:

(-4) {. , #: 0 0 0 9
1 0 0 1

That really speaks to the conciseness of J, I think. From 25 lines of Python to one line of J.

J is like concentrated Perl, with all the sugar evaporated out. Honestly, it makes my brain hurt to look at J for too long.

Summer of Code 2008 Projects

So, the Summer of Code 2008 projects were posted a few days ago.

A few of the ones I think are particularly interesting:

PHP: Zend LLVM Extension
Vim: On-the-fly Code Checker, Visual Studio 2005/2008 Plugin
Python: Django on the JVM
Mercurial: Rebasing, svn tools, partial cloning, TortoiseHG for Linux
LLVM: C++ classes support, llvm/clang distcc, STM support
GCC: Windows Improvements, C++0x lambda functions
Boost: Multi-core/SIMD optimizations for Boost::Math and Boost::Graph
Cairo: HDR image surface type
Portland State University: A System for Patent Categorization and Analysis
X.Org: GPU-Accelerated Video Decoding

Other thoughts: Gosh, Python and KDE have a lot of projects! Looks like Cython and NumPy are popular Python subprojects. Mercurial is four for four in terms of interesting-to-me projects. Five of eight Linux projects are printing-related.

Good luck to everyone working on it this year!