Coroutine Performance

I was curious this afternoon about how fast (or slow) coroutines can be. So I wrote a small microbenchmark in C, using libcoro’s  handwritten-assembly implementation of stack switching for symmetric coroutines.

The benchmark builds a complete binary tree of configurable size, then visits it twice. First, a conventional recursively-written function takes the tree root and a function pointer, and calls the function pointer to process each value stored in the tree.

The benchmark next re-visits the tree with the iteration encapsulated by a Lua-style asymmetric coroutine class, which yield()s tree node values back to the main routine.

One interesting operational consequence of doing the recursive visiting inside the coroutine is that the main program stack remains at constant depth for every processed node. This constant-stack-space property is also an effect of CPS transformations – not surprising, since both continuations and coroutines are general control abstractions.

Anyways, I found that when compiled with –O2 (gcc 4.4.3) and run on a Core 2, the overhead of using resume()/yield() compared to call/return was about 110 cycles per yield(). That cost is on par with loading a single word from RAM. Stack switching is an impressively efficient way of implementing a feature that lends significant flexibility and power to a language!

ANTLR Grammar Tip: LL(*) and Left Factoring

Suppose you wish to have ANTLR recognize non-degenerate tuples of expressions, like (x, y) or (f(g), a, b) but not (f) or (). Trailing commas are not allowed. The following formulation of such a rule will likely elicit a complaint (“rule tuple has non-LL(*) decision due to recursive rule invocations reachable from alts 1,2; …”):

tuple : '(' (expr ',')+ expr ')' ;

ANTLR’s error message seems unhelpful at first glance, but it’s trying to point you in the right direction. What ANTLR is saying is that after it reaches a comma, it doesn’t know whether it should expect expr ')' or expr ',', and parsing expr could requiring looking at an arbitrary number of tokens. ANTLR recommends left-factoring, but the structure of the rule as written doesn’t make it clear what needs to be factored. Re-writing the rule to make it closer to ANTLR’s view makes it clearer what the problem is:

tuple_start : '(' tuple_continue;
    :    ( expr ',' tuple_continue
    |      expr ')');

Now it’s more obvious where the left-factoring needs to be applied, and how to do it by writing expr once, leaving only the choice of seeing a comma or a close-paren:

tuple_start : '(' tuple_continue;
    :    expr ( ',' tuple_continue
              | ')'  );

The rewritten expansion corresponds to an alternative structuring of the tuple rule that will satisfy ANTLR:

tuple : '(' expr (',' expr)+ ')';

The ANTLR wiki has a page on removing global backtracking from ANTLR grammars, which has a nice presentation of how to do left-factoring when the opportunity is obvious.


I was reviewing Ras Bodik’s slides on parallel browsers and noticed in the section on parallel parsing that chart parsers like CYK and Earley are theoretically O(n^3) in time, but in practice it’s more like O(n^1.2).

I wondered: wait, what does O(n^1.2) mean for practical input sizes? Obviously for large enough n, any exponential will be worse than, say, O(n log n), but what’s the cutoff? So, I turned to the handy Online Function Grapher and saw this:

n^1.2 vs n lg n

That is, for n less than 5 million, n^1.2 < n lg n, and they remain close up to about n = 9 million. So if Bodik’s estimate holds, chart parsing could indeed be a viable parsing strategy for documents of up to several megabytes. Happily, this limit size matches the current and expected sizes of web pages to a T.

The blue line at the bottom shows linear growth. It’s all too easy to think of linearithmic growth as being “almost as good” as linear growth. To quote Jason Evans discussing how replacing his memory allocator’s log-time red-black tree operations with  constant-time replacements yielded appreciable speedups:

“In essence, my initial failure was to disregard the difference between a O(1) algorithm and a O(lg n) algorithm. Intuitively, I think of logarithmic-time algorithms as fast, but constant factors and large n can conspire to make logarthmic time not nearly good enough.”

Thirty Inches a Year

Next week is the one-year anniversary (if you will) of my purchase of a Dell 3007WFP-HC monitor. I snagged a refurbished model for about 60% of the going retail price.

While I’m overall happy with the purchase, and do not regret it, the pragmatics of such a large monitor have turned out to be not quite as rosy as I had initially expected.

The good:

  • With Chrome’s minimal UI, I can fit two vertically-maximized PDFs on the screen at once, or three if I squeeze them down to “merely” life-size. (The screen is about 25% taller and 3 times wider than a sheet of 8.5 x 11” paper). Making it easier to read research papers was one of my reasons for getting the screen, and this aspect has worked out nicely.
  • Eclipse and Visual Studio will eat as much screen space as you can give them. Having a 30” monitor means getting two pages of code side-by-side instead of just one. A portrait-mode 24” monitor at 1200 x 1920 can show one page with 100 vertical lines of code; 2560 x 1600 shows two pages with 80 vertical lines of code.
  • Two displays are easier to drive from one graphics card than three.  This means that a thirty inch monitor plus one is easier to set up than three smaller displays.

The bad:

  • Managing four million pixels turns out to be much more difficult than managing two million pixels. Because I have so much space, I tend to be less disciplined about keeping fewer windows open. I have about 150 Chrome tabs spread across 15 top level windows, plus six PuTTY sessions and six Explorer windows. I sometimes feel like I need a machete to hack my way through the jungle I created. There are apps that try to help keep windows aligned and orderly, but I haven’t managed to use them successfully.
  • It’s hard to find wallpaper for 30” displays!
  • The sweet spot for monitor prices on a pixels-per-dollar scale is (still) at 24” monitors. For less than the price of a heavily discounted refurbished 30” monitor, you can get two 24” monitors, and put one or both in portrait mode. For the price of a new 30” display, you can get a smaller monitor or two and a desktop or laptop to go along with it…

Would I recommend a 30” monitor to fellow programmers? I’m not sure. Two 24” monitors, or a 27” and a 24”, might make more sense. It depends on your personal opportunity cost for the extra premium over a nice 24” monitor. That extra money could be a few extra cores and double the RAM, or a dedicated box for running tests, or a nice suit and wool pea coat.

All in all, the upgrade from 20” to 24” made a much bigger difference than the upgrade from 24” to 30”. Make of that what you will.

Bootstrapping Compilers and T-diagrams

I came across a very nice notation in the book Basics of Compiler Design that greatly clarified the various choices for bootstrapping a compiler. The notation was originally created by Harvey Bratman in 1961 (!) , but every other reference I’ve found on the subject is just terrible for conveying understanding. Mogensen’s presentation, on the other hand, is refreshingly clear.

The rules for T-diagrams are very simple. A compiler written in some language “C” (could be anything from machine code on up) that translates programs in language A to language B looks like this (these diagrams are from Torben Mogensen’s freely-available book on compilers):

Now suppose you have a machine that can directly run HP machine code, and a compiler from ML to HP machine code, and you want to get a ML compiler running on different machine code P. You can start by writing an ML-to-P compiler in ML, and compile that to get an ML-to-P compiler in HP:

From there, feed the new ML-to-P compiler to itself, running on the HP machine, and you end up with an ML-to-P compiler that runs on a P machine!

The T-diagram notation can also easily be extended with interpreters, which are simply vertical boxes that can be inserted between any two languages. For example, here’s a diagram depicting a machine D running an interpreter for language C that is, in turn, running a compiler in language C to translate a program from language A to language B:

After thinking about bootstrapping in an ad-hoc way, I just love the structure that these diagrams provide! Unfortunately, while the book itself is freely available online, there’s no HTML version of the book, only a PDF version. The easiest way to see the full presentation is probably to use Scribd and search for “t-diagram”.

A Snippet of Functional C

MurmurHash2 is a pretty nifty bit of code: excellent performance in both the statistical and chronological senses, and under the ultra-liberal MIT license.

The kernel of the hash function comprises these 5 lines of C code. k is the next four bytes to be hashed, m and r are experimentally-found constants, and h is the accumulated hash value:

word mrmr_step(word k,
        word h, word m, word r) {
/*1*/ k *= m;
/*2*/ k ^= k >> r;
/*3*/ k *= m;
/*4*/ h *= m;
/*5*/ h ^= k;
      return h;

Like much low-level systems code in C, this takes advantage of mutability: the value of k in line 3 is not the same as k in line 5. But what if you wanted to do this computation without mutability, or even assignment? Even in the land of C, the rules, they are a-changin’ [pdf].

At first blush, it seems strange to do first-order functional programming in a language like C, which doesn’t provide much support for it. But it’s precisely because there’s so little magic to C that it’s such a good vehicle for seeing how things work under the covers. The real point of this post is not how to write functional code, as to explore how to compile functional code.

Anyways, back to the code. Mutability and immutability are sometimes interchangeable; for example, a += 3; a *= 5; a -= 1; does the same thing as a = ((a + 3) * 5) - 1. The key is the linear dependency chain: each mutated a value is used in precisely one place.

Examining the MurmurHash2 kernel above, most of the dependencies are linear. But there’s one branching dependency, from line 1 to line 2. There are two C-like ways of dealing with this. First, we could "copy" line 1 into line 2:

k = (k*m) ^ ((k*m) >> r);

The second way would be to introduce a new variable:

int km = k * m; km ^= km >> r;

But option 1 breaks down if line 1 has side-effects, and line 2 introduces a new assignment, which we wanted to avoid!

The trick to see is that functions provide variable bindings. Thus, we can translate the C code to functional pseudocode like so:

/*1*/ bind km  to k * m in
/*2*/ bind kx  to km ^ (km >> r) in
/*3*/ bind kxm to kx * m in
/*4*/ bind hm  to h * m in
/*5*/ hm ^ kxm;

Now, in this static single assignment form, it’s super easy to see what can be safely inlined; they are bindings that are only used once, such as hm in line 4. So we might optimize this to:

/*1*/ bind km to k * m in
/*2*/ bind kx to km ^ (km >> r) in
/*5*/ (h * m) ^ (kx * m);

I’ve left line 2 alone for clarity, but it’s easy to see that it could be trivially substituted into line 5. Now, how to translate back to C? In a language with anonymous functions and closures, it’s easy. Just like Scheme translates let into lambda, we could translate bind varname to arg in expr to JavaScript like (function (varname) { return expr; })(arg). So let’s do that to start:

(function          line2(km) {
  return (function line5(kx) {
/*5*/  return (h*m) ^ (kx*m);
/*2*/ })(km ^ (km >> r));
/*1*/ })(k * m);

Note that the code now reads, in some sense, bottom-to-top. Notice also that we’ve created closures, because the line marked 5 references h, which is defined in a parent scope (not shown). So this code isn’t valid C, because C does not have first-class functions. Instead, all C functions must be declared at the top level. Luckily for us, translating the above code to C is easy. We move the function definitions to the top level, and add parameters for their free variables. (This process is called closure conversion, or sometimes lambda lifting.) We’re left with this purely functional C code:

word line1(word km, word r) {
  return km ^ (km >> r);
word line5(word h, word kx, word m) {
  return (h * m) ^ (kx * m);
word mrmr_step_f(word k, word h,
                 word m, word r) {
  return line5(h, line2(k*m, r), m);

This is the code that would be generated by a compiler, given input like the 3-line pseudocode from above. Now, the interesting bit is: how efficient is this version, compared to the original? Since this is just C, we can use gcc‘s -S flag to compile both versions to optimized assembly and compare:

        imull   %edx, %edi
        movl    %edi, %eax
        sarl    %cl, %eax
        xorl    %edi, %eax
        imull   %edx, %eax
        imull   %esi, %edx
        xorl    %edx, %eax

        imull   %edx, %edi
        imull   %edx, %esi
        movl    %edi, %eax
        sarl    %cl, %eax
        xorl    %edi, %eax
        imull   %edx, %eax
        xorl    %esi, %eax

Pretty cool! The assembly from the purely functional version is essentially identical to the low-level imperative code. And a modern out-of-order CPU will ensure that the dynamic execution is, in fact, identical for both versions.

Update 2010/02/11

There are two ways of handling the “extra” arguments added closure conversion. The way above, where free variables become parameters, is easy to do manually, and clearer for small numbers of parameters. The alternative is to create a separate structure holding the free variables, and pass a pointer to the structure. GCC is strong enough to compile this example to the same assembly as well. Not bad!

struct mrmr_step2_env { word r; };
static inline word
step2(mrmr_step2_env* env, word km) {
  return km ^ (km >> env->r);

struct mrmr_step5_env { word h; word m; };
static inline word
step5(mrmr_step5_env* env, word kx) {
  return (env->h * env->m) ^ (kx * env->m);

word mrmr_step_f(word k, word h,
                 word m, word r) {
  mrmr_step2_env e2; e2.r = r;
  mrmr_step5_env e5; e5.h = h; e5.m = m;
  return step5(&e5, step2(&e2, k*m));

C++ was a victim of its own success

I came across an interesting paragraph when reading Bjarne Stroustrup’s 15-year retrospective, Evolving a language in and for the real world: C++ 1991-2006. This notion hadn’t occurred to me before, but it rings quite plausible.

There was also a curious problem with performance: C++ was too efficient for any really significant gains to come easily from research. This led many researchers to migrate to languages with glaring inefficiencies for them to eliminate. Elimination of virtual function calls is an example: You can gain much better improvements for just about any object-oriented language than for C++. The reason is that C++ virtual function calls are very fast and that colloquial C++ already uses non-virtual functions for time-critical operations. Another example is garbage collection. Here the problem was that colloquial C++ programs don’t generate much garbage and that the basic operations are fast. That makes the fixed overhead of a garbage collector looks far less impressive when expressed as a percentage of run time than it does for a language with less efficient basic operations and more garbage. Again, the net effect was to leave C++ poorer in terms of research and tools.

— Bjarne Stroustrup

For a peek at an example of “glaring inefficiencies,” check out what two IBM researchers have found about memory use in Java, and prepare to throw up in your mouth a little.

Building protobuf with MinGW GCC 4.4.0

So, after tearing my hair out for an hour this morning, I finally managed to build protobuf using MinGW and GCC 4.4.0, using the msys install from mozilla-build 1.4.

The two major sticking points are

  1. mozilla-build includes autoconf 2.59, whereas protobuf requires autoconf 2.61 at minimum. Solution: download the msys autoconf 2.63 package, and install to /usr/local/bin/autoconf-2.63/, and temporarily export a prefixed $PATH for building protobuf.
  2. protobuf attempts to link against /c/mozilla-build/msys/mingw/lib/gcc/mingw32/4.4.0/libstdc++.dll.a, which fails because that file doesn’t exist. Solution: edit the file in that directory, replacing libstdc++.dll.a with libstdc++.a in the definition for library_names

Introducing Canvas Cacheviz

Eric Dramstad and I have been working on a new canvas-based tool. This time, it’s a cache visualizer.

Here’s what a naive matrix transposition looks like:naive-transpose

On the left is the sequence of memory touches; black pixels are touched first, and white pixels are touched last. The left side shows that each row of the upper triangle is accessed in tandem with the corresponding column of the lower triangle. On the right are the cache hits and misses. The upper triangle has a cache-friendly sequential access pattern, while the lower triangle has a cache-unfriendly strided access pattern. 56% of the memory accesses result in a cache miss.

In contrast, here’s a cache-oblivious matrix transposition:cache-oblivious-transpose

The left side shows the divide-and-conquer approach. Small blocks are transposed rather than whole rows and columns, which helps with cache locality, as reflected on the right. In this instance, only 28% of the memory touches were cache misses, a significant improvement over the naive algorithm.


There are several improvements I’d eventually like to make. One major shortcoming is that the cache model is only approximately like the cache system found in your average x86 chip. We currently only model a single-level fully-associative cache. We should eventually refine that model to a two-level set-associative cache, of configurable level sizes. I’d also like to show the cache effects of various sorting algorithms, like one-pivot quicksort versus insertion sort versus two-pivot quicksort.

The source code for canvas-cacheviz is hosted on GitHub. So far my experience with git has been negative in comparison to Mercurial. Git may be more powerful, but it has been all too easy to do strange and unwanted things with the repository DAG. And even simple questions like “what’s the equivalent of svn revert?” were difficult to answer using the tool’s help listing.

Summer 2009: Time and Books

When I started working this summer, I discovered three things, with varying levels of surprise:

  1. I had significantly less free time than I had envisioned. Buying groceries and cooking takes a lot more time than I thought!
  2. I was significantly more productive than I had envisioned. Having less time necessitated making it more structured and more efficient.
  3. I ended up getting more done even though I had less time to do it.

I would cook, read books, go to the gym, and do a little side programming during the week, and on the weekend I’d visit my parents (and usually mow their lawn for them). At school, I had three or four times as much free time (at least!), but I didn’t get as much done in a concrete sense. I spent much more time browsing proggit and Google Reader. Not time wasted, per se, but on the whole I was not as productive as I was under greater restrictions.

Books I read this summer:

  • Confessions of an Economic Hit Man
  • Life in Double Time
  • All seven Harry Potter books, including the sixth consumed in one glorious day at the beach
  • Types and Programming Languages (casually, at least)
  • A Thousand Splendid Suns
  • One Hundred Years of Solitude
  • and I’ve started reading more:
    • The Gold Bug Variations
    • Advanced Types and Programming Languages
    • Masterminds of Programming
    • Design Concepts in Programming Languages
    • Refactoring

Even conservatively counting the first three books of Harry Potter as one, that’s still ten books finished and five more started in the last three months. Much better than the, um, one or two that I probably read during the last semester of my undergrad program.

Canvas Spline Toys

I am pretty busy this week. For the past few days, I’ve been doing 8 or more hours of research, as well as packing and organizing my physical and virtual possessions. I hope to have several more posts over the next few days detailing some of the projects, medium and small, that I have worked on in the last few months.

The latest project is a simple interactive spline visualizer. I had a random itch at work to have a way of seeing splines constructed one-piece-at-a-time, so I created an initial mockup using the HTML canvas object. Eric Dramstad extended it with mouse support for rearranging control points, as well as cleaner code.

The blue line is a conventional second-order linear interpolation. From an OpenGL standpoint, it has the nice feature that the interpolated line lies entirely within the triangle defined by the three control points. The corresponding disadvantage is that the line does not pass through the second control point. The green line does the reverse. Instead of interpolating between L1(t) and L2(t), it interpolates between L1(t) and L2(-1 + 2*t). The green spline passes through all three control points, but is thus not nicely contained like the blue spline.

An interested programmer could extend this without too much difficulty. It would be particularly cool to be able to (A) add additional control points, and (B) interactively define new interpolation routines. Even without such features, Eric Dramstad’s interactive version makes for a surprisingly entertaining minute or two!

Sound and Complete

A jury reaches a verdict: guilty! But was she, in fact, innocent after all?

Throughout life, we must make decisions: evaluate, then accept or reject. Life being what it is, there are more ways to screw up than not. The field of formal logic gives us two properties we can use to describe any evaluation procedure, such as the jury above: soundness and completeness. A sound jury will never send a guilty person free, while a complete jury will never send an innocent person to jail. (As an aside, I think the concrete example of a jury and its verdicts is much easier to grok than Wikipedia’s treatment, which says “Informally, a soundness theorem for a deductive system expresses that all provable sentences are true. Completeness states that all true sentences are provable.”) Intuitively, soundness and completeness are orthogonal: you can have any combination of (non-) soundness and (non-) completeness.

The ideal, of course, is sound and complete: the perfect justice system that will never exist. What we have is neither sound nor complete: innocent people are sent to jail, and guilty people go free.

Anyways, soundness and completeness are also pervasively used in static analysis and programming language type theory, each of which are close siblings to formal logic. Well, type theory is more like a mirrored reflection of formal logic, but that’s neither here nor there. A sound type system (or analysis) is one that rejects all erroneous programs, and a complete system accepts all correct programs.

Now, the interesting question is: which is more important? Is it better to have a complete but unsound static analysis, which will report every error that exists, and then maybe some errors that don’t exist? Or better to have the opposite, sound and incomplete, where every error that it reports is an actual error, but it might not find all the issues in the program? I happen to think sound and incomplete is better: first, because sound analyses compose, and second, because many of the properties you really want to have verified (such as “does this rendering look right?”) don’t fall in the scope of the verification system. Complete and unsound certainly has its appeal: after working through all the reported errors (and dealing with the wild goose chases), you’ll know that your program is free of all the bugs that the analysis is capable of finding. Of course, completeness is unachievable for many of the most interesting and critical problems, whereas sound-but-incomplete is more within reach.

What about for type systems? A sound type system is one that is not lenient, in that it rejects all invalid programs plus some number of valid programs. A complete type system accepts all valid programs, and some invalid ones to boot. Which to pick?

The situation isn’t quite as obvious as with a static analysis, for two reasons. The first difference is that a programming language with a complete and unsound (static) type system has a fallback in the form of dynamic typing. In fact, one of the most powerful type system features, dependent typing, is not statically verifiable unless carefully restricted. Dependent type systems therefore blur the distinction between compile-time type checking and runtime type checking, and thus between static and dynamic typing. The second complication is that the arithmetic mean programmer tends to value expressiveness rather more than compile-time verification. Programmers dislike having the computer reject a program that would have run fine, simply because the computer couldn’t make sure it would run fine without actually running it. In short, restrictive type systems drive programmers to more flexible environments. Consider how Mozilla Firefox is designed: a core of statically-typed C++ to do heavy lifting, with dynamically-typed JavaScript running the user interface. Not coincidentally, this static core/dynamic shell is shared by many large video games (like World of WarCraft: C++ and Lua).

The target domain matters, of course. One hopes that the language for a program controlling an airliner’s fuel supply preferred soundness over completeness. And type inference raises another issue: there exist type systems that are sound and complete, but are not amenable to type inference.

Decisions, decisions…

p.s. Since I’m a visual learner, I created a reference chart:

Chart of soundness and completeness properties

Chart of soundness and completeness properties

Overview of Mozilla’s C++ Static Analysis Tools

Manually making pervasive changes in a four million line codebase is all but impossible. Thus, in a bid to make far-reaching changes to the internals of their code, Mozilla has been actively pursuing better static analysis tools.

It turns out that analyzing C++ is remarkably difficult; the stuff of patents and at least one doctoral dissertation. The root cause is the painfulness of parsing C++. First, C++ is simply a large language, syntactically speaking, and due to vendor extensions, there is no single point of reference for the syntax found in the real world. Second, C++ is difficult to parse with automated tools. The C++ grammar, such that it is, does not fit into any of the usual formal language classes like LL(k) or LR(k). This parsing difficulty is partly due to its C heritage. To give one simple example of a problematic construction: return (a)(b); could involve a cast of b to type a, or a function call a(b). Which one applies depends on the type of a, but in a classical compiler, type information is only available after parsing completes! Finally, preprocessor macros create a disconnect between what the user sees in a source file and what the compiler sees in a translation unit. Any analysis tools must either bridge that divide, or risk being rejected for not properly handling macros.

Mozilla has several static analysis projects it runs, which build upon several more in turn. The tools are split into two related but separate categories: static analysis and rewriting. The following is my understanding of the current tool space:

Dehydra is a lightweight static analysis tool for C++. It is implemented as a GCC plugin that passes C++ AST information to JavaScript code, which can examine the provided AST nodes. Among other things, Dehydra is used as the smarts behind a very cool semantic Mozilla code browser called DXR (Dehydra Cross Reference).

Treehydra is Dehydra’s heavy duty companion. It deals with GCC’s internal GIMPLE intermediate representation. This allows Treehydra analyses to be run after any optimization pass in GCC. Treehydra scripts also get access to GIMPLE control flow graphs (CFGs). Access to CFGs enables more advanced path-based analyses. In effect, anything GCC “knows” about a piece of C++ is available to JavaScript code through Treehydra.

Right now, Dehydra and Treehydra require a custom, patched version of GCC. However, in the future, starting with GCC 4.5, plugin support (thanks largely to Mozilla’s Taras Glek!) will obviate the need for a patched GCC. That’s very exciting, especially because it will mean that static analysis of C++ code will become available to orders of magnitude more programmers!

Unfortunately, the *hydras are not sufficient for the task of automated rewriting, because GCC does not preserve all the necessary syntactic information about token positions and macro expansion and such. Thus, more tools are needed.

The primary C++ rewriting tool from Mozilla is called Pork. Pork is a tool for rewriting C++. Unlike GCC, it retains exact syntactic information about the code it parses, as well as the changes made while preprocessing. Pork is built on several other tools: Oink, Elsa, and MCPP. MCPP is a preprocessor that includes annotations in preprocessed source to help reconstruct the original source file. Elsa is a C++ parser, based on the GLR parser generator Elkhound. Finally, Oink is a tool for analyzing the parse trees constructed by Elsa.

Whew! I hope that information is accurate, +/- 10%.

Programs I Currently Have In My Tray

  • Skype
  • iTunes
  • Pageant
  • Twhirl
  • Yahoo! Widgets
  • NetMeter
  • ManicTime
  • ZoneAlarm
  • Logitech Mouse
  • AutoHotkey
  • Mozy Backup
  • Dropbox
  • UltraMon
  • WinSplit Revolution
  • Threeter
  • KeePass
  • Ad-Aware

That’s eighteen(!). If I had to trim the list down to seven, I’d probably choose these as the most useful in day-to-day tasks:

  • iTunes
  • Pageant
  • ManicTime
  • AutoHotkey
  • Mozy Backup
  • UltraMon
  • KeePass