Rube Goldberg Speech Macros

It’s been about a year since Microsoft released Speech Macros for Vista.

One thing I found very interesting was how similar their system seems to what I ended up creating for myself two and a half years ago: XML syntax for defining grammars, and AutoHotkey-like syntax for executable commands! Perhaps there simply aren’t that many ways to do flexible speech macros.

Anyways, the backstory: Sophomore year of college, I had more reading than usual to do for classes. I like reading and listening to music, but my bed was across the room from my desk, and I was growing tired of getting up every fifteen minutes to skip songs that weren’t conducive to reading. “I know!” I thought. “I’ve got an omnidirectional microphone with an extra-long cord. If I string it under the rug, I can reach my bed and control iTunes with my voice!” Great vision, but how would I go about doing such a thing?

The first step was scripting iTunes. It turns out that Apple publishes a Windows iTunes COM SDK that allows JScript (Microsoft’s version of JavaScript) to manipulate iTunes programmatically. Awesome! In fact, I had already been using this to automate an action I found myself repeating several times a day. When I heard a song I really didn’t like, I would add “delete” to the comments field in iTunes and skip to the next track. Once a week or so, I would delete all the songs with “delete” in the comments field. It’s a simple hack that worked remarkably well. Anyways, I configured IntelliType to execute the following simple JScript when I pressed the “media” button on the keyboard. This greatly reduced the friction involved with marking songs for deletion.

var iTunesApp = WScript.CreateObject("iTunes.Application");
var currentTrack = iTunesApp.CurrentTrack;
if (currentTrack) {
	iTunesApp.NextTrack();
	currentTrack.Comment = currentTrack.Comment + ' delete';
} // else nothing to do...
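
The weekly cleanup pass could be scripted through the same COM interface. Here’s a minimal sketch of the idea, assuming the standard LibraryPlaylist and Tracks collections from the iTunes SDK (I won’t vouch for every detail):

// Sketch: remove every library track whose comment contains "delete".
var iTunesApp = WScript.CreateObject("iTunes.Application");
var tracks = iTunesApp.LibraryPlaylist.Tracks;
// Walk backwards so deleting a track doesn't shift the indices we haven't visited yet.
for (var i = tracks.Count; i >= 1; i--) {	// COM track collections are 1-based
	var track = tracks.Item(i);
	if (/delete/.test(track.Comment)) {
		track.Delete();
	}
}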

The other half of the problem was to invoke that script. Specifically, how could I invoke it with a spoken command? I considered writing a C# program, but I wanted something a little more lightweight. Once again, JScript came to the rescue! This time, it was in the form of Microsoft’s free Speech API SDK 5.1. This SDK provided a very simple speech recognition engine, as well as documentation on how to use the engine from JScript. It was, admittedly, some of the most unfriendly documentation I’d ever seen, but I managed to get up and running after a day or two. If it hadn’t been for the sample code they provided, it probably would have taken closer to a week or two. The SAPI interfaces look like this; by now, I’ve forgotten exactly what each piece does:
var RContext = new ActiveXObject("Sapi.SpSharedRecoContext");
RContext.EventInterests = interests;
WScript.ConnectObject(RContext, "RContext_"); // Hook up event listeners on a by-name basis...
var Grammar = RContext.CreateGrammar();
var Recognizer = RContext.Recognizer;
Recognizer.State = SRSAlwaysActive;

In any case, I soon had a simple JScript-driven speech-to-text echoer. It would listen for English sentences, and print to the screen whatever the engine thought it heard. JScript’s native support for regular expressions made it dead easy to search for key words and phrases in the output before it was printed (a quick sketch of that matching follows the list below). This would have been enough for my purposes, except for one major problem. Much like people seeing faces in toast, because the speech engine was expecting spoken English, it heard spoken English — even when all that the microphone was picking up was random noise from the room. Since I was using this while listening to music, song lyrics proved to be very confusing to the system as well. I ended up with a bunch of “recognized” phrases like these, or even stranger:

amnesty
the DN
and newington sea and then moves
a ton
of
top
and with Ababa and being
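
For what it’s worth, the keyword matching itself was nothing fancy: just a regular expression run over each echoed phrase. A sketch of the idea, with the trigger words invented purely for illustration:

// Scan each recognized phrase for trigger words before echoing it.
// The phrases and actions here are made-up examples, not the real rules.
function handlePhrase(text) {
	if (/skip|next track/i.test(text)) {
		WScript.Echo("** would skip the current track **");
	} else if (/pause/i.test(text)) {
		WScript.Echo("** would pause playback **");
	}
	WScript.Echo(text);
}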

Luckily, there’s a common, easy solution to this problem in the speech recognition world: do less. Specifically, have the engine listen for a limited set of phrases, rather than arbitrary speech. This allows the engine to be much more discriminating when deciding whether or not what it hears in the microphone is one of the few spoken phrases you asked it to listen for.

The Speech API in Windows XP (Vista had yet to appear in retail stores, and even then, I didn’t try Vista for nearly a year) includes a way of creating textual grammars for the speech engine. The main syntax looks like this, with optional phrases (<O>), lists of alternatives (<L>), and required phrases (<P>):

<RULE NAME="bloglines" TOPLEVEL="ACTIVE">
 	<O>please</O>
	<L>
  		<P>go to</P>
		<p>open</p>
	</L>
	<P>bloglines</P>
	<O>please</O>
</RULE>

You can speak any one of eight phrases to trigger the above rule: open bloglines, open bloglines please, please open bloglines, and please open bloglines please, plus the four phrases obtained by substituting go to for open. As the grammar writer, you can also require words that must be enunciated extra-clearly, embed “free form” sections in rules (for things like asking for addresses or names), and have rules reference other rules. It is a pretty powerful system, once you wrap your head around it.
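
To put a grammar like that to use, it has to be loaded into the Grammar object created in the earlier snippet. As best I can reconstruct the SAPI automation calls, that step looked roughly like this; the method names come from the ISpeechRecoGrammar interface, but the constant values and the file name are placeholders worth double-checking:

// Rough sketch: load the XML grammar file and activate the "bloglines" rule.
// "macros.xml" is a placeholder name; the constant values are best guesses.
var SLODynamic = 1;	// SpeechLoadOption: grammar may be modified at runtime
var SGDSActive = 1;	// SpeechRuleState: rule is active and listened for

Grammar.CmdLoadFromFile("macros.xml", SLODynamic);
Grammar.CmdSetRuleState("bloglines", SGDSActive);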

So, the speech engine can now look for a small selection of phrases like “iTunes pause” and “skip track please”. The next question is, how does the system determine what action to carry out when a rule is recognized?

The original scheme, of searching the recognized text for keywords, could work. However, it’s possible to define rules whose many alternative phrases share no keywords at all. Plus, text-based searching would be a pain to keep up to date as the grammar evolves. Any changes would have to be made twice: once in the grammar, and once in the search code. That’s a gross violation of the Don’t Repeat Yourself principle.

Luckily, speech grammar rules can be given a name that is passed to the JScript code when a rule is recognized. The snippet above defines a rule named bloglines. No matter which of the eight phrases you speak, the same rule name will be passed in, so the code can simply look for the rule name, instead of the spoken text.

Then the question becomes: what code should be executed for a given rule? This is, I think, the cleverest part of the system: rule names map to filenames. I used only AutoHotkey programs, but the system would execute the first file in the macros directory that started with the rule name. Specifically, my code ran dir rulename* /b to find a filename starting with the given rule name. For example, the rule bloglines would cause a script called bloglines.ahk, if it existed, to be executed. This kept the core of the JScript program very small:

function RContext_Recognition(num, pos, type, result) {
    // Ignore events with no phrase info (otherwise Rule.Name below would blow up).
    if (!result.PhraseInfo) { log("not recognized"); return; }
    var text = result.PhraseInfo.GetText();
    var name = result.PhraseInfo.Rule.Name;
    log(text + "::" + name);
    var toRun = getFileName(name);
    if (toRun != "") Shell.Run(toRun);
    if (RegExp("exit|good ?bye").test(text)) WScript.Quit();
}
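
The getFileName and log helpers aren’t shown. A plausible reconstruction, using WScript.Shell’s Exec for the dir lookup described above (the real version logged to a dated file; “speech.log” here is just a placeholder):

var Shell = WScript.CreateObject("WScript.Shell");
var FSO = WScript.CreateObject("Scripting.FileSystemObject");

// Run "dir rulename* /b" and return the first matching filename, or "" if none.
function getFileName(ruleName) {
    var exec = Shell.Exec("cmd /c dir " + ruleName + "* /b");
    var lines = exec.StdOut.ReadAll().split("\r\n");
    return lines[0] ? lines[0] : "";
}

// Append a timestamped line to the log file.
function log(message) {
    var file = FSO.OpenTextFile("speech.log", 8, true);  // 8 = ForAppending, create if missing
    file.WriteLine(new Date() + "  " + message);
    file.Close();
}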

By using AutoHotkey, I could essentially have the executed rules do anything: play sounds, open programs, move the mouse, type text, run other programs, whatever.

But back to the task that was at hand, or, perhaps, at lips: remote voice control of iTunes. At this point it was a simple matter to hook up the pieces I’d put together. I set up a rule to recognize “eye tunes delete this track” and hooked it up to a one-line AHK script that would emit the same keycode generated when I pressed the media button. And behold, it worked!

This is the part that makes me call it a Rube Goldberg system, though. Let’s trace the chain of events that would happen when I, reclining in bed with a book, said the magic words:

  1. The “zeroth” step, which happened even before I started speaking, was that my JScript would run and give the Speech API a grammar file to parse.
  2. As I started speaking, the speech recognition engine would examine the audio data streaming in from the microphone. Once I’d said enough, it determined that I’d spoken a phrase that triggered a rule my JScript was interested in.
  3. The speech engine invoked a callback function from my JScript code, passing it information about the invoked rule, including rule name and recognized text.
  4. My callback logged the rule name and recognized phrase text to a dated log file.
  5. My callback searched the current working directory for something to execute: any file whose name started with the rule name.
  6. Having found a file, my JScript callback passed it to the shell to execute. Since AHK files are scripts rather than executables, this worked only because AutoHotkey registers itself with Explorer to interpret .ahk files with the real AutoHotkey.exe when they are opened.
  7. The AHK file emitted the virtual key/scan code combination ({vkFFsc16D}) corresponding to the media key on my keyboard.
  8. The IntelliType software recognized that the media key had been pressed, and dutifully invoked the iTunes-controlling JScript I had originally attached to that key.
  9. This second JScript file used iTunes’ COM automation interface to programmatically mark the playing track with “delete” and advance to the next track.

Complicated, eh? Using IntelliType instead of directly executing the iTunes script added just an extra dash of unnecessary intricacy to an already cobbled-together system.

And how did it perform? Well, when it worked, it worked great. I could implement new ideas in a minute or two, which was very positive reinforcement. But, sadly, even with a very restricted grammar, there were still too many false positives caused by ambient noise and song lyrics. After spending a week writing the framework, I only used it for a few days before becoming frustrated and shelving the whole project. Even without music playing, it sometimes took several attempts to have the system recognize what I was saying. And don’t get me wrong — with an omnidirectional microphone several feet from my mouth, I was basically setting up the system to fail. Still, I was somewhat disappointed that speech recognition didn’t do a better job of discriminating between my voice and music. Not that it was surprising, since I was using the free and not-cutting-edge recognizer that came with the Speech SDK. Vista has a more advanced recognizer, but in later experiments, I found that even it wasn’t up to the task of dealing with a noisy room.

Oh well. The system was a lot of fun to write and see working. I’ve always been fascinated by systems with many moving parts. Having those “moving parts” be completely virtual doesn’t dampen the excitement one bit!
