Doing the transcriptions for Glenn Kelman’s presentation earlier this week ended up taking more time than I expected it to. I can type reasonably quickly, 50 words per minute, perhaps 90 if I concentrate hard and stay away from the Backspace key. But a normal rate of speech is about 130 words per minute, so I can only transcribe about a sentence at a time. So the transcription process goes like this: Listen to audio, transcribe a sentence, seek the audio back a few seconds, repeat. Not the most exciting work around.

So I figured, hey, what about having the computer transcribe stuff for me? Windows XP itself doesn’t come with speech recognition built-in, but Microsoft includes a speech recognition engine with Office XP and 2003, and also has one included as part of the free Microsoft Speech API (SAPI 5.1) SDK. And all five versions of Vista come with an updated, more-accurate engine. (For those of you keeping score at home, that would be Home Basic, Home Premium, Business, Enterprise, and Ultimate. Phew!)

Anyways, SAPI itself has a few simple functions that do most of the work for us in doing a transcription. Basically, you take a stream, bind it to a (WAV) file, pass that stream to the speech engine, enable “dictation mode,” and print out what the engine thinks it hears. File to stream to engine to recognized text.

That didn’t seem too complex, so I set out to write a quick C# command-line app to do that. For the most part, even though it wasn’t designed with C# in mind, the COM interop story for SAPI 5.1 is pretty good. Unfortunately, the C#-language binding for the SpStream.BindToFile function is a little iffy. The C++ type signature for the file name is const WCHAR *. Somehow, that got translated into ref ushort. Not so good. I found a post from one Microsoft employee, Dave Wood, which acknowledged this problem with SAPI, and also gave suggested workarounds. However, I figured that, considering how simple the app would be, I might as well just write it in C++.

Well, I did. The hardest part turned out to be learning about the multitudes of string representations in Win32 COM programming, and how to translate between them. A little reading of documentation, sample code, and other examples online, and I had something that worked. No error handling logic in place, but it worked.
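To give a flavor of the string juggling involved, here is a minimal sketch of one such conversion: turning a narrow (char-based) command-line argument into the wide string that a const WCHAR * parameter expects. This is not the code from Transcribe.exe. The helper name widen is hypothetical, and it uses only the standard library's std::mbsrtowcs so that it runs anywhere; a Win32 program would more likely call MultiByteToWideChar or use an ATL conversion helper.

```cpp
#include <cassert>
#include <cwchar>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts a narrow multibyte string (for example an
// argv[] entry) into a wide string, using the current C locale's encoding.
std::wstring widen(const std::string& narrow) {
    // First pass: a null destination asks only for the required length.
    std::mbstate_t state = std::mbstate_t();
    const char* src = narrow.c_str();
    const std::size_t len = std::mbsrtowcs(nullptr, &src, 0, &state);
    if (len == static_cast<std::size_t>(-1)) {
        throw std::runtime_error("input is not valid in the current locale");
    }

    // Second pass: convert into a buffer of exactly that size.
    std::wstring wide(len, L'\0');
    if (len > 0) {
        state = std::mbstate_t();
        src = narrow.c_str();
        const std::size_t written = std::mbsrtowcs(&wide[0], &src, len, &state);
        assert(written == len);
        (void)written;  // avoid an unused-variable warning when NDEBUG is set
    }
    return wide;
}
```

Calling widen("test.wav").c_str() yields a const wchar_t * of the kind a wide-string file-name parameter wants. In the default "C" locale only plain ASCII input is guaranteed to convert; anything else depends on the locale the program has selected.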

For certain values of worked, anyways. By “worked,” I mean that it fed the WAV data to the speech recognition engine, and got results back. The usefulness of those results is another issue entirely.

First, the source of the WAV makes a pretty big difference in recognition accuracy. This makes a bigger difference when using the older version 5.1 recognizer on XP than Vista’s newer version 8.0, but it’s noticeable on both. A WAV file I created using Audacity and a speech-recognition-tuned microphone yielded decent results on XP and great results on Vista. A snippet of the audio from Kelman’s presentation, converted to WAV format, was… spottier. I manually ran Audacity’s “Normalize” command to make the waveform graph more similar to the mic’s recording, and that improved recognition accuracy somewhat. Unfortunately, the results on XP are comical and, at times, almost poetic.

I recorded this sample WAV file, which is just me saying “Computing research has made remarkable advances, but there’s much more to be accomplished. The next ten years of advances should be even more significant, and even more interesting, than the past ten,” which is simply the first two sentences from the abstract of Ed Lazowska’s UWash talk.

Here’s what the version 5.1 recognizer thought it heard: “Computing research has made remarkable advances but there’s much more of a published the next ten years of events this should be even more significant and even more interesting from the past ten.” Close, but not quite. The lack of periods and punctuation is expected, but turning “to be accomplished” into “of a published,” well, not so much.

Vista’s version 8.0 recognizer does much better: “Computing research has made remarkable advances but there’s much more to be accomplished the next 10 years of advances should be even more significant and even more interesting than the past 10.” Flawless! Note that Vista is smart enough to turn “ten” into “10.” Not such a big deal here, but it’s much nicer to read “5,313,852” than “five million three hundred thirteen thousand eight hundred fifty-two.” Anyways, I think that’s pretty darn impressive for a completely untrained speech-recognition system.
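For the curious, turning spelled-out numbers into digits is simple enough to sketch in a few lines. This is purely an illustration of the idea and makes no claim about how Vista's recognizer actually does it. The function names words_to_number and with_commas are hypothetical, and the sketch assumes lowercase English input and non-negative values.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Returns the value of a spelled-out English number, or -1 if the text is
// empty or contains a word the sketch does not know.
long long words_to_number(std::string text) {
    static const std::map<std::string, long long> small = {
        {"zero", 0}, {"one", 1}, {"two", 2}, {"three", 3}, {"four", 4},
        {"five", 5}, {"six", 6}, {"seven", 7}, {"eight", 8}, {"nine", 9},
        {"ten", 10}, {"eleven", 11}, {"twelve", 12}, {"thirteen", 13},
        {"fourteen", 14}, {"fifteen", 15}, {"sixteen", 16},
        {"seventeen", 17}, {"eighteen", 18}, {"nineteen", 19},
        {"twenty", 20}, {"thirty", 30}, {"forty", 40}, {"fifty", 50},
        {"sixty", 60}, {"seventy", 70}, {"eighty", 80}, {"ninety", 90}};
    static const std::map<std::string, long long> scales = {
        {"thousand", 1000}, {"million", 1000000}, {"billion", 1000000000}};

    // Treat "fifty-two" the same as "fifty two".
    for (char& c : text) {
        if (c == '-') c = ' ';
    }

    std::istringstream in(text);
    std::string word;
    long long total = 0;  // completed thousands/millions/billions groups
    long long group = 0;  // the group currently being read (0..999)
    bool any = false;
    while (in >> word) {
        if (word == "and") continue;
        any = true;
        auto s = small.find(word);
        if (s != small.end()) { group += s->second; continue; }
        if (word == "hundred") { group *= 100; continue; }
        auto sc = scales.find(word);
        if (sc == scales.end()) return -1;  // unknown word
        total += group * sc->second;
        group = 0;
    }
    return any ? total + group : -1;
}

// Formats a non-negative number with a comma between every three digits.
std::string with_commas(long long n) {
    assert(n >= 0);
    std::string digits = std::to_string(n);
    for (int i = static_cast<int>(digits.size()) - 3; i > 0; i -= 3) {
        digits.insert(i, ",");
    }
    return digits;
}
```

Chaining the two, with_commas(words_to_number("ten")) gives "10". A real recognizer has the harder job of deciding when a run of words is a number at all, which this sketch ignores.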

Unfortunately, the results on Kelman’s presentation weren’t nearly as good. I took a short clip of his presentation, which I transcribe as “… man. I learned how to do everything. And, uh, a couple of years later I started Plumtree Software with a few of my friends. And if you can start a company, um, everyone will tell you it’s too soon, um. And I’m sure they’ll be right, there were so many things that I didn’t know, ah, when I started Plumtree, and I erred egregiously, uh, and really suffered for it. But, ah, you’ll never know everything…”

First, Vista’s valiant attempt. It gets points, I suppose, for including recognizable phrases from the original audio. Knowing what the real transcription is, we can see a mutilated version of it here, and I think you can get a general sense of what the original audio was saying. Not all the fine points, but the general sense, sure.

A man I learned how to do everything
and got a couple years later I started onto software will hit my friends
and if you can set a company of everyone will tell you that it’s too soon
hung at sure it’ll be right there are some things that I didn’t know how I started entree am a grievously on
and really suffered for it but the deal ever now everything

Meanwhile, here are a few of XP’s attempts. These are all with the exact same input file, mind you. Same input, very different outputs.

Air guard had every hour at our doorstep start of his software NFS at its start out
that are out how data it’s too soon
that ensured underwriter scientists I didn’t know that I saw it and trade vendor who displayed (suffered a heart
out ever known everything

Event that my head.)
and not to use their instead of his software Jennifer and 75¢ of the
outdoor elements out in a season
that ensure the main) so it’s I didn’t know that aside and trade at graciously I suffered during
the demo never know everything

That men have everything at Gottschalks understand outside of his software to the outset if you can start company
founder of how the united states suing
that ensure the main) scientist I didn’t know

It’s really sort of poetic, in its own twisted way.

So, anyways, if you have XP or Vista, and don’t mind tinkering with basically-untested software, you can download Transcribe.exe and test it yourself. XP users will probably need to download and install the SAPI SDK 5.1 in order to get a speech recognition engine installed. Users of both XP and Vista may (or may not) need to install the Visual Studio 2005 C Runtime components, depending on what other software you’ve already got installed.

Usage is simple: pass in the name of a WAV file to read from, and the name of a file (probably .txt) to output the results to. On my computer, a 2.0 GHz Core 2 Duo, it transcribes at roughly 3x realtime.

Having the ability to transcribe without easy speech synthesis felt like having yin without yang. So, enter Say.exe. It’s even simpler than Transcribe. It, too, is a command-line application.
[Screenshot: say.exe usage]
If the first argument is a file, the other arguments are ignored and whatever the file contains is passed to the default TTS voice to speak. If the first argument is not a file, whatever arguments are provided are passed to the default TTS voice to speak.
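The argument rule above is easy to sketch. Say.exe itself is written in C#, so what follows is not its source; it is a C++ illustration of the same decision, with hypothetical function names, and it stops at choosing the text rather than calling any TTS voice. "Is a file" is approximated here as "can be opened for reading," which is an assumption on my part.

```cpp
#include <cassert>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Joins the arguments with single spaces, so that "say hello world"
// would be spoken as "hello world".
std::string join_arguments(const std::vector<std::string>& args) {
    assert(!args.empty());
    std::string joined = args[0];
    for (std::size_t i = 1; i < args.size(); ++i) {
        joined += " " + args[i];
    }
    return joined;
}

// Picks the text to hand to the speech voice. args holds the command-line
// arguments without the program name.
std::string text_to_speak(const std::vector<std::string>& args) {
    if (args.empty()) return "";

    // If the first argument opens as a file, speak its contents and
    // ignore every other argument.
    std::ifstream file(args[0].c_str(), std::ios::binary);
    if (file) {
        std::ostringstream contents;
        contents << file.rdbuf();
        return contents.str();
    }

    // Otherwise the arguments themselves are the text.
    return join_arguments(args);
}
```

One consequence of this rule: a word that happens to match a file name in the current directory gets treated as a file, so the file is read aloud instead of the word.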

Say.exe is written in C#, and targets the .NET Framework version 3. It should run on Vista without any other downloads required, but XP computers that don’t have it yet will need to download .NET Framework 3.0 first.

Please note that both of these apps were written with no real error handling to speak of. I don’t think anything disastrous should go wrong with them, but for all I know they could turn mutant and eat your files, your leftovers, and your houseplants. Then again, I doubt it.


4 thoughts on “Transcribe-n-Say”

    • Nonsense, dani! There were plenty of words there you understood.

      I’ll try to translate some of the more obscure jargon, though:

      SDK
      Software Development Kit, a bunch of computer code and technical documentation on a specific topic that someone (usually a large software company) makes available so that other people can use the code that company has written. So, in this case, the topic is speech recognition, the large company is Microsoft, and the other person (a.k.a. third-party developer) is me.
      SAPI
      Abbreviation for Speech API
      API
      Application Programming Interface. When a software developer writes a big, complex piece of code, like code to do speech recognition, and they want to make that code available for other people to use, they do so by writing (and documenting) an API. An API is supposed to separate the “what” from the “how.”

      Think of a big online retailer. Internally, it’s a very large company, with warehouses all over the country and thousands of employees, not to mention millions of lines of computer code running their website. But as a customer, you don’t see all that complexity. You go to their site, search for the latest Tolkien anthology, click “Buy Now”, and boom, it arrives on your doorstep a few days later. You didn’t have to worry about which warehouse the book came from, or whether UPS would ship it cheaper than FedEx, or anything else like that. The same sort of thing applies for programming interfaces, too. As a programmer writing code that interacts with the Speech API, I don’t really care HOW it does what it does. I just want to give it some stuff, and get some stuff back, and the steps in between are their problem, not mine.

      WAV
      A file format for audio, like MP3. WAV is a much simpler format than MP3, which makes it much easier for programs to figure out what sounds are actually in the file.
      C#, C++
      Different programming languages. C# is newer and was designed to avoid some of the flaws that people found in older languages like C++.
      COM interop, type signatures
      Trifling technical details
      Strings
      Strings are variables (like x in math) that hold words instead of numbers.
      Audacity
      Free program for editing sound files.
      Command-line application
      Run from within cmd.exe, as shown in the screenshot above
