Thursday, July 23, 2009

Uncovering speech-to-text labor

My most recent book concerns a form of information labor I refer to as "speech-to-text" labor: the work of transcribing and translating, whether after the fact or in real time, a person's spoken words to printed text. For over a century, the use of special stenographic systems of listening, memorization, and notation has represented one means to accomplish this labor, aided by an ever-changing mix of technologies, from Stenotype keyboards to laptop computers. Another means, dating back not quite as long, has employed speech recording and playback devices, from the wax cylinders of early dictation machines to the embedded digital audio recording chips of today. But either way, a human transcriber/translator was always involved at some point in the process.

For many decades, however, a third means to accomplish speech-to-text labor has been in the works: one which attempts to substitute computational algorithms for human listening and judgment, these days often quite successfully. Whether for producing records of courtroom testimony, displaying captions for late-night television, or developing transcripts of global wiretapping efforts, the act of interpreting, understanding (to a degree), and transcoding human speech seems to be a task which, given a smart enough program and a fast enough machine, computers ought to be able to do.

An interesting posting over at the BBC technology blog caught my eye recently because I think it exemplifies the fact that even with the latest versions of these kinds of technologies, human labor is nearly always still present in the speech-to-text loop, sometimes because humans provide more accuracy in the final product, and sometimes because humans represent a lower-cost, more scalable, more flexible way of accomplishing these tasks. The case in question is a venture called Spinvox, "a great British technology success story, using brilliant voice-recognition software to decode your voicemail messages and turn them into text." The blogger's question was, do machines really decode these voicemails, or do humans?

Still wishing to be convinced that it was people not machines listening to my messages, I tried another tactic. It was suggested to me that if I recorded a message and then sent it five times in a row to my mobile, then a computer would provide the same result every time. Well my message was deliberately stumbling and full of quite difficult words - including my rather tricky name. But every version that came back to me in text form was radically different - and pretty inaccurate. So unless Spinvox is employing a whole lot of rather confused computers to listen and transcribe messages, it sounds like the job was being done by a variety of agents.

Why does this matter? After all Spinvox has always been clear that there is a human element in the work - though when it says it can call on "human experts for assistance", you might imagine Cambridge boffins rather than overseas call centre staff. But the fact that so much of its work still appears to rely on people simply listening and typing could have implications for its finances and its data security.

I don't find it surprising that Spinvox would rely on such a spatial, temporal, skill and wage division of labor — farming snippets of complicated translations out, 24 hours a day, to a dispersed network of highly-structured and inexpensive spots around the globe for nearly-instant human decoding. I do find it interesting that "security" is the main concern here. The idea that a snippet of a voice mail, decoded by a low-wage call-center worker, could represent a security risk to the caller or the receiver reminds me of the late 19th century concerns (which I explored in my first book) that telegraph messenger boys would find insider investment knowledge by peeking into the printed versions of telegrams that they hand-carried into and out of the electrical wired networks. (Who knows, if this worry over the security of transcribed and translated voicemail takes hold, it might motivate the same kind of solution for some as the problem did a century ago — writing and speaking in code.)

For my part, I think the most interesting aspect of this case is that the boundary between what we think of as a problem amenable to a technological fix (speech-recognition software) and one amenable to a spatial/social fix (situating countless individuals in time and space who can provide piecemeal labor on demand) is still very blurry. Voicemail itself, especially when accessible through a personal, mobile device, is a technology meant to enable its privileged user to arrange the time and space of his or her own working day for maximum convenience, flexibility, and productivity. We need to remember that the freedom of one group's mobility and flexibility, even in such a small case as this, may very well come at the cost of another group's fixity and constraint.

(UPDATE: The story over at the BBC technology blog continues for another post, with a response from the firm.)
