Speech Transcription using Amazon's Mechanical Turk Rivaling Traditional Transcription Methods
Satanjeev Banerjee and Matthew Marge
SLTC Newsletter, July 2009
Speech transcription - the process of writing down the words spoken in a snippet of audio - is useful for many reasons. It helps humans, because consuming information from text is often faster than doing so from audio; it helps automated information storage and retrieval systems; and it is a necessary first step for speech research - for speech recognition, understanding, summarization, synthesis, etc. However, manual speech transcription is expensive and time-consuming, sometimes taking up to ten times the duration of the audio being transcribed. Amazon's Mechanical Turk offers an opportunity to lower the cost of manual transcription while maintaining relatively high quality.
In 2006, Amazon.com, Inc. started Mechanical Turk (MTurk for short) - a website and web application that brings together software developers and researchers ("requesters") who have tasks that need human expertise, and human workers ("providers") who are willing to perform these tasks for monetary compensation. The tasks most suitable for this system are those that can be trivially accomplished by (at least some) humans, but that cannot be done well, if at all, by automatic means. For example, while almost any human can quickly and accurately tell which way, left or right, a person is looking in a photograph, doing so is often difficult for a computer program. MTurk provides a platform for requesters to have such tasks completed through the following sequence: Requesters upload tasks to MTurk's website (manually, or automatically through an API). These tasks are visible to providers, who can choose to accept one or more of them. Once providers complete the tasks, the requester can retrieve and review the providers' outputs, and pay them for their efforts if the work appears satisfactory. MTurk keeps track of each provider's task-completion rate and other performance measures, and requesters can restrict their tasks to providers whose past performance meets or exceeds a chosen level.
The task of speech transcription is well suited for MTurk - it is relatively easy for humans to do and difficult for speech recognizers. Additionally, speech transcription can be split into small jobs for multiple workers to do in parallel. (A good description - with instructions - of how to use MTurk for speech transcription for human consumption is provided by self-described journalist and programmer Andy Baio on his blog.) Some commercial services, like CastingWords, use MTurk to provide transcriptions; such companies add value by handling the details of uploading and downloading jobs, performing quality assurance, etc.
Unlike transcriptions done for human consumption, however, transcriptions for speech research must be more "faithful" - with speech phenomena such as false starts and repeats transcribed accurately rather than cleaned up. We undertook a small exploratory study to examine the feasibility of using Amazon's Mechanical Turk to create faithful transcriptions of speech. In our experiment, we used 16 utterances ranging from 8 to 32 seconds in length. Each utterance was spoken by a single person (with multiple speakers across the dataset, most of them native speakers), and contained a simple command, e.g. "Move forward for a few seconds, and then turn left". The audio was recorded using close-talking microphones, and was of uniformly high quality. We had five MTurkers transcribe each utterance, and paid each MTurker $0.10 per transcription. (We did not experiment with different payment amounts, and believe $0.10 to be high for the lengths of utterances we used.) MTurkers were instructed to transcribe faithfully - transcribing all false starts, repeats, etc., and marking them with special notation (angle brackets, parentheses, etc.).
We evaluated MTurkers' transcriptions against transcriptions done in-house, and found that individual MTurkers' transcriptions had a word-level accuracy of 91%, averaged across all transcriptions of all utterances. When false starts were cleaned up from both the MTurkers' transcriptions and the references, the average accuracy increased to 95%. It is likely that transcripts of even higher accuracy can be extracted from the five transcriptions per utterance by word-aligning the transcripts and doing word-level voting. We did not experiment with this idea, but to get a feel for the accuracy level that might be achievable, we picked the best of the five available transcripts for each utterance (with false starts retained), and found an average accuracy of 98%. Without in-house transcripts we cannot know which transcript is the best, but this result shows that for each utterance there is at least one high-quality transcript. With a voting scheme, it may be possible to extract this transcript.
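To make the measurements above concrete, here is a minimal sketch in Python. It rests on assumptions that are ours and not part of the study: word-level accuracy is taken to be one minus the word error rate (word edit distance divided by reference length), and text is normalized only by lowercasing and splitting on whitespace. The function `consensus_transcript` is a hypothetical stand-in for a voting scheme: rather than aligning words and voting on each one, it simply selects the transcript that disagrees least with the other four. We did not run either function in our experiment.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def words(text):
    """Assumed normalization: lowercase and split on whitespace."""
    return text.lower().split()


def word_accuracy(reference, hypothesis):
    """Assumed definition: 1 - (word edit distance / reference length)."""
    ref = words(reference)
    if not ref:
        raise ValueError("reference transcript is empty")
    errors = edit_distance(ref, words(hypothesis))
    return max(0.0, 1.0 - errors / len(ref))


def consensus_transcript(transcripts):
    """Return the transcript with the smallest total word edit distance
    to all the others (a simple proxy for voting, not word-level voting)."""
    tokenized = [words(t) for t in transcripts]

    def total_distance(i):
        return sum(edit_distance(tokenized[i], other)
                   for k, other in enumerate(tokenized) if k != i)

    best = min(range(len(transcripts)), key=total_distance)
    return transcripts[best]
```

Given a reference and the five MTurk transcripts of an utterance, `word_accuracy` scores each transcript, while `consensus_transcript` needs no reference at all, which is the situation a requester faces in practice.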
In a separate study conducted by Dr. Dan Melamed, Principal Member of Technical Staff at AT&T Labs Inc., MTurk was evaluated for transcribing speech of low audio quality. This study used speech recorded over cell phones, with uniformly low audio quality. The utterances used in the experiment were short, each containing a business name and a location, and each spoken by a single speaker. (There were different speakers across the dataset, most of whom were native speakers.) Five MTurkers transcribed each utterance, and their transcriptions were merged by utterance-level voting. While individual MTurkers were found to be uniformly and substantially less reliable than in-house transcribers, the combined transcriptions were generally of high quality. Specifically, after removing the 25% of utterances on which MTurkers disagreed most, the accuracy of the remaining transcriptions was found to be equivalent to that of in-house transcription. Further, these accurate combined transcriptions cost an order of magnitude less than in-house transcription.
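The merging step described above can be sketched as follows. This is an illustration under our own assumptions, not the procedure used in the AT&T study, whose voting rule and disagreement measure are not described here. We assume that a "vote" is an exact match between whole transcripts after lowercasing and collapsing whitespace, that disagreement is measured by the share of workers who did not match the winning transcript, and that ties are broken by input order. The names `vote` and `merge_and_filter` are hypothetical.

```python
from collections import Counter


def normalize(text):
    """Assumed normalization: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def vote(transcripts):
    """Utterance-level vote: return the most common normalized transcript
    and the fraction of workers who produced it."""
    counts = Counter(normalize(t) for t in transcripts)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(transcripts)


def merge_and_filter(utterances, drop_fraction=0.25):
    """utterances maps an utterance id to its list of worker transcripts.
    Returns the winning transcript for every utterance that survives after
    dropping the drop_fraction of utterances with the lowest agreement."""
    voted = {uid: vote(ts) for uid, ts in utterances.items()}
    # Lowest agreement first, so the most disputed utterances are dropped.
    ranked = sorted(voted, key=lambda uid: voted[uid][1])
    n_drop = int(len(ranked) * drop_fraction)
    return {uid: voted[uid][0] for uid in ranked[n_drop:]}
```

Exact-match voting suits short utterances such as a business name plus a location; for longer utterances, where five workers rarely produce identical strings, a word-level scheme would likely be needed instead.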
When asked whether he would use Mechanical Turk for speech transcription purposes, Melamed said, "Yes, we hope to use MTurk regularly in the future."


Ian McGraw - http://wami.csail.mit.edu - July 27, 2009, 2:33 pm
Great article! I think more speech researchers should be aware of this amazing tool.
Here in MIT's Spoken Language Systems group, we've found Amazon Mechanical Turk extremely useful for our research with large data sets. Here are two recent publications that make use of MTurk:
http://wami.csail.mit.edu/papers/QuizletInterspeech2009.pdf
http://wami.csail.mit.edu/papers/QuizletSlate2009.pdf
In these papers, we use Mechanical Turk to verify that corpora collected using educational games can be transcribed automatically using context. We wouldn't have been able to transcribe so much data so quickly without Mechanical Turk!