Interview: MTurk NAACL Workshop Organizers Talk Crowdsourcing, Speech, and the Future of Unsupervised Learning
SLTC Newsletter, July 2010
Last month, a NAACL workshop brought together researchers in speech and NLP who use crowdsourcing services like Amazon's Mechanical Turk and Crowdflower. We had a chance to interview the organizers about the workshop, the future of speech-related work on Mechanical Turk, and the prospects of unsupervised learning given the rise of crowdsourced data.
Crowdsourcing has seen a dramatic rise in popularity recently. With its growth has come the emergence of several tools that allow data collection and annotation to be serviced by a large workforce - one willing to perform these tasks for monetary compensation. Our previous article discussed the benefits of using one crowdsourcing tool, Amazon's Mechanical Turk (MTurk for short), for speech transcription. Other crowdsourcing examples include Crowdflower, CastingWords, HerdIt, reCAPTCHA, and Wikipedia.
Last month, a NAACL workshop brought together researchers in speech and NLP who use crowdsourcing. The "Creating Speech and Language Data with Amazon's Mechanical Turk" workshop provided a forum for crowdsourcing researchers to share cutting-edge work on a variety of topics, including research using crowdsourcing, toolkits to use with MTurk, and lessons learned. The workshop was organized by Dr. Chris Callison-Burch and Dr. Mark Dredze, researchers at Johns Hopkins University.
We had a chance to interview Chris and Mark about the workshop.
SLTC: Would you consider the workshop a success? What do you think attendees learned most from the workshop?
Yes! We had a huge attendance, great participation, many positive comments, and people who want to do it again. I think the best outcome was the sharing of best-practice knowledge among the participants. It was clear many people had learned valuable lessons in how to use MTurk. I know I learned a lot about how to attract turkers (pay turkers to recruit other turkers, advertise in the turkers' native language), how to cultivate a trained workforce (pay people bonuses, use qualification HITs, write detailed directions), and how to break up tasks into manageable chunks (think of short HITs that don't require much training, chain many HITs together to get a more complex output). I got to meet other people doing similar things to what I am doing, and we shared ideas and advice.
Another significant takeaway was the importance of quality control and of the design of the HITs. The contributors on Mechanical Turk (and on most other crowdsourcing platforms) are not known to the person requesting the work. Consequently, we can't assume things about them. We can't assume that they are experts in any particular subject area. We can't assume that they are paying close attention. We can't assume that they are behaving conscientiously rather than simply clicking at random. Therefore it is important to design the tasks so that (a) they have clear, concise instructions that non-experts can understand, and (b) they include quality control checks, preferably with gold standard data, that allow us to separate the bad workers from the good workers (and there are a surprisingly large number of good workers on Mechanical Turk). I think that an excellent way of accomplishing both of these things is to design the task iteratively: do many of the items yourself, and ask friends from outside your field to try it, before doing a large-scale deployment on MTurk. That will give you a sense of whether it's doable, and will let you collect some "gold standard" answers that you can insert into the Turk data to detect cheaters. Another route is redundant annotation of the same items. Because labor is so cheap on Mechanical Turk, it's perfectly feasible to redundantly label items and take a vote over labels (assuming that Turkers perform conscientiously in general, and that their errors are uncorrelated).
For complex tasks, it is also good practice to break them down into smaller, discrete steps. For instance, in my EMNLP 2009 paper, I used Mechanical Turk to test the quality of machine translation using reading comprehension tests. I did the entire pipeline on Mechanical Turk, from having Turkers create and select the reading comprehension questions, to administering the test by having other Turkers answer the questions after reading machine translation output, to hiring Turkers to grade the answers. It was a simple pipeline, but it allowed me to accomplish a complex task in a creative way.
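Editor's note: the two quality-control routes described in this answer (gold standard checks to separate good workers from bad, and a vote over redundant labels) can be sketched in a few lines of Python. This is only an illustrative sketch, not the organizers' code. The function names, the `(worker, item, label)` data layout, the 0.7 accuracy threshold, and the toy data are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def worker_gold_accuracy(judgments, gold):
    """Fraction of gold-standard items each worker labeled correctly.

    judgments: iterable of (worker_id, item_id, label) tuples (assumed layout).
    gold: dict mapping item_id -> known correct label.
    Workers who never saw a gold item do not appear in the result.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for worker, item, label in judgments:
        if item in gold:
            total[worker] += 1
            correct[worker] += int(label == gold[item])
    return {worker: correct[worker] / total[worker] for worker in total}


def majority_vote(judgments, trusted_workers):
    """Most frequent label per item, counting only trusted workers.

    Ties go to the label that was seen first for that item.
    """
    votes = defaultdict(Counter)
    for worker, item, label in judgments:
        if worker in trusted_workers:
            votes[item][label] += 1
    return {item: counts.most_common(1)[0][0] for item, counts in votes.items()}


def aggregate(judgments, gold, min_accuracy=0.7):
    """Filter workers by gold accuracy, then vote on the non-gold items.

    min_accuracy is a hypothetical threshold chosen for illustration.
    """
    accuracy = worker_gold_accuracy(judgments, gold)
    trusted = {w for w, acc in accuracy.items() if acc >= min_accuracy}
    labels = majority_vote(judgments, trusted)
    return {item: label for item, label in labels.items() if item not in gold}


# Toy data: g1 and g2 are gold items hidden among the real items x1 and x2.
gold = {"g1": "pos", "g2": "neg"}
judgments = [
    ("w1", "g1", "pos"), ("w2", "g1", "pos"), ("w3", "g1", "neg"),
    ("w1", "g2", "neg"), ("w2", "g2", "neg"), ("w3", "g2", "pos"),
    ("w1", "x1", "pos"), ("w2", "x1", "pos"), ("w3", "x1", "neg"),
    ("w1", "x2", "neg"), ("w2", "x2", "neg"), ("w3", "x2", "pos"),
]

final_labels = aggregate(judgments, gold)
print(final_labels)
```

In this toy data, worker w3 misses both gold items and is excluded before the vote, so only w1 and w2 contribute to the final labels. A real deployment would need to decide how to treat workers who saw too few gold items to judge reliably.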
SLTC: What do you foresee as the future of speech-related work with Mechanical Turk?
Certainly, if we want to collect speech data, this is a good way to get a huge diversity of data (different languages, accents, dialects, background noise). Most speech training data is fairly clean and consistent. This tackles the problem from the opposite end. The data you could collect is so diverse that it makes the problem much more challenging, and more realistic. You can also use MTurk for speech annotation, etc., and I think Chris's NAACL paper speaks to that.
Transcribing English data is perfectly feasible on Mechanical Turk. It's unbelievably inexpensive. In the paper that I wrote with Scott Novotney, a PhD student here at Hopkins, we collected transcriptions at a cost of $5 per hour of transcribed speech. Let me say that again. It cost us $5 per hour transcribed. This is a tiny fraction of what we normally pay professionals to transcribe speech. The costs are so low that a successful business, CastingWords, has sprung out of doing transcription on Mechanical Turk.
In my opinion, English is not so interesting. Personally, I'm really interested in diversifying to languages for which we don't already have thousands of hours' worth of transcribed data. I'm interested in creating speech and translation data for low-resource languages. It is still an open question whether we can find speakers of our languages of interest on Mechanical Turk. To that end, I have been collecting translations for a number of languages to see if it is feasible. So far I have experimented with using Mechanical Turk to solicit translations for Armenian, Azerbaijani, Basque, Bengali, Cebuano, Gujarati, Kurdish, Nepali, Pashto, Sindhi, Tamil, Telugu, Urdu, Uzbek, and Yoruba. My initial experiments were simply to determine whether there were any speakers of those languages on MTurk, by asking them to translate a list of words and to self-report which languages they speak.
Here is the breakdown of the number of speakers of each language from Chris's initial results:
SLTC: You both mentioned in the overview workshop paper that the affordability of collecting MTurk data means that unsupervised learning work may lose popularity. Can you share your viewpoints on this issue?
This is a point of disagreement between us. I still think there is a big market for unsupervised learning. MTurk has its limits, and not every task can be set up on MTurk. Instead, I think that there are many problems for which we want supervised solutions, but it's difficult to invest in the data. Those problems, which may not be studied at all at the moment, will get attention as we can quickly generate data for those tasks.
Yes, we disagree on this point. In general, labeled training data is exponentially more valuable than unlabeled training data. Many conference papers about unsupervised learning motivate the research by claiming that labeled training data is too hard to come by or too expensive to create. Mechanical Turk lets us create data extremely cheaply. My conclusion is that if we want to actually make progress on a task, we ought to collect the data. That has a greater chance of succeeding.
I think that it also gives rise to other interesting research questions: Can we predict the goodness of an annotator? If we solicit a redundant label, what is the chance that it'll be better than the ones that I have collected so far? It also opens up interesting research directions in semi-supervised learning and active learning, as opposed to fully unsupervised learning. Most of the active learning papers that I have read are simulated. This is a platform that allows you to do real, non-simulated active learning experiments. It would also allow you to test things like whether the items that the model selects for labeling are somehow inherently harder than randomly selected items.
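Editor's note: one simplified way to reason about what an extra redundant label buys is the textbook binomial calculation below. It is a sketch under strong assumptions that do not come from the interview: labels are binary, every annotator is correct with the same probability `p`, and errors are independent (the "uncorrelated errors" condition mentioned earlier). The function name is made up for this example.

```python
from math import comb


def majority_vote_accuracy(p, n):
    """Probability that a majority of n independent annotators is correct.

    Assumes binary labels and that each annotator is right with probability p.
    n must be odd so that the vote cannot tie.
    """
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd number")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    needed = n // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(needed, n + 1)
    )


# With annotators who are each right 80% of the time, the vote is right
# roughly 80%, 90%, and 94% of the time for 1, 3, and 5 labels per item.
for n in (1, 3, 5):
    print(n, round(majority_vote_accuracy(0.8, n), 3))
```

Under these assumptions the gain from each additional pair of labels shrinks quickly. Real workers differ in skill and their errors may be correlated, which is why predicting annotator quality remains a research question rather than a formula.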
As Chris mentioned, one of the key lessons learned from the workshop concerned the design of MTurk HITs (Human Intelligence Tasks). Designing the HIT is only a fraction of the work that goes into a project built on MTurk data, but it can have a strong impact on the quality of the results. In this sense, HIT design is where the requester has the most control over MTurk worker reliability. The general consensus from the workshop was that short, simple tasks are preferable. The workshop showed that MTurk and other crowdsourcing tools are feasible for a variety of speech and natural language tasks. It will be interesting to see how crowdsourcing as a resource evolves over time in speech and language research.
- Chris Callison-Burch (2009). "Fast, Cheap, and Creative: Evaluating Translation Quality Using Amazon's Mechanical Turk," in Proceedings of EMNLP, Singapore.
- Scott Novotney and Chris Callison-Burch (2010). "Cheap, Fast and Good Enough: Automatic Speech Recognition with Non-Expert Transcription," in Proceedings of NAACL, Los Angeles, CA.
- Chris Callison-Burch and Mark Dredze (2010). "Creating Speech and Language Data With Amazon's Mechanical Turk," in Proceedings of the NAACL Workshop on Creating Speech and Language Data With Amazon's Mechanical Turk, Los Angeles, CA.
If you have comments, corrections, or additions to this article, please contact the author: Matthew Marge, mrma...@cs.cmu.edu.