The Blizzard Challenge
Alistair Conkie
SLTC Newsletter, April 2009
This article provides an overview of the Blizzard Challenge, an annual speech synthesis event.
The Blizzard Challenge is a workshop where speech synthesis experts get together to talk about and compare their latest innovations. It has taken place annually since 2005, mostly as a satellite event of Interspeech, but in 2007 under the auspices of the 6th Speech Synthesis Workshop. The first two events were run by CMU in Pittsburgh. More recently, CSTR (University of Edinburgh) has taken over as organizer.
Blizzard was inspired in part by the successful comparisons done for ASR in the DARPA workshops of the 1990s. A second motivation was the difficulty of comparing TTS systems directly. For data-based systems it's hard to know which improvements are due to the data and which are due to the synthesis techniques. Commercial systems in particular are often difficult to evaluate due to lack of information or availability. Any comparisons that can be carried out are often complicated and multi-faceted.
Listener testing is generally agreed to be the only accepted method currently available for evaluating TTS systems. There are no automatic methods that can be relied on, and in fact it is probably not controversial to say that, rather paradoxically, the builders of a TTS system are the worst people to evaluate it!
The organizers of the first workshop were well aware of these and other challenges, and set out to establish a framework for comparison that could be refined as experience with testing grew. Another objective was simply to bring the TTS research community together.
Their first goal was to standardize the databases used by the participants and so control at least one of the variables in the experiment. The data originally chosen came from the set of CMU ARCTIC American English databases. One of the virtues of these particular databases is that they are available for anyone to use. Over the past few years the databases have changed and grown in size, and the most recent challenge used both English and Mandarin databases.
Providing data in this way does presume that system builders need it. The challenge therefore has an inbuilt bias towards so-called data-based synthesis, although the organizers encourage innovation. One interesting discussion that arose was about whether it was fair to use a high quality synthesizer combined with voice conversion. This has been resolved in the 2009 event by allowing any data to be used for system building, although the output should sound like the speaker of the database provided to participants.
Participants are given some help. For each version of the challenge some sort of database labeling is provided to help with building a system. After a suitable time interval has passed -- to allow processing of the databases -- the participants are given previously unknown text examples and asked to provide synthesis. The intent is to make it very difficult to optimize a system for the synthesis to be evaluated by listeners. Each team is required to submit synthesis examples to the organizers by a cutoff date for evaluation.
In some ways evaluating the submissions is the most difficult part of the whole process. There are no hard rules about what to evaluate or what kind of test to use to evaluate it. There are typically tests for intelligibility or for naturalness. Synthesized semantically unpredictable sentences are often used in the evaluations. The challenge organizers have the responsibility for designing the tests, recruiting listeners and then analyzing and evaluating the results.
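As a rough illustration of how listener responses might be turned into numbers, here is a minimal Python sketch. It is not a description of the scoring actually used in the Blizzard Challenge. It assumes that intelligibility is scored as the word error rate of what a listener typed against the text that was synthesized, and that naturalness is scored as the mean of numeric listener ratings per system. The function names, the rating scale and the example sentence are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def word_error_rate(reference: str, transcript: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: intelligibility is measured by comparing the listener's
    typed transcript with the text that was actually synthesized.
    """
    ref = reference.lower().split()
    hyp = transcript.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j words of the transcript.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from transcript
                curr[j - 1] + 1,     # extra word in transcript
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def mean_ratings(ratings):
    """Average naturalness rating per system.

    `ratings` is an iterable of (system_name, score) pairs; the scale
    (for example 1 to 5) is an assumption, not taken from the article.
    """
    by_system = defaultdict(list)
    for system, score in ratings:
        by_system[system].append(score)
    return {system: mean(scores) for system, scores in by_system.items()}


if __name__ == "__main__":
    # Hypothetical semantically unpredictable sentence and listener response.
    sentence = "the table walked through the blue truth"
    typed = "the cable walked through blue truth"
    print(word_error_rate(sentence, typed))
    print(mean_ratings([("A", 4), ("A", 2), ("B", 5)]))
```

Semantically unpredictable sentences suit this kind of scoring because listeners cannot guess a missed word from context, so the error rate reflects what they actually heard.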
With the current level of participation (2008 had 16 English, 11 Mandarin entries) it is challenging to find a sound testing methodology. Evaluation is conducted over the web, and listeners volunteer or in some cases are paid a small sum. One critical factor seems to be that listeners have only limited patience for testing. Another concern is how to deal with the very diverse backgrounds that listeners have -- they are from all over the world with varying levels of English (and presumably now Mandarin) proficiency.
Results are presented as one of the papers in a special session at the conference or satellite workshop. Several earlier challenges have been very interesting in that a previously less-noticed TTS system or approach instantly achieved a much higher profile on the basis of a good result.
The 2009 Blizzard Challenge is currently under way and the indications are that the workshop (to be held in Edinburgh, Scotland) following Interspeech in Brighton, UK, will be more popular than ever, and not just because of the venue (disclosure: the author is Scottish).
The workshop is open to non-participants. New participants in Blizzard are always welcome and the entry fee is modest.
Web page link: Blizzard Challenge
If you have comments, corrections, or additions to this article, please contact the author: Alistair Conkie, adc [at] research [dot] att [dot] com.



