Report of the Blizzard Challenge 2009 Workshop
Tomoki Toda
SLTC Newsletter, October 2009
This article reports on the Blizzard Challenge 2009 Workshop, an annual speech synthesis event, which took place on September 4th, 2009 in Edinburgh, UK.
The Blizzard Challenge was devised to better understand and compare research techniques for building corpus-based speech synthesizers on the same data. The basic challenge is to take the released speech database, build a synthetic voice from the data, and synthesize a prescribed set of test sentences. The sentences from each synthesizer are then evaluated through listening tests.
The Blizzard Challenge 2009 was the fifth annual Challenge. This year, in addition to the main tasks, called hub tasks, several other tasks, called spoke tasks, were conducted in two languages, UK English and Mandarin Chinese. The individual tasks were as follows:
- Hub tasks in UK English
  - EH1: build a voice from the full UK English database
    - A voice was built using all the data of a single UK English male speaker. The speech corpus size was about 15 hours.
    - Synthetic voices were submitted by 14 participants.
  - EH2: build a voice from the specified ARCTIC subset of the UK English database
    - A voice was built using only the ARCTIC subset, which was about 1 hour in size.
    - Synthetic voices were submitted by 15 participants.
- Spoke tasks in UK English
  - ES1: build voices from the specified small datasets
    - Voices were built using three small subsets of the ARCTIC subset: (1) only the first 10 sentences, (2) the first 50 sentences, and (3) the first 100 sentences.
    - Voice conversion, speaker adaptation, or any other technique could be used.
    - Synthetic voices were submitted by 6 participants.
  - ES2: build a voice suitable for synthesizing speech to be transmitted via a telephone channel
    - A voice suitable for synthesizing speech to be transmitted via a telephone channel was built using the full UK English database.
    - A telephone channel simulation tool was available to assist in system development.
    - Synthetic voices were submitted by 9 participants.
  - ES3: build a voice suitable for synthesizing the computer role in a human-computer dialogue
    - A voice suitable for synthesizing the computer role in a human-computer dialogue was built using the full UK English database.
    - A set of development dialogues was provided. The test dialogues were from the same domain.
    - Although participants could not change any of the words in the sentences to be synthesized, they were allowed to add simple markup to the text, either automatically or manually, as long as the markup could be provided by a text-generation system; e.g., emphasis tags would be acceptable, but a handcrafted F0 contour would not.
    - Synthetic voices were submitted by 2 participants.
- Hub task in Mandarin Chinese
  - MH: build a voice from the full Mandarin database
    - A voice was built using all the data of a young female professional radio broadcaster. The speech corpus size was about 10 hours.
    - Synthetic voices were submitted by 9 participants.
- Spoke tasks in Mandarin Chinese
  - MS1: build voices from the specified small datasets
    - Voices were built using three small subsets of the full Mandarin database: (1) only the first 10 sentences, (2) the first 50 sentences, and (3) the first 100 sentences.
    - Voice conversion, speaker adaptation, or any other technique could be used.
    - Synthetic voices were submitted by 5 participants.
  - MS2: build a voice suitable for synthesizing speech to be transmitted via a telephone channel
    - A voice suitable for synthesizing speech to be transmitted via a telephone channel was built using the full Mandarin database.
    - A telephone channel simulation tool was available to assist in system development.
    - Synthetic voices were submitted by 6 participants.
Several listening tests, including an opinion test on naturalness, an opinion test on similarity, and an intelligibility test, were conducted independently for each task. The following four benchmark systems were also evaluated: 1) natural speech; 2) Festival: a concatenative speech synthesis system based on unit selection; 3) HTS2005: a speaker-dependent HMM-based speech synthesis system; and 4) HTS2007: a speaker-adaptive HMM-based speech synthesis system. Nineteen teams from around the world took part in the challenge.
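The opinion and intelligibility tests described above lend themselves to simple summary statistics. The following Python sketch is illustrative only. It assumes that listeners rate each stimulus on a 1-to-5 scale and that intelligibility is scored by comparing a listener's typed transcription with the reference text. Neither detail is stated in this report, and the function names are hypothetical rather than taken from the Challenge's own tools.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Average listener ratings for one system (assumed 1-to-5 scale)."""
    if not ratings:
        raise ValueError("at least one rating is required")
    return mean(ratings)


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

A higher mean opinion score indicates a better-rated system, while a lower word error rate indicates more intelligible speech. In practice, such summaries would be computed per system and per task so that submitted voices can be compared with the benchmark systems.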
The Blizzard Challenge 2009 Workshop took place at the Centre for Speech Technology Research, the University of Edinburgh, as a satellite event of Interspeech 2009. There were around 70 attendees! (Surprisingly, this number was comparable to the attendance at the 5th ISCA Speech Synthesis Workshop (SSW5) in Pittsburgh in 2004!) The workshop started with an overview and summary of results given by Dr. Simon King. Each team then gave a 15-minute talk presenting its system. After the system presentations, the attendees held a general discussion about the Blizzard Challenge. The attendees continued their exchanges at a pub after the workshop. It was truly a wonderful day!
Corpus-based speech synthesis techniques have improved dramatically over the past several years. It may be no exaggeration to say that the Blizzard Challenges have contributed substantially to these improvements. We have learned a great deal from the challenges: we have seen the effectiveness of statistical parametric speech synthesis, such as HMM-based speech synthesis, which has tremendous potential to provide a very flexible synthesis framework; we have been reminded of the effectiveness of concatenative speech synthesis based on unit selection; and we have organized our thoughts about the relationship between these two main approaches to corpus-based speech synthesis. Current techniques enable the development of a general-purpose TTS system capable of synthesizing quite natural speech. Some speaking styles are also achieved well. However, these techniques still leave much to be improved. In particular, it is worthwhile to develop speech synthesis techniques that provide speaking styles appropriate to the demands of various speech applications. In this challenge, only 2 participants submitted voices for the ES3 task, which was to build a voice suitable for synthesizing the computer role in a human-computer dialogue. It is expected that more research activity will soon focus on the development of these techniques.
Acknowledgements:
- Thanks to Dr. Simon King of CSTR, University of Edinburgh, for providing detailed data on the Blizzard Challenge 2009.
For more information, see:
- Blizzard Challenge 2009 Workshop - SynSIG
- Blizzard Challenge 2009 - SynSIG
- Papers and Results of Previous Blizzard Challenges
- "The Blizzard Challenge," SLTC Newsletter, April 2009, by Alistair Conkie
If you have comments, corrections, or additions to this article, please contact the author: Tomoki Toda, tomoki [at] is [dot] naist [dot] jp.
Tomoki Toda is an Assistant Professor at the Graduate School of Information Science, Nara Institute of Science and Technology, Japan. His interests are statistical approaches to speech processing. Email: tomoki@is.naist.jp



