Speech recognition for live TV captioning
Filip Jurcicek
SLTC Newsletter, April 2009
Because TV captions are important to many types of viewers, a lot of effort has been put into live captioning. Thanks to recent progress in speech recognition, TV broadcasters such as the BBC and Czech Television have increased the number of captioned TV programs they provide. The BBC, a pioneer in the use of speech recognition for TV captioning, now captions 100% of its broadcast programs.
TV captioning is the process of transcribing and displaying the spoken portions of a television program. For pre-recorded programs, the captions can be prepared in advance. A human captioner listens to a program and transcribes all speech. Later, the transcription is edited, positioned on the screen, and aligned with the speech. However, this is tedious and demanding work which can take up to 16 hours for each hour of captioned programming [1]. Furthermore, for live programs such as sporting events, parliamentary broadcasts, and live news, the captions cannot be prepared in advance and highly skilled stenographers are needed. These stenographers must be capable of writing up to 200 words per minute and must have breaks every 15 minutes. Generally, it is very difficult and expensive to find enough skilled stenographers for a large event.
TV captioning is important to many types of TV viewers. It is essential for viewers with a hearing impairment, it helps non-native viewers improve their second-language skills, and it can serve as a cost-effective way of improving first-language literacy. In India, it was shown that watching TV with captions improved literacy skills [2], and according to Ofcom, the media regulator in the UK, out of the 7.5 million people who used TV subtitles in the UK, six million had no hearing impairment at all [3].
For many years, live events were captioned by stenographers only; however, automatic speech recognition technology has now progressed enough to be useful for live TV captioning. Automatic speech recognizers convert speech into a sequence of words. State-of-the-art techniques use statistical methods to train acoustic and language models. The acoustic models capture the probabilities of acoustic realizations of words, and the language models capture the probabilities of word sequences in the language. These models are estimated from hundreds of hours of recorded speech and tens of millions of sentences.
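As a rough illustration of how the two models interact, the sketch below trains a toy bigram language model with add-one smoothing and uses it to choose between competing transcriptions. It is a minimal sketch, not the method of any broadcaster mentioned here: the training sentences, the acoustic scores, and the names train_bigram_lm, lm_log_prob, and pick_best are all assumptions made up for the example, and a real recognizer uses far larger models and a proper search.

```python
import math
from collections import Counter


def train_bigram_lm(sentences):
    """Count bigrams and their histories over whitespace-tokenized sentences."""
    history_counts, bigram_counts = Counter(), Counter()
    vocab = {"</s>"}
    for sentence in sentences:
        words = sentence.lower().split()
        vocab.update(words)
        tokens = ["<s>"] + words + ["</s>"]
        history_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens[:-1], tokens[1:]))
    return history_counts, bigram_counts, vocab


def lm_log_prob(text, model):
    """Log probability of a word sequence under the add-one smoothed bigram model."""
    history_counts, bigram_counts, vocab = model
    vocab_size = len(vocab) + 1  # one extra slot for unseen words
    tokens = ["<s>"] + text.lower().split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens[:-1], tokens[1:]):
        numerator = bigram_counts[(prev, word)] + 1
        denominator = history_counts[prev] + vocab_size
        total += math.log(numerator / denominator)
    return total


def pick_best(hypotheses, model, lm_weight=1.0):
    """Return the hypothesis text maximizing acoustic score + weighted LM score."""
    return max(
        hypotheses,
        key=lambda h: h[1] + lm_weight * lm_log_prob(h[0], model),
    )[0]


# Toy training text (hypothetical); a real model uses millions of sentences.
corpus = [
    "the minister opened the debate",
    "the debate was opened by the minister",
    "the minister closed the debate",
]
model = train_bigram_lm(corpus)

# Hypothetical acoustic log scores: the second candidate sounds slightly
# more likely, but the language model finds its word sequence implausible.
hypotheses = [
    ("the minister opened the debate", -10.0),
    ("the minister opened the the bait", -9.8),
]
best = pick_best(hypotheses, model)
```

Here the acoustically similar candidate loses because the language model assigns a low probability to its word sequence, which is the role the language model plays in the description above.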
The BBC is a pioneer in using speech recognition for TV captioning. It started live captioning using speech recognition software in April 2001 [4]. Since May 2008, the BBC has been captioning 100% of its broadcast programs. This is only possible due to the large-scale use of speech recognition for live programs [5]. For example, the BBC Breakfast news and BBC News 24 programs are among those which are captioned with the help of speech recognition. At the BBC, speech recognition is also used for captioning pre-recorded programs. The captions are dictated and then manually edited if necessary; thus, the use of speech recognition significantly lowers the amount of time needed to prepare the captions.
A key challenge is producing live captions with minimum delay. The time between when the words are spoken and when they are shown on a screen has to be less than a few seconds; otherwise, the usability of such captions suffers. These real-time and low-latency constraints limit the techniques that can be used for live captioning. Because of this, live captioning using speech recognition typically involves a captioner listening to a live program and then re-speaking the coverage, rephrasing and simplifying the speech, and entering punctuation and speaker-change information. To further reduce the number of errors, the speech recognition program is adapted to the particular voice of the captioner who is doing the re-speaking. Re-speaking and adaptation are needed because, despite years of research into speech recognition, its performance is not perfect and is influenced by many factors. For example, speech recognition performance decreases with noise, background music, and multiple simultaneous speakers, and it is also affected by a speaker's intonation and phrasing.
The use of speech recognition for TV captioning is not limited to English. Since November 2008, the public service broadcaster in the Czech Republic, Czech Television, in cooperation with the University of West Bohemia in Pilsen, has provided live captions for the Czech Parliament broadcasts. They currently use an alternative approach to live captioning, as detailed by Josef Psutka, the head of the Department of Cybernetics at the University of West Bohemia: "Speech recognition is directly used to transcribe the spoken speech without re-speaking. Live audio is sent from Prague to Pilsen 5 seconds ahead of time of broadcasting over an ISDN telephone line. Immediately after the speech recognition, the produced captions are sent back. It is currently in public testing and viewers can turn on the live captions on the teletext page 888." The acoustic and language models were trained on approximately 100 hours of Czech Parliament broadcasts and the stenographic records of past parliament meetings.
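The timing in this setup can be sketched as a simple latency budget: because the audio arrives 5 seconds before it airs, a caption is ready in time whenever recognition plus the round trip takes no more than 5 seconds. The snippet below is only an illustration under that reading of the quote; the Caption class, the caption_offset function, and the processing times are hypothetical and are not taken from the Czech Television system.

```python
from dataclasses import dataclass

# From the article: audio is sent 5 seconds ahead of broadcast.
BROADCAST_LEAD_S = 5.0


@dataclass
class Caption:
    text: str
    audio_time_s: float   # when the words occur in the early audio feed
    processing_s: float   # hypothetical recognition + transmission time


def caption_offset(caption, lead_s=BROADCAST_LEAD_S):
    """Seconds between the words airing and the caption being ready.

    Zero or negative means the caption is ready by the time the words
    are broadcast; positive means the caption lags behind the speech.
    """
    ready_at = caption.audio_time_s + caption.processing_s
    airs_at = caption.audio_time_s + lead_s
    return ready_at - airs_at


# Hypothetical processing times, for illustration only.
on_time = Caption("first caption", audio_time_s=10.0, processing_s=3.5)
late = Caption("second caption", audio_time_s=20.0, processing_s=6.25)

on_time_offset = caption_offset(on_time)  # ready 1.5 s before the words air
late_offset = caption_offset(late)        # lags the speech by 1.25 s
```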
Because the quality of captions produced by re-speakers is generally higher, live captioning involving trained captioners using speech recognition is in preparation. Josef Psutka adds, "... we also plan to caption other live programs, e.g. discussions, sport events, ... ; however, this time we will use re-speakers. Currently, we collect and develop topic specific vocabularies and language models." Developing a language model for the Czech language is especially hard because Czech and other Slavic languages have a high degree of inflection and a large number of prefixes and suffixes. Therefore, the vocabulary for a Czech speech recognizer must be 8 to 10 times larger than that for an English recognizer. "Our speech recognizer uses a vocabulary comprising of more than 200,000 words. And still, the speech recogniser is able to operate in real-time," says Josef Psutka.
Finally, there are other areas of research connected to TV captioning, such as automatic translation of the captions into other languages. For example, YouTube offers machine translation of the captions included in videos into 41 languages [6], though at the moment it can only translate captions that were entered manually. However, the translation of live TV captions is more challenging than standard machine translation, especially if it is used together with speech recognition. Srinivas Bangalore, a researcher at AT&T Labs - Research, says: "The main challenge besides the translation modelling itself is that the translation of speech recognition output requires a robust handling of error from the speech recognizer. Moreover, it requires a real-time, low latency translation model which is a big challenge for widely differing language pairs, for example English - Japanese." Although machine translation does not achieve the quality of human-generated translation, it is still beneficial. Even in the worst case, the machine translation can provide an idea of the topic of discussion, and gaps in the translation are usually covered by the video.
[1] M.J. Evans: Speech Recognition in Assisted and Live Subtitling for Television, BBC R&D White Paper WHP 065, July 2003
[2] B. Kothari, A. Pandey, and A.R. Chudgar: Reading Out of the "Idiot Box": Same-Language Subtitling on Television in India, Information Technologies and International Development, Volume 2, Issue 1, September 2004
[3] Ofcom: Television access services - Summary, March 2006
[4] R. Griffiths: Giving voice to subtitling: Ruth Griffiths, director of Access Services at BBC Broadcast, July 2005
[5] BBC: Press Releases - BBC Vision celebrates 100% subtitling, May 2008
[6] Blog YouTube: Auto Translate Now Available For Videos With Captions, November 2008

