Following Global Events with IBM Translingual Automatic Language Exploration System (TALES)
Leiming Qian and Imed Zitouni
SLTC Newsletter, May 2010
This article describes IBM TALES, a multilingual, multimodal research prototype for monitoring foreign news media. TALES incorporates multiple natural language processing technologies from IBM, including speech-to-text, machine translation, text-to-speech, and information extraction.
Background
Modern communication technologies have made massive amounts of real-time news information in English and foreign languages readily available. This data takes the form of multilingual audio, video, and web content generated by broadcast media and social networking sites, and it is growing daily. This calls for effective, scalable, multilingual media monitoring and search solutions.
Several companies have developed systems based on Natural Language Processing (NLP) technologies to address the news monitoring task. Such a system must not only handle multimodal data but also cross the language barrier and process data with low latency. At IBM, we took a big step toward this goal with the development of the Translingual Automatic Language Exploration System, codenamed TALES.
Overview
TALES is a news monitoring system that allows English-speaking users to monitor English and foreign-language news media in near real-time and search over stored content. TALES captures multilingual TV news broadcasts and crawls multilingual websites daily. The collected data is passed through a series of natural language processing engines to extract metadata. More specifically, for audio/video data, the following engines are applied:
1. Speech-To-Text (STT) engine to create an original-language transcription of the media, including the timestamp for every recognized word [1].
2. Statistical Machine Translation (SMT) engine to produce an English transcription of the media (if the media is non-English), including token-level alignments between the source and target sentences [2] [3].
3. Text-To-Speech (TTS) engine to optionally produce an English sound track for the source media based on the English transcription generated in step 2 [4].
4. Information Extraction (IE) engine to identify entities such as persons, locations, organizations, and events, as well as relations between these entities. This analysis is done on both the source language and English [5].
5. Speaker and Gender Segmentation (SGS) engine to track speakers based on acoustic signatures and detect their gender [6].
6. Language and Dialect (LD) identification engine to label speech segments with language and dialect information [7].
7. All the metadata generated in the above steps is stored and indexed, making it available for search [8].
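As an illustration of how the per-word timestamps from the STT engine and the token-level alignments from the SMT engine might be combined, the following Python sketch projects source-word times onto English tokens, for example to time English closed captions. This is an assumption about one possible use of the metadata, not a description of the TALES implementation; the function name and the data layout are hypothetical.

```python
def project_times(src_times, alignment, n_target):
    """Project source-word time spans onto target (English) tokens.

    src_times: list of (start, end) in seconds, one per recognized source word.
    alignment: list of (src_index, tgt_index) pairs (hypothetical layout).
    n_target:  number of tokens in the English translation.
    Returns one (start, end) span per target token, or None if unaligned.
    """
    spans = [None] * n_target
    for src_idx, tgt_idx in alignment:
        start, end = src_times[src_idx]
        if spans[tgt_idx] is None:
            spans[tgt_idx] = (start, end)
        else:
            # A target token aligned to several source words covers all of them.
            prev_start, prev_end = spans[tgt_idx]
            spans[tgt_idx] = (min(prev_start, start), max(prev_end, end))
    return spans


# Three source words, three English tokens; token 2 has no alignment.
src_times = [(0.0, 0.4), (0.4, 0.9), (0.9, 1.5)]
alignment = [(0, 1), (1, 0), (2, 1)]
spans = project_times(src_times, alignment, 3)
```

Unaligned English tokens are left as None here; a real captioning component would need some policy for them, such as borrowing the time of a neighboring token.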
When dealing with web data, TALES first detects the page language [9] and then processes the page with the engines in steps 2, 4, and 7, omitting the audio/video-specific processing steps.
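The routing described above can be summarized in a short Python sketch. The engine abbreviations follow the article, while the `Item` type, the `plan_steps` function, and the labels `PAGE_LANG_ID` and `INDEX` are hypothetical names introduced for illustration. The sketch also assumes that translation is skipped for English web pages, as the article states for English media, and that TTS is skipped whenever there is no machine translation to read.

```python
from dataclasses import dataclass

# Steps for audio/video data, in the order listed in the article.
AV_STEPS = ["STT", "SMT", "TTS", "IE", "SGS", "LD", "INDEX"]
# Web pages: page language detection, then steps 2, 4, and 7.
WEB_STEPS = ["PAGE_LANG_ID", "SMT", "IE", "INDEX"]


@dataclass
class Item:
    modality: str           # "av" (audio/video) or "web"
    language: str           # source language code, e.g. "ar", "zh", "en"
    want_tts: bool = False  # TTS is optional in the article


def plan_steps(item):
    """Return the ordered list of engines to run for one item."""
    if item.modality == "av":
        steps = AV_STEPS
    elif item.modality == "web":
        steps = WEB_STEPS
    else:
        raise ValueError(f"unknown modality: {item.modality}")

    skip = set()
    if item.language == "en":
        # No translation needed, so there is nothing for TTS to read either.
        skip.update({"SMT", "TTS"})
    if not item.want_tts:
        skip.add("TTS")
    return [step for step in steps if step not in skip]
```

For example, `plan_steps(Item("web", "zh"))` yields the web-specific sequence, while an English broadcast skips both translation and speech synthesis.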
For end-users, TALES provides two ways to access the system:
- Media Monitoring: users can monitor news video by streaming foreign news media with machine-generated English closed captions at a latency of 4.5 minutes. Users can also listen to an English rendering of the foreign content using TTS. The closed captions also display additional metadata such as speaker, gender, language/dialect, and extracted named entities. TALES supports a dual-caption mode so that a bilingual speaker can view both closed-caption streams at the same time. End users can also upload their own media files in various formats (avi, wmv, flv, etc.) for processing, or request TALES to process files on a USB thumb drive or CD/DVD media.
- Media Search: multilingual media processed by the TALES system is immediately available for search by the end user, using either English keywords or keywords in the source language of the media. TALES supports a rich search syntax, including Boolean operators and filtering by language, date, and modality. Once a relevant news segment has been located, a user can look further at the details of the document by:
  - Using a storyboard mode for video segments, where important key frames from the video are displayed together with the associated transcription.
  - Viewing the cached web page and re-translating the page on demand with the page layout perfectly preserved.
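Since the reference list cites Apache Solr for storage and indexing [8], a keyword search with Boolean operators and filters could be expressed as a Solr query string. The Python sketch below only builds the query string; the field names (`text_en`, `language`, `modality`, `date`) are hypothetical, because the article does not describe the TALES index schema. The `q` and `fq` request parameters and the range syntax are standard Solr.

```python
from urllib.parse import urlencode


def build_search_params(keywords, language=None, modality=None, date_range=None):
    """Build a Solr query string with one main query and optional filters.

    keywords:   Boolean expression, e.g. "election AND Tehran".
    language:   hypothetical filter field value, e.g. "fa".
    modality:   hypothetical filter field value, e.g. "video" or "web".
    date_range: (start, end) pair of ISO 8601 UTC timestamps.
    """
    params = [("q", f"text_en:({keywords})")]
    if language:
        params.append(("fq", f"language:{language}"))
    if modality:
        params.append(("fq", f"modality:{modality}"))
    if date_range:
        start, end = date_range
        params.append(("fq", f"date:[{start} TO {end}]"))
    # urlencode accepts a list of pairs, so "fq" may repeat.
    return urlencode(params)


query_string = build_search_params(
    "election AND Tehran", language="fa", modality="video"
)
```

Each filter is sent as a separate `fq` parameter, which keeps the filters independent of the scored keyword query.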
In addition, TALES provides a web page translation tool named “TransBrowser” that allows end-users to seamlessly translate a web page and all its linked pages with perfect layout preservation, effectively enabling them to navigate foreign news websites as if they were in English. TransBrowser even allows the end-user to submit corrections to the machine translation to help improve the SMT engine.
TALES has been running 24x7 for about five years. At the moment it supports the following list of languages: English, Chinese (Simplified and Traditional), Modern Standard Arabic, Farsi, and Spanish. More languages are being added.
IBM is not the only company working on a news monitoring system. Similar systems from other companies include BBN Technologies’ Broadcast Monitoring System (BMS) [10], Autonomy’s Virage [11], and Volicon’s Observer [12].
To summarize, TALES is a multilingual, multimodal analytic system that lets English speakers collect, index, and access information contained in English and foreign-language news broadcasts and websites. TALES is built on top of the IBM Unstructured Information Management Architecture (UIMA) [13] platform and uses multiple IBM natural language technology components. TALES enables users to search English and foreign-language news, play back streaming video with English closed captioning, monitor live video with low latency, and browse and translate foreign websites. TALES has been deployed at multiple customer sites.
Acknowledgements
This work was partially supported by the Defense Advanced Research Projects Agency under contract No. HR0011-06-2-0001.
Screenshots

TALES Video Player with Dual-Caption Enabled

TALES Media Monitor, with a show being monitored

TALES Search UI with Embedded Video Player

TALES TransBrowser with Correction Mode
References
For more information, see:
- [1] G. Saon, D. Povey and G. Zweig, “Anatomy of an Extremely Fast LVCSR Decoder”, Proc. Eurospeech 2005, Lisbon, Portugal.
- [2] C. Tillmann and H. Ney, “Word Reordering and a Dynamic Programming Beam Search Algorithm for Statistical Machine Translation”, Computational Linguistics, v. 29, no. 1, pp. 97-133, 2003.
- [3] Y. Al-Onaizan and K. Papineni, “Distortion Models for Statistical Machine Translation”, Proceedings of 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, pp. 529-536, 2006.
- [4] J. F. Pitrelli, R. Bakis, E. M. Eide, R. Fernandez, W. Hamza, and M. A. Picheny, “The IBM Expressive Text-to-Speech Synthesis System for American English”, IEEE Transactions on Audio, Speech and Language Processing, v. 14, no. 4, pp. 1099-1108, July 2006.
- [5] R. Florian, H. Hassan, A. Ittycheriah, H. Jing, N. Kambhatla, X. Luo, N. Nicolov, and S. Roukos, “A Statistical Model for Multilingual Entity Detection and Tracking”, Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004, pp. 1-8.
- [6] J. Huang, E. Marcheret, K. Visweswariah and G. Potamianos, “The IBM RT07 Evaluation Systems for Speaker Diarization on Lecture Meetings”, chapter in Multimodal Technologies for Perception of Humans, Springer, 2008.
- [7] J. Navratil, “Automatic Language Identification”, chapter in “Multilingual Speech Processing”, Eds. T. Schultz & K. Kirchhoff, Academic Press, April 2006, ISBN-13: 978-0-12-088501-5, pp. 233-272.
- [8] Apache Solr: an open source enterprise search server based on the Lucene search library.
- [9] J. M. Prager, “Linguini: Language Identification for Multilingual Documents”, Proceedings of the Thirty-Second Annual Hawaii International Conference on System Sciences (HICSS), vol. 2, p. 2035, 1999.
- [10] BBN Technologies Broadcast Monitoring System
- [11] Autonomy Virage
- [12] Volicon Observer
- [13] D. Ferrucci and A. Lally, “UIMA: An Architectural Approach to Unstructured Information Processing in the Corporate Research Environment”, Nat. Lang. Eng., v. 10, no. 3-4, pp. 327-348, 2004.
Leiming Qian is a Senior Software Engineer at IBM T. J. Watson Research Center and the software architect for TALES. His interests include applying Natural Language Processing technology to solve real-life problems.
Imed Zitouni conducts research on a variety of NLP problems at IBM T. J. Watson Research Center. His research interests include language understanding, information extraction and machine translation on speech and text.

