An Overview of Translingual Automatic Language Exploitation System (TALES)
Tara N. Sainath
SLTC Newsletter, October 2011
Over the past few years, IBM Research has been actively involved in a project known as the Translingual Automatic Language Exploitation System (TALES). The objective of the TALES project is to translate news broadcasts and websites from foreign languages into English. TALES is built on top of the IBM Unstructured Information Management Architecture (UIMA) platform. In this article, we provide an overview of the TALES project and highlight some of its new research directions in more detail.
Overview of TALES
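TALES allows foreign-language news broadcasts and websites to be accessed and indexed by English speakers. An example of the TALES system is shown in Figure 1. In the past, this project has focused on four main research areas: speech-to-text, named entity information extraction, machine translation, and presentation of the translated transcription.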
Figure 1: Example of TALES system translating a Foreign Broadcast into English.
Speech-to-Text
Speech-to-text is the problem of recognizing spoken words (for example, from broadcast news dialogues) and converting them into text; in essence, this is the main problem in automatic speech recognition (ASR). Challenges in this area included recognizing speech in noisy environments and processing speech with different accents. Better speaker and dialect detection engines helped to address these issues.
Named Entity Information Extraction
Named entities are spans of text corresponding to real-life entities, including persons, organizations, locations, time expressions, quantities, monetary values, percentages, etc. Special handling of named entities is vital for a good translation system. The TALES project looked at handling a wide variety of incoming texts, for example speech recognition output, which is upper-case and has no punctuation.
Machine Translation
Machine translation (MT) is the problem of translating text from one language to another, for example from French to English. The MT model was developed over many years, and IBM itself pioneered statistical machine translation in the early 1990s. Initial challenges in the MT area included dealing with words not found in standard dictionaries, as well as translating named entities properly. To address these issues, an MT model was developed that received daily updates based on user feedback about phrases that could not be translated correctly, for example phrases containing out-of-vocabulary words.
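As a rough illustration of the feedback step, the sketch below collects words from user-reported phrases that are missing from the translation vocabulary, so that they can be queued for the next model update. It is a minimal sketch under stated assumptions, not the TALES implementation: the function name `collect_oov`, the whitespace tokenization, and the toy vocabulary are all hypothetical.

```python
from collections import Counter

def collect_oov(phrases, vocabulary):
    """Count words in reported phrases that are not in the vocabulary.

    Hypothetical helper: tokenization is plain whitespace splitting,
    which a real system would replace with a language-specific tokenizer.
    """
    counts = Counter()
    for phrase in phrases:
        for word in phrase.lower().split():
            if word not in vocabulary:
                counts[word] += 1
    # Most frequently reported out-of-vocabulary words come first.
    return counts.most_common()

# Toy data (assumed for illustration only).
vocabulary = {"le", "ministre", "visite", "est", "loin"}
feedback = ["le ministre visite tombouctou", "tombouctou est loin"]
oov = collect_oov(feedback, vocabulary)
```

Ranking by frequency is one plausible way to prioritize which missing words to add first; the article does not say how updates were actually prioritized.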
Presentation of Translated Transcription
At the heart of any end-user system is the need to display appropriate information to the user. The TALES project displays translated transcriptions as subtitles/closed captions together with the video, using intelligent segmentation and timing algorithms for effective display. In addition, a Web translation tool displays translations as soon as they become available, rather than waiting for an entire page to be translated.
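To make the segmentation and timing problem concrete, here is a minimal sketch of one simple strategy: break the translated text into caption lines at word boundaries, then divide the available time span among the lines in proportion to their length. The article does not describe the algorithms TALES actually uses, so the function `segment_captions`, the `max_chars` limit, and the proportional timing rule are all assumptions made for illustration.

```python
def segment_captions(text, start, end, max_chars=32):
    """Split text into caption lines and assign each a display interval.

    Lines break at word boundaries and hold at most max_chars characters
    (a single longer word gets its own line). Display time between start
    and end (in seconds) is shared in proportion to line length.
    """
    lines, current = [], ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)

    total_chars = sum(len(line) for line in lines)
    cues, t = [], start
    for line in lines:
        duration = (end - start) * len(line) / total_chars
        cues.append((round(t, 2), round(t + duration, 2), line))
        t += duration
    return cues

# Toy example: one translated sentence spoken over six seconds.
sentence = "the president arrived in the capital on monday for talks"
cues = segment_captions(sentence, 0.0, 6.0, max_chars=20)
```

Each cue is a `(start, end, line)` tuple. A production system would more likely align captions to word-level timestamps from the speech recognizer and prefer breaks at phrase boundaries.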
Much of this research has led to a successfully deployed TALES system, which lets users search foreign-language news, play back streaming video with English closed captioning, monitor live video with low latency, browse and translate foreign websites, etc. Currently, TALES supports Arabic, Chinese, Spanish and English.
Current Research Directions
One current research direction in the TALES project, inspired and sponsored by the DARPA GALE program, has focused on information extraction. Specifically, this allows users to search for information on people in broadcasts/websites, such as biographical information, where a person has been, acquaintances of the person, etc. This information is organized into various templates, and given a structured query by the user, the TALES system returns relevant information in an organized, detailed fashion.
An algorithm developed at IBM [1], which makes use of both statistical and rule-based systems, is used to return relevant information in response to a structured query. [1] explores the problem of identifying a collection of relevant sentences (i.e., snippets) in response to a structured query. Rule-based systems are a popular approach to the snippet selection problem, as they are easy to implement and perform well on small, structured tasks. However, if the data is noisy or large, rule-based systems are limited in performance. Alternatively, statistical approaches generalize much better to larger, noisier data sets, though they require a significant amount of data to train a system.
[1] explores combining the benefits of both rule-based and statistical systems. Specifically, a rule-based system is used to bootstrap the annotation of training data for a statistical system. The motivation for this is twofold. First, a rule-based system allows a working prototype to be designed quickly, which can guide the development of a statistical system. Second, a rule-based system can be used to filter large amounts of training data into a relevant subset that can be used by the statistical system. The proposed approach was compared against both statistical and rule-based systems alone on a snippet selection task, and was found to produce a better set of relevant queries than either approach alone.
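The bootstrapping idea can be sketched in a few lines: a hand-written rule labels raw sentences, and those labels are then used to train a statistical classifier. The code below is a toy illustration under stated assumptions, not the method of the cited work. The rule in `rule_label`, its cue words, the tiny corpus, and the choice of a bag-of-words Naive Bayes classifier as the statistical model are all hypothetical stand-ins.

```python
import math
from collections import Counter

CUES = {"born", "graduated", "married", "appointed"}  # hypothetical cue words

def rule_label(sentence, person):
    """Hypothetical rule: relevant if it names the person and has a cue word."""
    words = set(sentence.lower().split())
    return person.lower() in words and bool(words & CUES)

def train_naive_bayes(labeled):
    """Train a bag-of-words Naive Bayes model from (sentence, label) pairs."""
    word_counts = {True: Counter(), False: Counter()}
    class_counts = Counter()
    for sentence, label in labeled:
        class_counts[label] += 1
        word_counts[label].update(sentence.lower().split())
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, class_counts, vocab

def predict(model, sentence):
    """Return True if the model scores the sentence as relevant."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label in (True, False):
        # Log prior and log likelihoods, both with add-one smoothing.
        score = math.log((class_counts[label] + 1) / (total + 2))
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in sentence.lower().split():
            if word in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[label][word] + 1) / denom)
        scores[label] = score
    return scores[True] > scores[False]

# Step 1: the rule annotates raw sentences (no manual labeling).
corpus = [
    "smith was born in ohio",
    "smith graduated from yale",
    "smith was appointed minister",
    "the weather was cold today",
    "markets fell sharply on monday",
    "the match ended in a draw",
]
labeled = [(s, rule_label(s, "smith")) for s in corpus]

# Step 2: the statistical model is trained on the rule-generated labels.
model = train_naive_bayes(labeled)
```

On this toy data, the sentence "smith studied at yale" contains no cue word, so the rule rejects it, while the trained model accepts it because "smith" and "yale" occurred in rule-labeled relevant sentences. This is the kind of generalization beyond the rules that motivates the combination.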
The TALES system was a result of contributions by many researchers at IBM, including Salim Roukos, Radu Florian, Todd Ward, Leiming Qian and Jerry Quinn. Also, thank you to Radu Florian and Sasha Caskey of IBM Research for useful discussions related to the content of this article.
[1] D.M. Bikel et al., "Snippets: Using Heuristics to Bootstrap a Machine Learning Approach", in Handbook of Natural Language Processing and Machine Translation: DARPA Global Autonomous Language Exploitation, 2011.
If you have comments, corrections, or additions to this article, please contact the author: Tara Sainath, tsainath [at] us [dot] ibm [dot] com.
Tara Sainath is a Research Staff Member at IBM T.J. Watson Research Center in New York. Her research interests are mainly in acoustic modeling.