An Overview of Watson and the Jeopardy! Challenge
Tara N. Sainath
SLTC Newsletter, February 2011
Over the past few years, IBM Research has been actively involved in a project to build a computer system, known as Watson, to compete at the human championship level on the quiz show Jeopardy!. After four years of intense research, Watson can perform on the Jeopardy! show at the level of human expertise in terms of precision, confidence and speed. The official first-ever man vs. machine Jeopardy! competition will air on television February 14, 15, and 16.
A deep question-answering (QA) architecture and technology, DeepQA [1], was developed as part of the Jeopardy! challenge to handle QA problems of this kind. In this article, we describe the DeepQA architecture in more detail, focusing specifically on the speech and natural language processing (NLP) components of the system.
DeepQA Architecture
Watson's Sources
The first step in any QA architecture such as DeepQA is to create a content server which contains relevant information that can be used when answering a question. The content server is created offline in two steps. First, example questions from the Jeopardy! problem space are analyzed to produce a description of the types of questions that must be answered, as well as to characterize the domain of the questions. Given the large-vocabulary nature of a task such as Jeopardy!, sources such as encyclopedias, dictionaries, Wikipedia, news articles, etc., were analyzed to create the content server.
Second, a process called automatic corpus expansion was performed. In this process, a set of seed documents is identified, and related documents are retrieved from the web for each seed. Then a set of text nuggets extracted from the retrieved web documents is scored to see which nuggets are most informative for each seed document. The most informative nuggets are added to an expanded corpus.
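The expansion loop described above can be illustrated with a minimal sketch. This is not Watson's implementation: treating each sentence as a nugget, the token-overlap score, and the names tokenize, overlap_score, and expand_corpus are all assumptions made for illustration.

```python
import re


def tokenize(text):
    """Lowercase a string and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def overlap_score(nugget, seed):
    """Hypothetical informativeness score: fraction of nugget tokens that occur in the seed."""
    nugget_tokens = tokenize(nugget)
    if not nugget_tokens:
        return 0.0
    seed_vocab = set(tokenize(seed))
    hits = sum(1 for tok in nugget_tokens if tok in seed_vocab)
    return hits / len(nugget_tokens)


def expand_corpus(seed, web_documents, top_k=2):
    """Return the top_k nuggets most related to the seed document."""
    nuggets = []
    for doc in web_documents:
        # Assumption: one sentence equals one text nugget.
        sentences = re.split(r"(?<=[.!?])\s+", doc)
        nuggets.extend(s.strip() for s in sentences if s.strip())
    ranked = sorted(nuggets, key=lambda n: overlap_score(n, seed), reverse=True)
    return ranked[:top_k]


seed = "Mount Everest is the highest mountain on Earth."
web_documents = [
    "Everest is a mountain in the Himalayas. Tickets are on sale now.",
    "The highest point on Earth is the summit of Mount Everest.",
]
expanded = expand_corpus(seed, web_documents, top_k=2)
```

In this toy run the off-topic sentence about tickets scores lowest and is left out of the expanded corpus. A realistic scorer would at least discount common function words, which inflate raw overlap.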
Question Analysis
When a question is received in real-time by the DeepQA system, the first step in the process is question analysis. In this step, the system tries to understand the question itself and performs initial analysis to determine how the question should be further analyzed by other components in the system. A parser, tuned to address the Jeopardy! question phrasing, is used to perform various analyses on the question. Furthermore, additional NLP analysis components are used to process the question, including named entity detection, relation detection, semantic role labeling, etc. The analysis is particularly challenging because Jeopardy! questions can have many nuances, including imprecise information and extraneous information. Therefore, questions must be analyzed to determine what is more vs. less important, and then different search queries are formulated based on this analysis.
Search and Scoring
In the search component, hypotheses are generated by running the various search queries against the large corpus described above, which is roughly the size of 1 million books. The search process reduces the 1 million books to fewer than a hundred candidate passages. This process has been heavily optimized for speed and takes approximately 1 second.
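As a rough illustration of how a search step can narrow a large collection to a handful of passages, the sketch below builds a small inverted index and ranks passages by the number of query terms they contain. The article does not say which search technology Watson uses, so the inverted index, the scoring, and the names build_index and search are assumptions.

```python
import re
from collections import Counter, defaultdict


def build_index(passages):
    """Map each token to the set of passage ids that contain it."""
    index = defaultdict(set)
    for pid, text in enumerate(passages):
        for tok in set(re.findall(r"[a-z]+", text.lower())):
            index[tok].add(pid)
    return index


def search(query, passages, index, top_k=3):
    """Return up to top_k passages sharing the most distinct terms with the query."""
    hits = Counter()
    for tok in set(re.findall(r"[a-z]+", query.lower())):
        for pid in index.get(tok, ()):
            hits[pid] += 1
    return [passages[pid] for pid, _ in hits.most_common(top_k)]


passages = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "Bananas are yellow.",
]
index = build_index(passages)
results = search("capital France", passages, index, top_k=2)
```

Only passages that share at least one term with the query are ever touched, which is the property that keeps lookups fast as the collection grows.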
Next, the passages are analyzed by our NLP components in much the same way the question was, and a set of candidate answers is extracted. The candidates and the analyzed passages are then passed to a scoring component, which contains over 50 scorers that score different aspects of how well a proposed answer matches what the question was asking. These aspects could include temporal scoring, spatial relation scoring, categorical scoring, etc. The scores from the different scorers are combined to find the top hypothesis from the set of candidate answers.
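One simple way to combine several scorer outputs into a single ranking is a weighted sum passed through a logistic function. The sketch below assumes this scheme purely for illustration; the scorer names, weights, and candidate values are invented, and the article does not specify how DeepQA combines its scores.

```python
import math


def combine_scores(feature_scores, weights, bias=0.0):
    """Map per-scorer values to a single confidence in (0, 1)."""
    z = bias + sum(weights[name] * value for name, value in feature_scores.items())
    return 1.0 / (1.0 + math.exp(-z))


def rank_candidates(candidates, weights):
    """Return (answer, confidence) pairs sorted from most to least confident."""
    scored = [
        (answer, combine_scores(features, weights))
        for answer, features in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical weights and scorer outputs for two candidate answers.
weights = {"temporal": 1.0, "spatial": 0.5, "categorical": 2.0}
candidates = {
    "Chicago": {"temporal": 0.2, "spatial": 0.9, "categorical": 0.8},
    "Toronto": {"temporal": 0.1, "spatial": 0.3, "categorical": 0.1},
}
ranked = rank_candidates(candidates, weights)
top_answer, top_confidence = ranked[0]
```

Because the result is a confidence rather than just an ordering, a system built this way could also decide whether the top hypothesis is strong enough to attempt an answer at all.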
Answer Generation
Once the textual answer is formulated, the computer then "speaks" the answer. This is handled by a real-time unit-selection concatenative text-to-speech (TTS) engine, consisting of a language-dependent text-processing front-end and a back-end that handles unit search and waveform generation. One of the challenges in designing the TTS system for Jeopardy! is the large, open-ended vocabulary of the task, as most TTS systems are customized for applications requiring a narrower vocabulary. A significant portion of this large vocabulary consists of words of foreign origin that do not conform well to the letter-to-sound rules of English and therefore required special attention. Another challenge was text normalization. Examples include commonly occurring Roman numerals, as well as idiosyncratic punctuation that is often seen in Jeopardy! categories, such as word-internal quotes, multiple dashes, etc. Homograph disambiguation is yet another issue that the system needs to contend with: the answers to the questions are usually fairly short, so the system has access to little or no disambiguating context to decide between alternative pronunciations for a given spelling.
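To make the Roman-numeral normalization problem concrete, here is a minimal sketch of one possible front-end rule. It is an assumption rather than the rule set used in Watson's TTS engine: it rewrites well-formed, all-capital numerals of two or more letters as digits and leaves everything else alone.

```python
import re

ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

# Candidate tokens: whole words of two or more Roman-numeral letters.
# Single letters are skipped so that the pronoun "I" is never rewritten.
TOKEN_PATTERN = re.compile(r"\b[IVXLCDM]{2,}\b")

# Strict shape of a well-formed numeral between 1 and 3999.
STRICT_PATTERN = re.compile(
    r"M{0,3}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})"
)


def roman_to_int(numeral):
    """Convert a well-formed Roman numeral to an integer."""
    total = 0
    for current, following in zip(numeral, numeral[1:] + " "):
        value = ROMAN_VALUES[current]
        # A smaller value before a larger one is subtracted (e.g. IV = 4).
        if ROMAN_VALUES.get(following, 0) > value:
            total -= value
        else:
            total += value
    return total


def _replace(match):
    token = match.group(0)
    if STRICT_PATTERN.fullmatch(token):
        return str(roman_to_int(token))
    return token  # e.g. "CIVIL" is not a numeral and stays unchanged


def normalize_roman_numerals(text):
    """Rewrite well-formed Roman numerals in text as Arabic digits."""
    return TOKEN_PATTERN.sub(_replace, text)


normalized = normalize_roman_numerals("Henry VIII and Louis XIV")
```

This rule is deliberately naive. Capitalized words and abbreviations such as "MIX" or "DC" are also well-formed numerals, and a spoken rendering must further choose between cardinal and ordinal readings ("Henry the Eighth"), so a practical front-end needs surrounding context to decide.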
In order to design a robust TTS system, researchers searched through records of previous games to identify and rank salient topics according to frequency of occurrence, and then used these to consult on-line and other sources (Wikipedia, dictionaries, etc.) and assemble a representative vocabulary for each topic.
Human listeners spent more than a year listening to the TTS outputs to identify and correct errors. Related groups of errors that were more systemic were corrected by improving the set of rules in the text-processing front-end. More exceptional cases (such as foreign words that violate the phonotactics of English) were addressed by adding them to an exceptions dictionary (or, more generally, context-dependent or phrase dictionaries) that is consulted before the standard text-processing rules of the front-end are applied.
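The lookup order described above (exceptions first, general rules second) can be sketched as follows. The dictionary entry, its respelling, and the placeholder rule function are hypothetical; a real front-end would produce phoneme sequences rather than readable respellings.

```python
def rule_based_pronunciation(word):
    """Placeholder for letter-to-sound rules: spells the word letter by letter."""
    return " ".join(word.lower())


def pronounce(word, exceptions):
    """Consult the exceptions dictionary first, then fall back to the rules."""
    key = word.lower()
    if key in exceptions:
        return exceptions[key]
    return rule_based_pronunciation(word)


# Hypothetical entry with an illustrative respelling.
exceptions = {"tchaikovsky": "chy-KOF-skee"}

from_dictionary = pronounce("Tchaikovsky", exceptions)
from_rules = pronounce("cat", exceptions)
```

Keeping exceptions in data rather than in the rules means a listener-reported error for a single word can be fixed by adding one entry, without risking regressions in the general rule set.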
Future
The DeepQA system has certainly pushed the state of the art in open-domain QA systems, in terms of depth, breadth and robustness. This QA architecture is now being extended to other applications with scenarios similar to Jeopardy!, such as call center and medical domain applications.
Acknowledgements
Thank you to Jennifer Chu-Carroll, Raul Fernandez and Bhuvana Ramabhadran of IBM T.J. Watson Research Center for useful discussions related to the content of this article.
For further information about the project, please visit the following website:
http://www-03.ibm.com/innovation/us/watson/index.shtml
References
[1] D. Ferrucci et al., "Building Watson: An Overview of the DeepQA Project," Association for the Advancement of Artificial Intelligence, Fall 2010.
If you have comments, corrections, or additions to this article, please contact the author: Tara Sainath, tsainath [at] us [dot] ibm [dot] com.
Tara Sainath is a Research Staff Member at IBM T.J. Watson Research Center in New York. Her research interests are mainly in acoustic modeling. Email: tsainath@us.ibm.com

