Multimodal Voice Search of TV Guide in a Digital World
Harry Chang
SLTC Newsletter, September 2009
This article discusses the characteristics of the language models associated with multimodal search applications for digital TV guides, looking both at the linguistic properties of the guides' text content and at the spoken or typed queries that users issue.
Overview
Digital TV guides are one of the most popular sources of content on the web. Millions of Internet users access their favorite online TV guides regularly. Most of these online TV guides are organized as a 2-dimensional grid along channel and time axes, similar to the look and feel of an electronic programming guide (EPG) on TV. In a desktop environment, a computer mouse and full-sized keyboard enable quick and easy selection of the shows in an online TV guide and allow a user to navigate between different search sub-categories. However, when the search interface moves from a desktop environment to television viewing in home settings, we find interesting opportunities for speech-driven multimodal search applications.
For much of television's history, a traditional TV remote control (RC) with 40+ buttons was sufficient for channel surfing. With the advent of broadcasting networks and the introduction of IPTV technology, the number of available channels soared from 50 or so just 10 years ago to over 500 today. Fueled by the explosive growth of video content over the Internet, average TV viewers now face the daunting task of browsing through vast repositories of on-demand content such as pay-per-view movies and sporting events. The search task can become highly complex and frustrating with the limited capabilities of a traditional handheld RC. Adding a voice search modality to an RC device, so that viewers can easily express search intent with spoken words, is a natural and intuitive evolution in multimodal search interfaces for EPGs and other IPTV content.
One of the first steps in building an effective language model for a spoken language understanding (SLU) application is analyzing the language characteristics of the target content. Specifically, we need to examine the relationship between the written descriptions found in the EPG of interest and the referring expressions used by viewers in their spoken or typed queries to interactive media search engines. A simple method of analyzing the linguistic properties of written content in an EPG is to examine the usage of words/phrases, such as their frequency f, their relative rank r, and the relationship between the two as described by George Zipf [1] sixty years ago in the following mathematical function, where c is a constant for the corpus: f × r = c.
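The rank-frequency analysis described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the usual statement of Zipf's law in which the product of frequency f and rank r stays close to a corpus constant c. The function names and the toy titles are hypothetical and are not drawn from the EPG corpus discussed in this article.

```python
from collections import Counter


def rank_frequency(titles):
    """Return (rank, word, frequency) tuples, most frequent word first."""
    counts = Counter(word for title in titles for word in title.lower().split())
    # most_common() sorts by descending frequency; ties keep first-seen order.
    return [
        (rank, word, freq)
        for rank, (word, freq) in enumerate(counts.most_common(), start=1)
    ]


def zipf_products(table):
    """Compute f * r per row; under Zipf's law these stay near a constant c."""
    return [freq * rank for rank, _word, freq in table]


# Hypothetical toy titles (an assumption for illustration only).
titles = ["the late show", "the early show", "the office", "late night news"]
table = rank_frequency(titles)
products = zipf_products(table)

for (rank, word, freq), product in zip(table, products):
    print(f"r={rank} word={word} f={freq} f*r={product}")
```

On a corpus this small the products vary widely; the comparison only becomes meaningful on a large corpus, where departures from a roughly constant f × r mark the regions in which the law breaks down.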
Linguistic Properties of EPG
While typical EPG data sources contain many text categories such as titles, descriptions, channel names, cast names, and so on, the title texts always occupy the largest screen space on TVs as well as in printed media such as the TV section in local newspapers. It is reasonable to assume that the title texts have the most significant influence on users' vocabulary. To study EPGs from this perspective, the speech technology researchers at AT&T Labs created a title corpus from an EPG covering ten TV markets in the U.S. over a 13-month period. In this article, the corpus is referred to as EPG10. Figure 1 shows the sentence length distribution of EPG10, which has an average sentence length of 3.1 words.
Figure 2 shows the Zipf curve versus the actual frequency data of the n-grams (n <= 5) extracted from EPG10. Consistent with what others have reported [2], Zipf's law clearly does not hold for the words in the top tier (r <= 135) or for the words in the bottom tier (r > 3700) of EPG10. The sharp drop-off at the end of the Zipf curve indicates that many low-frequency words are tightly clustered together. A closer analysis shows that many such clusters have over 50 members with the same rank order.
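A rough sketch of how such n-grams and rank clusters could be computed is given below. It assumes that n-grams are counted within each title (never across title boundaries) and that n-grams with equal frequency share the same rank; neither detail is specified in the article. The helper names and the sample titles are hypothetical.

```python
from collections import Counter, defaultdict


def extract_ngrams(titles, max_n=5):
    """Count every word n-gram (1 <= n <= max_n) found inside each title."""
    counts = Counter()
    for title in titles:
        words = title.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return counts


def frequency_clusters(counts):
    """Group n-grams by frequency; equal frequency is treated as a tied rank."""
    clusters = defaultdict(list)
    for ngram, freq in counts.items():
        clusters[freq].append(ngram)
    return dict(clusters)


# Hypothetical toy titles (an assumption for illustration only).
titles = ["the late show", "the early show", "the office"]
counts = extract_ngrams(titles)
clusters = frequency_clusters(counts)

for freq in sorted(clusters, reverse=True):
    print(f"frequency {freq}: {len(clusters[freq])} n-gram(s)")
```

Even in this tiny example most n-grams fall into the frequency-1 cluster, which mirrors the tight clustering of low-frequency items described above.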
Experiments
To study the user language model derived from spoken queries for EPG content, a multimodal search prototype was built and given to a small group of company employees to try out in a realistic home television-watching environment. A total of 606 voice input sessions (1,112 spoken words) were recorded from the users. The average sentence length of the spoken expressions is 2.8 words. When expressions containing an actor's name are excluded, the users’ language model is almost entirely made up of title words.
To get a broader perspective on the user language models for the same domain, a much larger data set of typed queries was collected from a commercial website used exclusively for searching the EPG offered by a TV service provider. The data collection took place during a 2-week period in December 2008, from an estimated 10,000 users in the U.S. The corpus contains about half a million typed queries with a vocabulary of 21,111 unique words. The analysis shows a similar influence of the EPG on the users’ query language. After excluding the 1-word queries matched to common 1-word program titles (e.g., Seinfeld, Lost, Heroes, or Frasier), the average sentence length is 2 words. Most interestingly, over 95% of all word tokens in the corpus are within the vocabulary of EPG10, namely, all the title words in an EPG for regular TV programs.
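The two measurements used in this comparison, average query length and the share of query word tokens covered by the title vocabulary, can be sketched as follows. This is an illustrative outline only: whitespace tokenization and lowercasing are assumptions, and the function names and sample data are hypothetical rather than taken from the study.

```python
def vocabulary(titles):
    """Collect the set of distinct lowercased words used in the titles."""
    return {word for title in titles for word in title.lower().split()}


def token_coverage(queries, vocab):
    """Return the fraction of query word tokens that appear in vocab."""
    tokens = [word for query in queries for word in query.lower().split()]
    if not tokens:
        return 0.0
    in_vocab = sum(1 for word in tokens if word in vocab)
    return in_vocab / len(tokens)


def average_length(sentences):
    """Return the mean number of words per non-empty sentence."""
    lengths = [len(s.split()) for s in sentences if s.split()]
    return sum(lengths) / len(lengths) if lengths else 0.0


# Hypothetical toy data (an assumption for illustration only).
titles = ["the late show", "the office", "lost"]
queries = ["late show", "office tonight", "lost"]

vocab = vocabulary(titles)
coverage = token_coverage(queries, vocab)
mean_len = average_length(queries)

print(f"coverage={coverage:.2f} average query length={mean_len:.2f}")
```

Coverage here is computed over word tokens rather than unique words, which matches the way the 95% figure is phrased above.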
Summary
The experimental results suggest a strong influence of online content on users’ choice of vocabulary when querying EPGs via spoken or typed phrases. The words in the title texts dominate the users’ query vocabulary. It is also interesting to note that the average query sentence length from the user population is very close to the average sentence length of the underlying corpus, which represents word content written by a small group of professional writers. However, the low-frequency n-grams from the title corpus (EPG10) do not follow Zipf’s law, making it difficult to model the relationship between their rank order and corresponding frequency. A future study will aim to understand how the n-gram title texts whose frequency ranking falls in the bottom tier of the Zipf curve may influence the user’s query language, so that we can build more effective models for SLU-based search applications for EPGs, where the underlying text content is updated daily.
Acknowledgements and more information
Thanks to Bernard Renger and Michael Johnston for their collaboration in the research work that led to this article.
References
[1] G.K. Zipf. 1949. Human Behavior and the Principle of Least Effort. Addison-Wesley.
[2] L.Q. Ha, E.I. Sicilia-Garcia, J. Ming, and F.J. Smith. 2002. Extension of Zipf’s Law to Words and Phrases. In Proc. of the 19th International Conference on Computational Linguistics, Volume 1.
If you have comments, corrections, or additions to this article, please contact the author.
Harry Chang is Lead Member of Technical Staff at AT&T Labs Research. His interests are multimodal dialog systems and IPTV content search. Email: harry_chang@labs.att.com

