Spoken Document Summarization - An Overview
Annie Louis
SLTC Newsletter, January 2009
Efficient organization, retrieval and convenient browsing of multimedia content are attractive applications today. A large proportion of multimedia documents involve speech: news broadcasts, meetings, interviews, technical presentations, movies and lectures. Summaries of such spoken documents will be an integral part of the browsing interface, facilitating search, indexing and retrieval. These summaries can be either text or a selection of key audio snippets from the documents. Not surprisingly, spoken document summarization has generated a lot of interest lately.
Automatic text summarization has made great strides, thanks to decades of research in this area. We have a fairly good understanding of the techniques and evaluation methods that work best for both single-document and multi-document summarization. Methods have been developed for texts ranging from newswire articles to scientific literature, and from biographies to online blogs and multilingual documents. Problems like reducing redundancy and organizing summary sentences in a readable format have been examined, and improvements are constantly being hypothesized, tested and accepted. Content selection for summaries has also been designed to cater to specific queries and user needs.
Speech summarization has evolved more slowly and poses a different set of challenges [3, 6]. Written text usually has a clear organization of content into titles, sentences and paragraphs. Speech transcripts, on the other hand, differ significantly from written text in structure. Speech disfluencies and automatic speech recognition (ASR) errors propagate into transcripts. There is no paragraph information, and segmentation into utterances is not trivial. Moreover, as one moves from monologue to dialogue, the approach to summarization needs to change significantly.
Consider meetings, for example. A summary must be similar to the minutes of the meeting: it should give the proposals, plans, agreements, disagreements and decisions reached during its course. Identifying these items of content in conversational speech needs special techniques based on speech acts, the identity of the speaker, turn taking, and the acoustic and prosodic characteristics of the utterances [4, 5]. Apart from content selection, we must devise evaluation measures and also tackle the problem of producing coherent summaries. In text summarization, a common approach is to select important sentences from the input and present them in some order. Utterances chosen from a dialogue can be more difficult to combine into a coherent audio summary, because they may come from different participants and vary in acoustic properties.
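The sentence-selection approach mentioned above can be illustrated with a minimal sketch. It is not a method from any of the cited work: the word-frequency scoring, the small stopword list, and the function names (`split_sentences`, `tokenize`, `summarize`) are all assumptions chosen for illustration. It also assumes clean, punctuated text, which is exactly what a speech transcript often lacks.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real systems use larger ones).
STOPWORDS = {"the", "a", "an", "was", "is", "are", "we", "to", "of", "and", "in"}

def split_sentences(text):
    # Naive splitter that relies on sentence-final punctuation.
    # For speech transcripts this step would have to be replaced by
    # utterance segmentation, which is not trivial.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    # Lowercase word tokens; punctuation is dropped.
    return re.findall(r"[a-z']+", sentence.lower())

def summarize(text, k=2):
    """Return the k highest-scoring sentences in their original order."""
    sentences = split_sentences(text)
    # Count content words over the whole input.
    freq = Counter(
        w for s in sentences for w in tokenize(s) if w not in STOPWORDS
    )

    def score(sentence):
        # Average input frequency of the sentence's content words.
        words = [w for w in tokenize(sentence) if w not in STOPWORDS]
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    # Rank sentence indices by score, keep the top k, then restore input order.
    ranked = sorted(
        range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True
    )[:k]
    return [sentences[i] for i in sorted(ranked)]

if __name__ == "__main__":
    text = (
        "The budget was approved. We like coffee. "
        "The budget covers the new budget plan."
    )
    for sentence in summarize(text, k=2):
        print(sentence)
```

Presenting the selected sentences in input order is the simplest possible ordering strategy. For dialogue, both the selection features and the ordering step would need to account for speakers and turn taking, as discussed above.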
A similar situation arises with other aspects of spoken document processing, such as named entity recognition, information extraction, segmentation and topic analysis [2]. To make faster progress with multimedia processing, valuable technology transfer from the text domain needs to be augmented with speech techniques. Some techniques used for text summarization have been adapted to work with speech transcripts [8, 9], and evaluation metrics from the text domain also show promise for evaluating speech summaries [7]. To this end, it is necessary to actively seek opportunities for interaction and for knowledge and technology transfer between the text and speech communities.
The Document Understanding Conferences (DUC) have been conducted by NIST yearly since 2001. These started out as large-scale evaluation workshops for text summarizers and, starting in 2008, have been extended with Textual Entailment and Question Answering tracks as well. DUC is now part of the Text Analysis Conference (TAC), which has a wider audience. At the planning session of TAC 2008, a proposal (from the International Computer Science Institute, Berkeley) to include a meeting summarization task was positively received by many participants as a new and exciting challenge. If it is accepted as a track in future TAC workshops, we can expect faster developments in resource building, an opportunity to compare methods on common test sets, standardization of evaluation techniques and a sense of a larger community.
[1] I. Mani and M. T. Maybury, editors, "Advances in Automatic Text Summarization", MIT Press, 1999.
[2] L. Lee and B. Chen, "Spoken Document Understanding and Organization", IEEE Signal Processing Magazine, September 2005.
[3] K. Zechner, "Summarization of Spoken Language - Challenges, Methods, and Prospects", Speech Technology Expert eZine, Issue 6, January 2002.
[4] D. Hillard, M. Ostendorf and E. Shriberg, "Detection of agreement vs disagreement in meetings: training with unlabelled data", Proceedings of HLT/NAACL, 2003.
[5] E. Shriberg, A. Stolcke, D. Hakkani-Tur and G. Tur, "Prosody-Based Automatic Segmentation of Speech into Sentences and Topics", Speech Communication, September 2000.
[6] K. McKeown, J. Hirschberg, M. Galley and S. Maskey, "From text summarization to speech summarization", Proceedings of ICASSP, Special Session on Human Language Technology Applications and Challenges for Speech Processing, 2005.
[7] A. Nenkova, "Summarization evaluation for text and speech: issues and approaches", Proceedings of INTERSPEECH, 2006.
[8] A. Waibel, M. Bett and M. Finke, "Meeting browser: tracking and summarizing meetings", Proceedings of the DARPA Broadcast News Workshop, 1998.
[9] T. Kikuchi, S. Furui and C. Hori, "Two-stage automatic speech summarization by sentence extraction and compaction", Proceedings of the IEEE/ISCA Workshop on Spontaneous Speech Processing and Recognition, 2003.

