NIST Conducts Rich Transcription Evaluation
Satanjeev "Bano" Banerjee
SLTC Newsletter, October 2009
The National Institute of Standards and Technology (NIST) recently conducted the 2009 Rich Transcription Evaluation - a public exercise to evaluate the current state of the art in automatic speech transcription. It was the latest in a series of such evaluations that started in 2002. The ultimate goal of this series is to evaluate "rich transcription" - a process that not only converts audio into a stream of words, but also captures other information in the speech such as speaker identity, prosody and disfluencies. Rich transcription is expected to make speech transcriptions more useful both to humans and to downstream processes; it is likely to be particularly useful in the context of multi-party human-human dialog, such as meetings.
As a step towards the ultimate goal of rich transcription, the just-concluded evaluation exercise featured three tasks:
- Speech-to-Text: Converting the audio into a sequence of words.
- Speaker Diarization: Detecting the speaker(s) at all points of time during the meeting.
- Speaker-Attributed Speech-to-Text: Assigning each transcribed word to one of the speakers.
The audio used for evaluation consisted of meeting-room recordings only, unlike the previous edition of the exercise, held in 2007, in which data from broadcast news and conversational telephone speech were also used. The test audio was provided by the University of Edinburgh and the IDIAP Research Institute (60 minutes each of non-American English speech) and NIST (60 minutes of American English speech). As is typical of natural meetings, some of the data consisted of overlapping speech from multiple speakers; participants were required to transcribe the separate overlapping words. No limitation was placed on the time taken to produce the transcripts.
Speech-to-text output was evaluated using the Word Error Rate metric - the number of word tokens inserted, deleted or substituted by the system (as compared to the reference transcription), divided by the total number of tokens in the reference. Speaker diarization was evaluated using Diarization Error Rate - the percentage of the meeting time that was attributed to an incorrect speaker. Speaker-attributed speech-to-text was evaluated using Word Error Rate, with the added constraint that a word was deemed to be correctly transcribed only if it was attributed to the right speaker.
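To make the two word-level metrics concrete, here is a minimal sketch in Python. It is not NIST's scoring tool: the function names are hypothetical, the transcripts are invented, overlapping speech is ignored, and system speaker labels are assumed to have already been mapped onto the reference speaker names. Word Error Rate is computed as the minimum number of insertions, deletions and substitutions divided by the reference length; the speaker-attributed variant simply compares (word, speaker) pairs, so a correct word with the wrong speaker counts as an error.

```python
def edit_distance(ref, hyp):
    """Minimum number of insertions, deletions and substitutions
    needed to turn the token sequence `ref` into `hyp`."""
    prev = list(range(len(hyp) + 1))  # distances for an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # i deletions turn ref[:i] into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            substitute = prev[j - 1] + (r != h)  # cost 0 if tokens match
            delete = prev[j] + 1
            insert = curr[j - 1] + 1
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]


def word_error_rate(ref, hyp):
    """Errors divided by the number of reference tokens."""
    if not ref:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def speaker_attributed_wer(ref_pairs, hyp_pairs):
    """Same metric over (word, speaker) pairs: a token is correct only
    if both the word and the speaker label match (assumes speaker
    labels are already mapped to the reference names)."""
    return word_error_rate(ref_pairs, hyp_pairs)


# Invented example: one substitution ("starts" -> "start") and one
# inserted "at" against a 5-word reference gives 2 / 5 = 0.4.
ref_words = "the meeting starts at noon".split()
hyp_words = "the meeting start at at noon".split()
print(word_error_rate(ref_words, hyp_words))  # 0.4

# All words correct, but the fourth word is credited to the wrong
# speaker: 1 error / 5 reference tokens = 0.2.
ref_pairs = list(zip(ref_words, ["A", "A", "A", "B", "B"]))
hyp_pairs = list(zip(ref_words, ["A", "A", "A", "A", "B"]))
print(speaker_attributed_wer(ref_pairs, hyp_pairs))  # 0.2
```

Because insertions count as errors but do not lengthen the reference, this ratio can exceed 100% for very poor output.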
Eight teams participated in one or more of these three tasks:
- AMI (Augmented Multi-party Interaction): University of Sheffield, IDIAP, University of Edinburgh, University of Technology Brno, and University of Twente.
- I2R/NTU: Institute for Infocomm Research and Nanyang Technological University.
- FIT: Florida Institute of Technology.
- ICSI: International Computer Science Institute.
- LIA/Eurecom: Laboratoire Informatique d'Avignon/Ecole d'ingénieurs et centre de recherche en Systèmes de Communications.
- SRI/ICSI: SRI International and International Computer Science Institute.
- UPM: Universidad Politécnica de Madrid.
- UPC: Universitat Politècnica de Catalunya.
The AMI, FIT and SRI/ICSI teams took part in the Speech-to-Text task; every team except FIT and SRI/ICSI took part in the Speaker Diarization task; and the AMI and SRI/ICSI teams participated in the Speaker-Attributed Speech-to-Text task. The primary input condition was produced by three distant microphones placed in the center of the meeting-room table. Participants also had the option of comparing their results on this primary condition to other conditions, including speech collected using array microphones and individual head-mounted microphones.
With multiple distant microphones and at most 4 overlapping speakers, the speech-to-text word error rate of the systems was between 40% and 50%. The error rate was close to 30% for meeting segments with at most one speaker at a time, and around 25% for the individual head-mounted microphone condition. These numbers are similar to those achieved in the 2007 Rich Transcription (RT-07) exercise.
For speaker diarization, there was a wide range of results among the six participants - from less than 10% Diarization Error Rate to over 30% in the multiple distant microphone condition. Results were slightly worse for both the array microphone condition and the single distant microphone condition. An analysis of the errors revealed that most of them came from attributing speech to the wrong speaker, rather than from missing speech or falsely detecting it. Compared to RT-07, the results were in the same range, although there was much variation from meeting to meeting in the test set. Unlike in previous years, participants were allowed to use video recorded at the meeting to help with the diarization - this resulted in about a 5% absolute improvement in the diarization error rate in the single distant microphone condition.
Finally, the speaker-attributed speech-to-text word error rate was close to 50% for the multiple distant microphone condition, and around 60% for both the microphone array and the single distant microphone conditions.
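The breakdown of diarization errors mentioned above (wrong speaker versus missed or falsely detected speech) can be illustrated with a simplified, frame-based sketch in Python. This is only an illustration under strong assumptions, not the official scoring procedure: the function name and the toy labels are invented, each frame carries at most one speaker (no overlap), system speaker labels are assumed to be already mapped onto the reference names, and the error counts are normalized by the number of reference speech frames, which is a choice made for this sketch.

```python
from collections import Counter


def diarization_errors(ref, hyp):
    """Frame-level error breakdown.

    `ref` and `hyp` are equal-length lists with one speaker label per
    frame, or None for non-speech. Returns the rates of missed speech,
    false-alarm speech and speaker error, plus their sum ("der"),
    each relative to the number of reference speech frames.
    """
    if len(ref) != len(hyp):
        raise ValueError("ref and hyp must cover the same frames")
    counts = Counter()
    for r, h in zip(ref, hyp):
        if r is not None and h is None:
            counts["missed"] += 1          # speech not detected
        elif r is None and h is not None:
            counts["false_alarm"] += 1     # speech detected in silence
        elif r is not None and r != h:
            counts["speaker_error"] += 1   # speech given to wrong speaker
    speech_frames = sum(r is not None for r in ref)
    if speech_frames == 0:
        raise ValueError("reference contains no speech")
    rates = {k: counts[k] / speech_frames
             for k in ("missed", "false_alarm", "speaker_error")}
    rates["der"] = sum(rates.values())
    return rates


# Invented 10-frame example: 8 reference speech frames, of which
# 2 go to the wrong speaker and 1 is missed; 1 silent frame is
# falsely labeled as speech.
ref = ["A", "A", "A", "A", None, None, "B", "B", "B", "B"]
hyp = ["A", "A", "B", "B", None, "B", "B", "B", "B", None]
print(diarization_errors(ref, hyp))
# {'missed': 0.125, 'false_alarm': 0.125, 'speaker_error': 0.25, 'der': 0.5}
```

In this toy case speaker error is the largest of the three components, the same pattern the evaluation's error analysis reported for the participating systems.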
The main conclusion of the exercise was that the results represented little or no improvement over those obtained during RT-07. It was also clear that the distant microphone condition remains significantly more difficult than the close-talking microphone condition, and that more research is needed to improve speech recognition using distant microphones.
For more information, see:
- Webpage of NIST's Rich Transcription Evaluation project.
- The 2009 RT Workshop agenda, including links to presentations.
If you have comments, corrections, or additions to this article, please contact the author: Satanjeev Banerjee, banerjee [at] cs [dot] cmu [dot] edu.
Satanjeev "Bano" Banerjee is a PhD student in the Language Technologies Institute at Carnegie Mellon University. His interests are in spoken language understanding of human-human dialog.

