The 2010 NIST Speaker Recognition Evaluation
Tina Kohler
SLTC Newsletter, July 2010
Evaluation Overview
The United States National Institute of Standards and Technology (NIST) sponsored a speaker recognition evaluation (SRE) in the spring of 2010. The evaluation concluded with a workshop at Brno University of Technology, Brno, Czech Republic, on 24-25 June. The 58 participants in this year’s evaluation were from five continents and submitted 113 core systems. At the June workshop, participants described their algorithms, NIST reported on evaluation results, and all participants engaged in joint discussions regarding data, performance measures, and current and future tasks. This was the thirteenth SRE since 1996.
Evaluation Tasks
As in previous years, the SRE 2010 evaluation consisted of one required task and eight optional tasks with varied amounts of training and recognition data. The required task focused on the “core” condition for training and recognition: one five-minute speech segment from either a telephone conversation or an interview for both training and recognition. The optional tasks were combinations of training with 10 seconds of speech, one five-minute conversation, or eight five-minute conversations (summed and individual sides), and testing with 10 seconds of speech, one five-minute conversation, or a 2-wire (summed) five-minute conversation.
NIST introduced a pilot task, human assisted speaker recognition (HASR), in the 2010 evaluation. The goal of this new task was to create a reasonable baseline for future research on forensic-like examination of voices using both human and automated systems. In this pilot, humans and systems could work separately or in combination to make same/different decisions on pairs of speech samples about three minutes long. Both experienced and naïve listeners from 15 sites participated.
NIST modified the evaluation metric this year to focus on the low false-alarm region of the detection error tradeoff curve for the core task and the eight-conversation train/test task by decreasing both the cost of a miss (from 10 to 1) and the target probability (from 0.01 to 0.001). The new cost function caused the minimum cost operating point to fall at a false-alarm rate between 0.01% and 0.1%.
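To illustrate why these parameter changes push the operating point toward low false-alarm rates, here is a minimal Python sketch. It assumes the usual weighted detection cost form, C = C_miss * P_miss * P_target + C_fa * P_fa * (1 - P_target), and a false-alarm cost of 1 in both years; neither the formula nor the false-alarm cost is stated in this article, and the function names are illustrative only.

```python
def detection_cost(p_miss, p_fa, c_miss, c_fa, p_target):
    """Weighted detection cost (assumed form, not quoted from this article)."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)


def fa_to_miss_weight(c_miss, c_fa, p_target):
    """Ratio of the weight on the false-alarm rate to the weight on the miss rate."""
    return (c_fa * (1.0 - p_target)) / (c_miss * p_target)


# Miss cost and target probability are from the article; c_fa = 1 is an assumption.
OLD = dict(c_miss=10.0, c_fa=1.0, p_target=0.01)
NEW = dict(c_miss=1.0, c_fa=1.0, p_target=0.001)

if __name__ == "__main__":
    print("old FA/miss weight:", fa_to_miss_weight(**OLD))
    print("new FA/miss weight:", fa_to_miss_weight(**NEW))
    # Cost of a hypothetical system with 30% misses and 0.05% false alarms.
    print("new cost:", detection_cost(0.30, 0.0005, **NEW))
```

Under these assumptions, a given false-alarm rate weighs roughly ten times as much as the same miss rate with the old parameters and roughly a thousand times as much with the new ones, which is consistent with the minimum-cost point moving to very low false-alarm rates.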
Algorithms
Most participants submitted fused combinations of Gaussian mixture models and support vector machines trained on Mel-frequency cepstral coefficients, with variations of joint factor analysis to counter the limited training data and multiple channel conditions. Calibration seemed to be more difficult with the new performance measurement. Some systems leveraged the automated speech recognition transcripts provided by NIST, and a few systems modeled prosodic features or phonetics.
One of the challenges of this year’s evaluation was determining how to use existing data to train the non-speaker-specific components of the systems, including background models, voice and channel factors, nontarget statistics, and nontarget test models.
Evaluation Data
This year’s evaluation set contained considerably more core trials than previous years, over 570,000. Additionally, there were nearly 6.5 million core-extended trials, needed to provide more statistical significance at the low false-alarm rates for the new evaluation metric.
New this year in the data were various levels of vocal effort by the speakers and conversations from speakers who also participated in older speaker collections. Repeated this year in the data were various microphones (seven this year) and both conversational and interview sessions. The training and testing sessions contained both matched and mismatched conditions across these elements.
NIST selected two subsets of the core test for the HASR data. HASR1 contained 15 trials, and HASR2 contained 150 trials. HASR1 data was a subset of HASR2 data. These subsets were chosen to be especially difficult for both humans and machines.
General System Performance
The best-performing systems provided equal error rates at or below 2% in the core conditions. As expected, the performance for the matched microphone conditions was better than for the mismatched microphone conditions. Unexpectedly, though, the top systems performed better when training on normal vocal effort and testing on low vocal effort than when testing on normal vocal effort.
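For readers unfamiliar with the equal error rate quoted above, it is the operating point at which the miss rate and the false-alarm rate are equal. The following Python sketch estimates it with a simple threshold sweep; the function name and the toy scores are hypothetical and are not taken from the evaluation or its scoring tools.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER by sweeping thresholds over all observed scores.

    A trial is accepted when its score is >= the threshold. Returns the mean
    of the miss and false-alarm rates at the threshold where they are closest.
    """
    best_gap, best_eer = None, None
    for t in sorted(set(target_scores) | set(nontarget_scores)):
        p_miss = sum(s < t for s in target_scores) / len(target_scores)
        p_fa = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(p_miss - p_fa)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (p_miss + p_fa) / 2.0
    return best_eer


if __name__ == "__main__":
    # Toy scores only; real evaluations involve hundreds of thousands of trials.
    targets = [2.0, 3.0, 4.0, 5.0]
    nontargets = [0.0, 1.0, 2.5, 1.5]
    print(equal_error_rate(targets, nontargets))
```

With so few trials the estimate is coarse; on large trial sets the miss and false-alarm curves cross much more finely.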
The effects of aging on voice could not be clearly distinguished, partly due to limited data. Only 14 speakers participated in collections more than ten years old, and most of the speakers participated in collections less than three years old.
Performance on the HASR pilot surprised many: on these data, automated systems generally performed better alone than with human assistance. Error rates of 15-30% were typical for human listeners. It is important to note the data did not represent a typical forensic task in detail, but the results should provide an interesting starting point for much relevant research in this area.
Future
NIST normally holds speaker evaluations and workshops in even-numbered years alternating with language evaluations and workshops in the odd years.
Acknowledgements and more information
Thanks to Alvin Martin and Jack Godfrey for providing input to this article.
You can learn more about NIST’s speaker recognition evaluations at the NIST speaker recognition evaluation web page, http://www.nist.gov/itl/iad/mig/sre.cfm.
If you have comments, corrections, or additions to this article, please contact the author: Tina Kohler, m.a.kohler [at] ieee [dot] org.
Tina Kohler is a researcher for the U.S. Government. Her interests include speaker and language processing.

