The 2009 NIST Language Recognition Evaluation

Tina Kohler

SLTC Newsletter, July 2009

The U.S. National Institute of Standards and Technology (NIST) sponsored a language recognition evaluation in spring 2009. This article describes the evaluation tasks, data, participant algorithms, and results.

Evaluation Overview

The United States National Institute of Standards and Technology (NIST) sponsored a language recognition evaluation (LRE) in the spring of 2009. The evaluation concluded with a workshop held in Baltimore, Maryland, U.S.A., on June 24 and 25. The 16 participants in this year’s evaluation were primarily from Europe and Asia, with one participant from the United States. At the June workshop, participants described their algorithms, NIST reported on evaluation results, and all participants engaged in joint discussions regarding data, performance measures, and future tasks. This was the fourth LRE since 2003.

Evaluation Tasks

This year’s evaluation contained three tasks: 23-language closed-set evaluation, 23-language open-set evaluation, and 23-language language-pair evaluation. The latter two tasks were optional.

Closed-set evaluation requires a language detection decision and score for each target language when the test segments are limited to the known set of (23) target languages. Open-set evaluation requires a language detection decision and score when the test segments are not limited to the target languages; i.e., out-of-set (OOS) languages are included. Language-pair evaluation requires a language detection decision and score when the test segments are limited to two languages; i.e., there is always a single alternative language hypothesis for each trial.

Each task contained three segment-duration test conditions: 3, 10, and 30 seconds. The open-set evaluation included 16 OOS languages, with a considerable number of speech segments tested for each OOS language. There were significantly more OOS speech segments this year than in previous LREs.

The 23 target languages were:

Amharic              Bosnian            Cantonese
Creole (Haitian)     Croatian           Dari
English (American)   English (Indian)   Farsi
French               Georgian           Hausa
Hindi                Korean             Mandarin
Pashto               Portuguese         Russian
Spanish              Turkish            Ukrainian
Urdu                 Vietnamese

Evaluation Data

Data collection for LRE has become increasingly challenging, so a new paradigm was used this year: found data. Instead of paying speakers to record telephone speech in various languages, narrowband voice segments were extracted from Voice of America (VoA) recordings, since these segments most likely come from talkers using a telephone and are therefore more likely to be conversational. This not only decreased the cost of data collection but also provided larger numbers of test segments. The VoA recordings were collected across several years, providing a wide variety of languages and speakers. Researchers at Brno University of Technology’s (BUT) Faculty of Information Technology created an algorithm for detecting narrowband portions of the broadcasts. VoA broadcast information provided language labels for most segments. Segments with no VoA broadcast information were labeled using automated language recognition. All labels were verified by the Linguistic Data Consortium (LDC) to ensure correctness. The evaluation data contained some conversational telephone speech (CTS), but the bulk was from VoA narrowband segments.

Algorithms

Most participants submitted systems that combined Gaussian Mixture Models (GMM), Support Vector Machines (SVM), and phonetic tokenization. Most GMMs were discriminative, and most phonetic tokenization used tokens extracted from multiple languages. Most sites used feature vectors containing Mel-frequency cepstral coefficients (MFCC) and shifted delta cepstra (SDC), and they incorporated some form of channel and noise compensation. Sites were requested to report their processing speeds. These ranged from 5 times faster than real time to 23 times slower than real time.

Evaluation Results

NIST defined the basic performance measure as a cost function based on detection miss and false alarm probabilities, with equal cost for both types of errors. They offered an alternative cost measure using log-likelihood ratios, and also provided a graphical representation of performance using detection error tradeoff (DET) curves.
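A minimal sketch of such a cost measure follows. The equal costs for misses and false alarms come from the description above; the target prior of 0.5, the function names, and the toy trial list are assumptions made for illustration, and the sketch does not reproduce NIST's exact definition (for example, any averaging over target languages).

```python
def error_rates(trials):
    """Return (P_miss, P_fa) from a list of (is_target, decision) pairs.

    is_target: True if the segment really is in the hypothesized language.
    decision:  True if the system accepted the hypothesis.
    """
    n_target = sum(1 for is_target, _ in trials if is_target)
    n_nontarget = len(trials) - n_target
    misses = sum(1 for is_target, decision in trials
                 if is_target and not decision)
    false_alarms = sum(1 for is_target, decision in trials
                       if not is_target and decision)
    return misses / n_target, false_alarms / n_nontarget


def detection_cost(p_miss, p_fa, c_miss=1.0, c_fa=1.0, p_target=0.5):
    """Weighted sum of miss and false-alarm probabilities.

    c_miss == c_fa reflects the equal-cost setting described in the article;
    p_target = 0.5 is an assumed prior, not taken from the article.
    """
    return c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa


# Toy data (invented): 4 target trials with 1 miss,
# 8 non-target trials with 1 false alarm.
trials = [(True, True), (True, False), (True, True), (True, True),
          (False, False), (False, False), (False, True), (False, False),
          (False, False), (False, False), (False, False), (False, False)]

p_miss, p_fa = error_rates(trials)
cost = detection_cost(p_miss, p_fa)
print(p_miss, p_fa, cost)  # 0.25 0.125 0.1875
```

With equal costs and an equal prior, the cost reduces to the plain average of the two error probabilities, which is why a system cannot lower it by trading one error type for the other.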

Overall, the performance was as expected, with the closed-set performance showing about half the error rates of open-set performance. For the 30-s segments, the top performers achieved equal error rates (EER) below 5% for the open-set task and below 2% for the closed-set task.
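For readers unfamiliar with the EER, it is the operating point at which the miss rate equals the false alarm rate. The sketch below approximates it by sweeping a decision threshold over the observed scores; the function name and the toy scores are invented for illustration and are not evaluation data.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by trying every observed score as a threshold.

    A trial is accepted when its score is >= the threshold. The result is
    the average of P_miss and P_fa at the threshold where they are closest.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = None, None
    for t in thresholds:
        p_miss = sum(s < t for s in target_scores) / len(target_scores)
        p_fa = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(p_miss - p_fa)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (p_miss + p_fa) / 2
    return eer


# Toy scores (invented): higher means "more likely the target language".
target_scores = [2.1, 1.7, 0.9, 0.4, -0.2]
nontarget_scores = [-1.5, -0.8, -0.3, 0.1, 0.6, -2.0, -1.1, -0.6, -0.1, 1.0]

eer = equal_error_rate(target_scores, nontarget_scores)
print(f"EER = {eer:.1%}")  # EER = 20.0%
```

This threshold sweep is quadratic in the number of trials and is meant only to show the definition; production scoring tools typically sort the scores once and interpolate along the DET curve.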

For every system that took part in the language-pair task, performance on that task was better than on the closed-set task at all durations. More than 90% of the 253 language pairs had an average error rate below 1%. The five language pairs that proved most challenging for automated language recognition each consist of two closely related languages or dialects: Hindi/Urdu, Bosnian/Croatian, Russian/Ukrainian, American English/Indian English, and Dari/Farsi.
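The figure of 253 pairs follows from choosing 2 of the 23 target languages without regard to order, i.e. 23 × 22 / 2. A short check, using the language list from this article (the variable names are chosen here for illustration):

```python
from itertools import combinations
from math import comb

TARGET_LANGUAGES = [
    "Amharic", "Bosnian", "Cantonese", "Creole (Haitian)", "Croatian",
    "Dari", "English (American)", "English (Indian)", "Farsi", "French",
    "Georgian", "Hausa", "Hindi", "Korean", "Mandarin", "Pashto",
    "Portuguese", "Russian", "Spanish", "Turkish", "Ukrainian", "Urdu",
    "Vietnamese",
]

# Every unordered pair of distinct target languages.
pairs = list(combinations(TARGET_LANGUAGES, 2))

print(len(TARGET_LANGUAGES))           # 23
print(len(pairs))                      # 253
print(comb(len(TARGET_LANGUAGES), 2))  # 253

# The five hardest pairs named in the article are among them.
hardest = [("Hindi", "Urdu"), ("Bosnian", "Croatian"),
           ("Russian", "Ukrainian"),
           ("English (American)", "English (Indian)"), ("Dari", "Farsi")]
print(all(pair in pairs for pair in hardest))  # True
```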

Future

NIST plans to hold language evaluations and workshops every other year (i.e., odd years). In the years between (i.e., even years), they will hold speaker evaluations and workshops. The found-data paradigm worked well this year, and it is hoped future evaluations will also take advantage of existing data.

Acknowledgements and more information

Thanks to Alvin Martin and Craig Greenberg for providing input to this article.

You can learn more about NIST’s language recognition evaluations at the NIST language recognition evaluation web page.

If you have comments, corrections, or additions to this article, please contact the author: Tina Kohler, m.a.kohler [at] ieee [dot] org.

Tina Kohler is a researcher for the U.S. Government. Her interests include speaker and language processing. Email: m.a.kohler@ieee.org

