Keyword spotting in Czech: First evaluation complete, with funding from Ministries of Interior and Defense

Honza Cernocky

SLTC Newsletter, January 2009

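The Czech Republic, with a population of 10 million, is surprisingly home to three successful university research groups working in the area of speech recognition: Brno University of Technology (BUT), the Technical University of Liberec (TUL), and the University of West Bohemia (UWB).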

Since 2007, these three groups have been cooperating on the research project "Overcoming the language barrier complicating investigation into financing terrorism and serious financial crimes", sponsored by the Czech Ministry of Interior under number VD20072010B16. The project aims to analyze spontaneous telephone calls from the security and defense domains. Unlike English, where corpora such as Switchboard and Fisher provide sufficient amounts of training material, Czech lacked a well-transcribed database of spontaneous telephone calls. For this reason, project activities in 2007 concentrated on creating such a database, and the consortium now holds almost 100 hours of transcribed and checked spontaneous speech data.

After several discussions with Interior and (later) Defense representatives, keyword spotting in Czech was defined as the top-priority task for security and defense analysts. To compare the performance of the groups' systems, an evaluation of keyword spotting systems was organized in 2008, with the final run in November and a post-evaluation workshop on November 21st. Systems were compared using standard metrics such as FOM (figure of merit) and EER (equal error rate), as well as on their speed and their ability to handle OOV (out-of-vocabulary) words.
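To make the EER metric concrete, here is a minimal sketch of how an equal error rate can be estimated from detection scores. The article does not describe the evaluation's actual scoring protocol, so the function names, the threshold sweep, and the toy scores below are assumptions for illustration only.

```python
def error_rates(target_scores, nontarget_scores, threshold):
    """Return (miss_rate, false_alarm_rate) when a detection is score >= threshold."""
    misses = sum(1 for s in target_scores if s < threshold)
    false_alarms = sum(1 for s in nontarget_scores if s >= threshold)
    return misses / len(target_scores), false_alarms / len(nontarget_scores)


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping every observed score as a threshold.

    Picks the threshold where miss and false-alarm rates are closest and
    returns (average of the two rates, threshold).
    """
    candidates = sorted(set(target_scores) | set(nontarget_scores))
    best = None
    for threshold in candidates:
        miss, fa = error_rates(target_scores, nontarget_scores, threshold)
        gap = abs(miss - fa)
        if best is None or gap < best[0]:
            best = (gap, (miss + fa) / 2, threshold)
    return best[1], best[2]


# Hypothetical scores: true keyword occurrences vs. false candidates.
targets = [0.9, 0.8, 0.7, 0.4]
nontargets = [0.6, 0.3, 0.2, 0.1]
eer, threshold = equal_error_rate(targets, nontargets)
```

With these toy scores, a threshold of 0.6 misses one of four true occurrences and accepts one of four false candidates, so both error rates meet at 25%.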

TUL built on its extensive experience with the recognition of Czech and designed a system based on LVCSR with a large vocabulary of 350k words and 410k pronunciation variants, without a language model. Note that Czech is a highly inflective language, so the 50k vocabularies common for English are not sufficient. The advantages of the system are its high speed of 0.15 xRT (processing takes 0.15 times the duration of the audio) and its ability to detect colloquial variants even if the correct form of a word is entered. The TUL group uses simple acoustic modeling based on context-independent models with high numbers of Gaussians, combined with state-of-the-art feature extraction: MFCC coefficients processed by a Heteroscedastic Linear Discriminant Analysis (HLDA) transform.
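As a quick illustration of what a real-time factor of 0.15 xRT means in practice, the sketch below converts between processing time and audio duration. The function names are hypothetical, and the 100-hour figure is used only as a round example.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Real-time factor (xRT): processing time divided by audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


def processing_hours(audio_hours, xrt):
    """Estimated processing time for a given amount of audio at a given xRT."""
    return audio_hours * xrt


# One hour of audio processed in 9 minutes corresponds to 0.15 xRT.
rtf = real_time_factor(processing_seconds=540, audio_seconds=3600)

# At 0.15 xRT, 100 hours of audio would take roughly 15 hours on one machine.
hours_needed = processing_hours(audio_hours=100, xrt=0.15)
```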

UWB experimented with two systems. The first was a purely acoustic keyword model that works against a background model, with the resulting likelihood ratio thresholded. The second used LVCSR lattices. While the acoustic system provided better results on the development set, which had a rather artificial selection of "good" (read: "long") keywords, the advantages of the LVCSR-based system fully emerged in the test on evaluation data, where the selection of keywords was not limited. The acoustic modeling in this system uses not only discriminative training of HMMs, but also discriminative adaptation to individual conversation sides. UWB also experimented with fusion of the individual systems and showed that they are complementary.
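The thresholded likelihood ratio used by the acoustic approach can be sketched as follows. This is not UWB's implementation: it assumes that per-frame log-likelihoods from a keyword model and a background model are already available for a candidate segment, and the length normalization and default threshold are illustrative choices.

```python
def log_likelihood_ratio(keyword_loglik, background_loglik):
    """Average per-frame log-likelihood ratio over a candidate segment.

    Both arguments are sequences of per-frame log-likelihoods, one from the
    keyword model and one from the background model, over the same frames.
    """
    if len(keyword_loglik) != len(background_loglik) or not keyword_loglik:
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(k - b for k, b in zip(keyword_loglik, background_loglik))
    return total / len(keyword_loglik)


def detect(keyword_loglik, background_loglik, threshold=0.0):
    """Accept the candidate if the normalized ratio reaches the threshold."""
    return log_likelihood_ratio(keyword_loglik, background_loglik) >= threshold


# Hypothetical per-frame scores for a three-frame candidate segment.
keyword_scores = [-2.0, -1.0, -3.0]
background_scores = [-4.0, -2.0, -3.0]
accepted = detect(keyword_scores, background_scores, threshold=0.5)
```

Raising the threshold trades false alarms for misses, which is the trade-off that metrics such as FOM and EER summarize.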

BUT tested four systems in this evaluation: FastLVCSR was based on LVCSR with insertion of keywords into the language model; HybridLVCSR used full-fledged word and subword recognition and indexing; and two acoustic systems were based on GMM/HMM and NN/HMM. While the LVCSR systems are more precise, the advantage of the acoustic ones is their speed. HybridLVCSR is worth mentioning because it allows large quantities of data to be pre-processed off-line, with subsequent very fast searches, including searches for OOVs. BUT built on its experience with LVCSR and keyword spotting in the European Community-sponsored AMI and AMIDA projects, as well as its participation in the 2006 NIST Spoken Term Detection evaluation.
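The idea of indexing off-line and searching quickly afterwards can be illustrated with a simple inverted index over recognized words. This is only a sketch of the general principle, not BUT's HybridLVCSR: the tuple layout, function names, and confidence cutoff are assumptions, and the subword index that makes OOV search possible is left out.

```python
from collections import defaultdict


def build_index(hypotheses):
    """Build an inverted index from recognizer output (the slow, off-line step).

    hypotheses: iterable of (recording_id, word, start_sec, confidence).
    """
    index = defaultdict(list)
    for recording_id, word, start_sec, confidence in hypotheses:
        index[word.lower()].append((recording_id, start_sec, confidence))
    return index


def search(index, keyword, min_confidence=0.5):
    """Look up a keyword (the fast, on-line step), best hits first."""
    hits = [h for h in index.get(keyword.lower(), []) if h[2] >= min_confidence]
    return sorted(hits, key=lambda h: h[2], reverse=True)


# Hypothetical recognizer output for two calls.
hypotheses = [
    ("call1", "banka", 1.2, 0.9),
    ("call2", "banka", 3.4, 0.4),
    ("call1", "převod", 5.0, 0.8),
]
index = build_index(hypotheses)
hits = search(index, "Banka")
```

Because recognition runs once per recording, each later query costs only a dictionary lookup, regardless of how many hours of audio were indexed.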

The research groups consider this event very important, as it builds confidence in speech technologies within the Czech security and defense community, and they hope it will have a positive impact on their future funding.

The leaders of Czech speech groups (from the left): Honza Cernocky (BUT), Jan Nouza (TUL), Ludek Muller (UWB)