
Overview of Acoustic Modeling Techniques from ICASSP 2010

Tara N. Sainath

SLTC Newsletter, April 2010

The 35th International Conference on Acoustics, Speech, and Signal Processing (ICASSP) was recently held in Dallas, Texas, on March 14-19, 2010. The conference included lectures and poster sessions in a variety of speech and signal processing areas. In this article, some of the main acoustic modeling sessions at the conference are discussed in more detail.

New Directions in Direct Modeling

Many interesting papers were presented in a special session on Direct Modeling. [1] discusses long-span, segment-level features that directly relate acoustic properties to words. The paper compares two direct-model approaches to modeling these long-span features, namely Flat Direct Models (FDMs) and Segmental CRFs (SCRFs). The authors find that both models show improvements over a standard baseline HMM system on the Bing Mobile voice-search database.

[2] introduces a discriminative model for speech recognition that jointly estimates acoustic, duration, and language model parameters. Typically, language and acoustic models are estimated independently of each other. As a result, acoustic models ignore language cues and language models ignore acoustic information. The authors find that jointly estimating acoustic, duration, and language model parameters results in a 1.6% absolute improvement in word error rate (WER) on a large vocabulary GALE task.

New Algorithms for ASR

The session "New Algorithms for ASR" featured many papers that explored new machine learning techniques for speech recognition applications. For example, [3] explores using Restricted Boltzmann Machines (RBMs) for phonetic recognition. RBMs are advantageous because they address the conditional independence assumption of Hidden Markov Models (HMMs). Specifically, they use a hidden state that allows many different features to collectively determine a specific output frame. The observations interact with the hidden features through an undirected model. On the TIMIT phonetic recognition task, the authors find that the RBM method offers a 0.6% absolute improvement in phone error rate (PER) over a conventional HMM system.
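To make the undirected interaction between observations and hidden features concrete, here is a minimal sketch of the conditional probabilities in a generic RBM with binary units. It is not the model from [3]; the function names, the binary-unit assumption, and the toy weights are all illustrative assumptions. The point is only that every hidden unit is driven by all observed features at once, and that the same weights are reused in both directions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hidden_probabilities(visible, weights, hidden_bias):
    """p(h_j = 1 | v) for a binary RBM.

    weights[i][j] connects visible unit i to hidden unit j (hypothetical layout).
    Each hidden unit sums evidence from every visible feature.
    """
    probs = []
    for j, bias in enumerate(hidden_bias):
        activation = bias + sum(v * weights[i][j] for i, v in enumerate(visible))
        probs.append(sigmoid(activation))
    return probs


def visible_probabilities(hidden, weights, visible_bias):
    """p(v_i = 1 | h), using the same weights because the model is undirected."""
    probs = []
    for i, bias in enumerate(visible_bias):
        activation = bias + sum(h * weights[i][j] for j, h in enumerate(hidden))
        probs.append(sigmoid(activation))
    return probs


# Toy example: 3 observed features, 2 hidden units (values are made up).
toy_weights = [[0.5, -1.0], [2.0, 0.3], [-0.5, 1.0]]
toy_hidden = hidden_probabilities([1.0, 0.0, 1.0], toy_weights, [0.0, 0.0])
toy_visible = visible_probabilities([1.0, 0.0], toy_weights, [0.0, 0.0, 0.0])
```

A real phone recognizer would use real-valued acoustic inputs and learned parameters; this sketch only shows the form of the conditionals.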

[4] explores digit recognition using point process models. The performance of HMM systems can degrade when noise conditions at test time do not match those in training. Point process models (PPMs) have been shown to detect keywords more robustly in noisy speech conditions. In this paper, digit recognition is performed via the following steps. First, the speech signal is transformed into a sparse set of acoustic events through time. Second, point process models of these events are used to hypothesize possible digits. Finally, the sequence of hypothesized digits is reduced to a final hypothesis using a graph-based optimization method. On the Aurora 2 noisy digits task, the PPM offers comparable performance to an HMM in clean speech but allows for significant improvements in noisy conditions.
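The three steps above can be sketched roughly as follows. This is not the method of [4]: the peak-picking rule, the homogeneous Poisson rates, and the dynamic program over non-overlapping hypotheses are simplifying assumptions chosen to keep the example short, and all function names are hypothetical.

```python
import math
from bisect import bisect_right


def extract_events(trajectory, threshold):
    """Step 1: reduce a per-frame detector score to a sparse list of event frames.

    A frame is kept if it is a local maximum at or above the threshold.
    """
    events = []
    for t in range(1, len(trajectory) - 1):
        value = trajectory[t]
        if value >= threshold and value > trajectory[t - 1] and value >= trajectory[t + 1]:
            events.append(t)
    return events


def poisson_llr(num_events, duration, word_rate, background_rate):
    """Step 2: log-likelihood ratio of a word model against a background model.

    Both are homogeneous Poisson processes here (an assumption for brevity).
    """
    return (num_events * math.log(word_rate / background_rate)
            - (word_rate - background_rate) * duration)


def best_sequence(hypotheses):
    """Step 3: choose non-overlapping (start, end, label, score) hypotheses
    with the highest total score, via dynamic programming over end times."""
    hyps = sorted(hypotheses, key=lambda h: h[1])
    ends = [h[1] for h in hyps]
    best = [(0.0, [])]  # best[k] = (score, labels) using the first k hypotheses
    for k, (start, end, label, score) in enumerate(hyps):
        # Count earlier hypotheses that end at or before this one starts.
        p = bisect_right(ends, start, 0, k)
        take_score = best[p][0] + score
        if take_score > best[k][0]:
            best.append((take_score, best[p][1] + [label]))
        else:
            best.append(best[k])
    return best[-1][1]


# Toy usage with made-up numbers.
events = extract_events([0.1, 0.9, 0.2, 0.1, 0.8, 0.3], threshold=0.5)
digits = best_sequence([(0, 10, "one", 2.0), (5, 15, "two", 3.0), (10, 20, "three", 2.5)])
```

In the toy call, the overlapping "two" hypothesis is dropped because "one" followed by "three" has a higher combined score.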

Acoustic Modeling

Finally, subspace Gaussian Mixture Models (SGMMs) were a popular topic in the acoustic modeling session ([5], [6]). In SGMMs, a common GMM structure is shared among all phonetic states. The means and weights of each state then vary in a subspace of the total parameter space. This allows for a much more compact representation of the acoustic model. Experiments on the English Callhome database in [5] indicate that compactly representing the acoustic model allows for improvements in WER over the standard GMM approach to acoustic modeling, and these gains continue to hold even after speaker adaptation is applied to the acoustic models. In addition, [6] shows the benefit of SGMMs for multilingual speech recognition. The traditional approach to multilingual recognition is to use a "universal phone set" covering multiple languages. In [6], phone sets from each language are treated distinctly, but an SGMM method allows parameters of the models to be shared across multiple languages. Absolute improvements in WER of between 5% and 10% are reported for the English, Spanish and German Callhome databases.
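As a rough illustration of how state-specific means and weights can be derived from a shared structure, the sketch below assumes the basic form usually given for SGMMs: each state j has a low-dimensional vector v_j, each shared Gaussian i has a projection matrix M_i and a weight vector w_i, the mean is M_i v_j, and the weights are a softmax over w_i . v_j. The function names and toy values are assumptions, and the shared covariances, substates, and speaker adaptation are left out.

```python
import math


def state_means(projections, state_vector):
    """Means of all shared Gaussians for one state: mu_i = M_i v.

    projections[i] is the matrix M_i, stored as a list of rows.
    """
    return [[sum(m * v for m, v in zip(row, state_vector)) for row in matrix]
            for matrix in projections]


def state_weights(weight_vectors, state_vector):
    """Mixture weights for one state: softmax over i of (w_i . v)."""
    logits = [sum(w * v for w, v in zip(w_i, state_vector)) for w_i in weight_vectors]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: 2 shared Gaussians, 2-dimensional features, 2-dimensional subspace.
shared_projections = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]]]
shared_weight_vectors = [[0.0, 0.0], [0.0, 0.0]]
v_state = [1.0, 2.0]  # the only state-specific parameters in this sketch

means = state_means(shared_projections, v_state)
weights = state_weights(shared_weight_vectors, v_state)
```

In this sketch each state stores only its vector, while the projections and weight vectors are shared by every state. In a multilingual setup like the one described in [6], those shared quantities would presumably be the parameters pooled across languages.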

For more information, please see:

[1] G. Zweig and P. Nguyen, "From Flat Direct Models to Segmental-CRF Models," in Proc. ICASSP, 2010.
[2] M. Lehr and I. Shafran, "Discriminatively Estimated Joint Acoustic, Duration, and Language Model for Speech Recognition," in Proc. ICASSP, 2010.
[3] A. Mohamed and G. Hinton, "Phone Recognition using Restricted Boltzmann Machines," in Proc. ICASSP, 2010.
[4] A. Jansen and P. Niyogi, "Detection-Based Speech Recognition with Sparse Point Process Models," in Proc. ICASSP, 2010.
[5] D. Povey et al., "Subspace Gaussian Mixture Models for Speech Recognition," in Proc. ICASSP, 2010.
[6] L. Burget et al., "Multilingual Acoustic Modeling for Speech Recognition Based on Subspace Gaussian Mixture Models," in Proc. ICASSP, 2010.

If you have comments, corrections, or additions to this article, please leave a comment below or contact the author: Tara Sainath, tsainath [at] us [dot] ibm [dot] com.

Tara Sainath is a Research Staff Member at IBM T.J. Watson Research Center in New York. Her research interests are in acoustic modeling. Email: tnsainat@us.ibm.com