An Overview of Acoustic Modeling Techniques from ASRU 2011
Tara N. Sainath
SLTC Newsletter, January 2012
The Automatic Speech Recognition and Understanding Workshop (ASRU) was held in Hawaii on December 11-15, 2011. The conference included lectures and poster sessions in a variety of speech and signal processing areas. The Best Paper award was given to George Saon of IBM Research and Jen-Tzung Chien from National Taiwan University for their paper "Some Properties of Bayesian Sensing Hidden Markov Models", while the Student Best Paper award was given to Anton Ragni of Cambridge University for his paper "Derivative Kernels for Noise Robust ASR". This article discusses some of the main acoustic modeling sessions at the conference in more detail.
The acoustic modeling session featured papers across a variety of topics, including Deep Belief Networks (DBNs), boosting, log-linear models, sparsity for acoustic models and conditional random fields (CRFs).
DBNs are explored on a large-vocabulary, 300-hour Switchboard task in Seide et al. The paper specifically explores the effectiveness of different feature transforms, namely HLDA, VTLN and fMLLR, when the transformed features are used as inputs to DBNs. The authors find that while applying HLDA and VTLN does not provide any improvement for deep networks, fMLLR-like transforms yield roughly a 4% relative improvement. In addition, the authors introduce a discriminative pre-training strategy. Overall, across different test sets, the best-performing DNN offers a 28% to 32% relative improvement over a baseline discriminatively trained GMM system.
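The results in this article are quoted as relative improvements, that is, the reduction expressed as a fraction of the baseline's error rate rather than as an absolute difference. A minimal sketch of that arithmetic follows. It assumes the underlying metric is an error rate such as word error rate; the function name and the numbers are hypothetical and are not taken from any of the cited papers.

```python
def relative_improvement(baseline_error: float, new_error: float) -> float:
    """Return the error reduction as a percentage of the baseline error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - new_error) / baseline_error


# Hypothetical error rates, for illustration only.
baseline = 20.0  # baseline system error rate, in percent
improved = 14.0  # new system error rate, in percent

# A 6-point absolute reduction from a 20% baseline is a 30% relative improvement.
print(f"{relative_improvement(baseline, improved):.1f}% relative")
```

Under this definition, the same absolute gain counts for more when the baseline is already strong, which is why relative figures are commonly used to compare systems across test sets.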
Tachibana et al. explore a boosting algorithm for a large-vocabulary voice search task. The algorithm, known as AnyBoost, trains an ensemble of weak learners using a discriminative objective function. The training data is weighted in proportion to the sum of posterior functions from the previous round of weak learners. Results on a voice search task trained with between 150 and 5,000 hours of data show that the boosting method provides relative gains of 5.1% to 7.5%.
The problem of automatically generating articulatory transcripts of spoken utterances is explored in Prabhavalkar et al. While research in this area has become popular over the past few years, it has been limited by the lack of labeled articulatory data. Given the cost of manual transcription, the preferred approach is to generate transcripts automatically. The authors explore a factored model of the articulatory state space and an explicit model of asynchrony between articulators. They show that CRFs, which are able to take advantage of task-specific constraints, offer between 2.2% and 10.0% relative improvement compared to a dynamic Bayesian network.
The session on ASR Robustness presented various noise robustness techniques for improving ASR performance.
Ragni et al. extend a recent research idea of combining generative and discriminative classifiers, where generative kernels are used to derive features for discriminative models. The paper presents three main contributions: (1) using context-dependent generative models to derive derivative kernels; (2) incorporating derivative kernels into discriminative models; and (3) addressing the problem of feature dimensionality and the large number of parameters for derivative kernels. On a small-vocabulary Aurora 2 task and a large-vocabulary Aurora 4 task, the authors demonstrate that the proposed derivative kernel method offers improvements over other commonly used noise robustness techniques, including vector Taylor series (VTS).
Rennie et al. explore how a model-based noise robustness method, known as Dynamic Noise Adaptation (DNA), can be made more robust to matched data without system retraining. Specifically, the authors perform online model selection and averaging between two models: a DNA noise model that tracks the background noise and a DNA noise model that represents the null mismatch hypothesis. The proposed approach improves a strong speaker-adapted, discriminatively trained recognizer by 15% relative at signal-to-noise ratios (SNRs) below 10 dB, and by over 8% relative overall.
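The idea of averaging between two competing models can be illustrated with generic Bayesian model averaging: each model is weighted by its posterior probability given the data seen so far, and predictions are combined with those weights. The sketch below is not the authors' algorithm. The function names, priors and log-likelihood values are hypothetical and chosen only to show the mechanics.

```python
import math
from typing import List


def posterior_weights(log_liks: List[float], log_priors: List[float]) -> List[float]:
    """Normalize log-likelihood + log-prior scores into posterior model weights."""
    scores = [ll + lp for ll, lp in zip(log_liks, log_priors)]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def model_average(predictions: List[float], weights: List[float]) -> float:
    """Combine per-model predictions using the posterior weights."""
    return sum(p * w for p, w in zip(predictions, weights))


# Hypothetical accumulated log-likelihoods for two hypotheses:
# index 0 = a model that tracks background noise,
# index 1 = a "no mismatch" (null) model.
log_liks = [-10.0, -12.0]
log_priors = [math.log(0.5), math.log(0.5)]  # equal priors, an assumption

weights = posterior_weights(log_liks, log_priors)
# Hypothetical per-model estimates of some quantity of interest.
estimate = model_average([3.0, 0.0], weights)
print(weights, estimate)
```

Setting the weight of the less likely model to zero instead of averaging would correspond to hard model selection; averaging keeps both hypotheses in play when the evidence is ambiguous.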
Finally, joint speaker and environmental adaptation is explored in Seltzer et al. This joint adaptation is performed using a cascade of constrained maximum likelihood linear regression (CMLLR) transforms, which separately compensate for environmental and speaker variability. The authors also explore the use of unsupervised environmental clustering so that the proposed method does not require knowledge of the training and test environments ahead of time. The proposed algorithm achieves relative improvements of 10% to 18% over standard CMLLR methods.
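Because a CMLLR transform is an affine map applied to each feature vector, a cascade of two such transforms can be sketched in a few lines. The example below is an illustration under stated assumptions, not the authors' implementation: the ordering (environment transform first, then speaker transform), the helper names, and all matrix and vector values are hypothetical. It also shows that the cascade collapses into a single equivalent affine transform.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def affine(A: Matrix, b: Vector, x: Vector) -> Vector:
    """Apply y = A x + b to one feature vector."""
    return [sum(a * xi for a, xi in zip(row, x)) + bi for row, bi in zip(A, b)]


def compose(A2: Matrix, b2: Vector, A1: Matrix, b1: Vector) -> Tuple[Matrix, Vector]:
    """Collapse y = A2 (A1 x + b1) + b2 into one transform (A2 A1, A2 b1 + b2)."""
    rows, inner, cols = len(A2), len(A1), len(A1[0])
    A = [[sum(A2[i][k] * A1[k][j] for k in range(inner)) for j in range(cols)]
         for i in range(rows)]
    b = [sum(A2[i][k] * b1[k] for k in range(inner)) + b2[i] for i in range(rows)]
    return A, b


# Hypothetical 2-dimensional transforms, for illustration only.
A_env, b_env = [[1.0, 0.5], [0.0, 2.0]], [1.0, -1.0]   # environment transform
A_spk, b_spk = [[2.0, 0.0], [1.0, 1.0]], [0.5, 0.0]    # speaker transform

x = [1.0, 2.0]  # one feature frame

# Cascade: compensate for the environment, then for the speaker (assumed order).
cascaded = affine(A_spk, b_spk, affine(A_env, b_env, x))

# Equivalent single transform.
A_joint, b_joint = compose(A_spk, b_spk, A_env, b_env)
collapsed = affine(A_joint, b_joint, x)

print(cascaded, collapsed)
```

Keeping the two transforms separate, rather than estimating one joint transform, is what allows each factor to be reused independently, for example applying the same speaker transform in a different environment.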
F. Seide, G. Li, X. Chen, and D. Yu, "Feature Engineering In Context-Dependent Deep Neural Networks For Conversational Speech Transcription," in Proc. ASRU, December 2011.
R. Tachibana, T. Fukuda, U. Chaudhari, B. Ramabhadran, and P. Zhan, "Frame-level AnyBoost for LVCSR with the MMI Criterion," in Proc. ASRU, December 2011.
R. Prabhavalkar, E. Fosler-Lussier, and K. Livescu, "A Factored Conditional Random Field Model for Articulatory Feature Forced Transcription," in Proc. ASRU, December 2011.
A. Ragni and M.J.F. Gales, "Derivative Kernels for Noise Robust ASR," in Proc. ASRU, December 2011.
S.J. Rennie, P. Dognin, and P. Fousek, "Matched-Condition Robust Dynamic Noise Adaptation," in Proc. ASRU, December 2011.
M.L. Seltzer and A. Acero, "Factored Adaptation for Separable Compensation of Speaker and Environmental Variability," in Proc. ASRU, December 2011.
If you have comments, corrections, or additions to this article, please contact the author: Tara Sainath, tsainath [at] us [dot] ibm [dot] com.
Tara Sainath is a Research Staff Member at IBM T.J. Watson Research Center in New York. Her research interests are mainly in acoustic modeling.