An Overview of Speech and Related Papers @ ACL 2012

Asli Celikyilmaz

SLTC Newsletter, August 2012


ACL 2012 covered a broad spectrum of disciplines working towards enabling intelligent systems to interact with humans using natural language, and towards enhancing human-human communication through services such as speech recognition, automatic translation, information retrieval, text summarization, and information extraction. This article summarizes the speech papers, specifically advances in acoustic and language modeling, as well as speech-related papers on language acquisition, phonemes and words. The selected papers are noteworthy in their respective fields, as the novel approaches they present outperform their baselines. For further reading, please refer to the proceedings [1].


This year, Jeju Island, South Korea hosted the 50th Anniversary of the Association for Computational Linguistics (ACL 2012). Renowned scholar Aravind K. Joshi, Henry Salvatori Professor of Computer and Cognitive Science at the University of Pennsylvania, delivered the conference keynote address, “Remembrance of ACL’s past,” discussing the research trends in ACL papers over the last 50 years. During the three-day conference there were very interesting talks on machine translation, parsing, generation, NLP applications, dialog, discourse and social media. There were four tracks on speech and speech-related topics, and several other tracks also hosted speech-related papers, including the tracks on machine learning, phonemes, and words.

(i) ASR and Related Topics

One of the noteworthy papers in the first speech session was by Lee and Glass [2], on a non-parametric Bayesian approach to acoustic modeling that learns a set of sub-word units to represent the spoken data. Specifically, given a set of spoken utterances, their latent variable model jointly learns a segmentation that finds the phonetic boundaries within each utterance, clusters the acoustically similar segments together, and models each sub-word acoustic unit with an HMM. They compare the phonetic boundaries captured by their model to the manual labels in their dataset, which is described in Dusan and Rabiner’s paper [3]. Their results show striking correlations between the cluster labels learned by the model and English phones, which indicates that, without any language-specific knowledge, their model is able to discover the phonetic composition of a set of speech data. In addition, they compare their results to supervised acoustic models, including English triphone and monophone models as well as Thai monophone models. Their results indicate that, even without supervision, their model can capture and learn the acoustic characteristics of a language and produce an acoustic model that outperforms a language-mismatched acoustic model trained with high supervision.

Another conversational speech paper was by Tang et al. [4], in which the authors introduce a Discriminative Pronunciation Modeling approach to tackle ASR issues in mapping between words and their possible pronunciations in terms of sub-word units (phones). The common solution to this problem is a generative model (one that models the joint distribution between input and output variables), which provides distributions over possible pronunciations given the canonical ones [5], [6], [7], [8]. In some recent work, the generative models are discriminatively optimized to obtain better accuracy. In this paper, a flexible discriminative approach is used to build the pronunciation model directly. The authors provide detailed information about the features of their linear model and compare their proposed pronunciation model to Zweig’s CRF-based speech recognizer [9]. Benchmark analyses indicate that their models are sparser, much faster, and perform better when they use a large set of feature functions.
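
To make the idea of a linear, feature-based pronunciation model concrete, here is a minimal sketch. It is not the model of Tang et al. [4]: the feature map (counts of phone/word pairs), the toy data, and the perceptron update are all assumptions chosen for brevity, and the perceptron merely stands in for whatever discriminative training the paper actually uses. The point is only that each candidate word is scored by a weighted sum of feature functions over the observed phones, and the highest-scoring word wins.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Phones = Tuple[str, ...]


def features(phones: Phones, word: str) -> Dict[str, float]:
    """Hypothetical feature map: counts of (phone, word) pairs."""
    feats: Dict[str, float] = defaultdict(float)
    for ph in phones:
        feats[f"{ph}|{word}"] += 1.0
    return feats


def score(weights: Dict[str, float], phones: Phones, word: str) -> float:
    """Linear score: dot product of the weights with the feature vector."""
    return sum(weights.get(name, 0.0) * value
               for name, value in features(phones, word).items())


def predict_word(weights: Dict[str, float], phones: Phones,
                 vocab: Sequence[str]) -> str:
    """Return the vocabulary word with the highest linear score."""
    return max(vocab, key=lambda w: score(weights, phones, w))


def train_perceptron(data: List[Tuple[Phones, str]], vocab: Sequence[str],
                     epochs: int = 5) -> Dict[str, float]:
    """Perceptron training (a stand-in, not the paper's training procedure)."""
    weights: Dict[str, float] = defaultdict(float)
    for _ in range(epochs):
        for phones, gold in data:
            guess = predict_word(weights, phones, vocab)
            if guess != gold:
                # Move weights towards the gold word, away from the wrong guess.
                for name, value in features(phones, gold).items():
                    weights[name] += value
                for name, value in features(phones, guess).items():
                    weights[name] -= value
    return weights


# Made-up surface pronunciations, including a reduced form of "probably".
VOCAB = ["problem", "probably"]
TRAIN = [
    (("p", "r", "aa", "b", "l", "iy"), "probably"),
    (("p", "r", "aa", "l", "iy"), "probably"),
    (("p", "r", "aa", "b", "l", "ax", "m"), "problem"),
]

if __name__ == "__main__":
    model = train_perceptron(TRAIN, VOCAB)
    print(predict_word(model, ("p", "r", "aa", "b", "ax", "m"), VOCAB))
```

Because the weight vector only holds features that were ever updated, a model of this shape stays sparse, which is in the spirit of the sparsity the summary above attributes to the paper's models.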

(ii) Language Modeling

Recent language modeling (LM) research shows that incorporating long-distance dependencies and syntactic structure helps LMs better predict words by complementing the predictive power of n-grams. Current state-of-the-art methods that capture long-term dependencies use generative or discriminative approaches, which can be slow and often cannot work directly with lattices or require rescoring large N-best lists. Additionally, such approaches use non-local features for rescoring (usually obtained from auxiliary tools such as POS taggers or parsers), which introduces major inefficiencies. To address this, Rastrow et al. [10] introduce Sub-Structure Sharing, a general framework that greatly improves the speed of auxiliary syntactic tools by exploiting the commonalities among the set of generated hypotheses (lattices). Their key idea is to share substructure states in transition-based structure prediction algorithms, where final structures are composed of a sequence of multiple individual decisions. In sub-structure sharing, a transition (or action) ω ∈ Ω is chosen by a classifier g trained on labeled data. The classification model summarizes the states as vectors of features, and g imposes the same distribution over actions if the states are equivalent. This way, the states and their distributions over actions are stored and re-used at test time. They also use an up-training method to improve the accuracy of their decoder. In experiments on real data, they show improvements in the speed of the language models without degrading performance.
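
The caching idea behind sub-structure sharing can be sketched in a few lines. The following is an illustration only, not the implementation of Rastrow et al. [10]: the class `SharedClassifier`, the toy tagging rule, and the choice of features (current word plus previous tag) are all hypothetical. It shows how treating states with identical feature vectors as equivalent lets the distribution over actions be computed once and reused across hypotheses that share words.

```python
from typing import Callable, Dict, List, Tuple

Features = Tuple[str, ...]


class SharedClassifier:
    """Caches a classifier's action distribution, keyed by the feature vector."""

    def __init__(self, classify: Callable[[Features], Dict[str, float]]):
        self._classify = classify  # the expensive underlying model
        self._cache: Dict[Features, Dict[str, float]] = {}
        self.calls = 0  # number of real classifier evaluations

    def action_distribution(self, features: Features) -> Dict[str, float]:
        # States with identical feature vectors are treated as equivalent,
        # so the stored distribution is reused instead of recomputed.
        if features not in self._cache:
            self.calls += 1
            self._cache[features] = self._classify(features)
        return self._cache[features]


def toy_tagger(features: Features) -> Dict[str, float]:
    """Stand-in for a trained tagger; the rule below is made up."""
    word, prev_tag = features
    if word in ("the", "a"):
        return {"DT": 0.9, "NN": 0.1}
    if prev_tag == "DT":
        return {"NN": 0.8, "VB": 0.2}
    return {"VB": 0.7, "NN": 0.3}


def tag_hypotheses(hypotheses: List[List[str]],
                   clf: SharedClassifier) -> List[List[str]]:
    """Greedily tag each hypothesis, sharing classifier results across them."""
    tagged = []
    for words in hypotheses:
        prev, tags = "<s>", []
        for word in words:
            dist = clf.action_distribution((word, prev))
            prev = max(dist, key=dist.get)  # greedy choice of action
            tags.append(prev)
        tagged.append(tags)
    return tagged


if __name__ == "__main__":
    nbest = [["the", "dog", "barks"],
             ["the", "dog", "parks"],
             ["a", "dog", "barks"]]
    clf = SharedClassifier(toy_tagger)
    print(tag_hypotheses(nbest, clf))
    print("classifier calls:", clf.calls, "of 9 decisions")
```

In this toy run the three hypotheses require nine tagging decisions, but only five of them reach the underlying classifier; the rest are served from the cache. The more the hypotheses overlap, as they do in a lattice or N-best list, the larger the saving.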

(iii) Lexical and Phonetic Acquisition

There were two notable papers on child language acquisition, each looking at different aspects of speech. Elsner et al. [11] investigate infant language acquisition by jointly learning the lexicon and phonetics. Specifically, their joint lexical-phonetic model infers intended forms from segmented surface forms. This allows the model to account for pronunciation variability and improves the accuracy of the learned lexicon over a system that assumes each intended form has a unique surface form. They introduce a hierarchical Bayesian framework that generates intended word tokens using a language modeling structure and then transforms each token with a probabilistic finite-state transducer to produce the observed surface sequence. Their transducer is parameterized by probabilities obtained from a log-linear model, which uses features based on articulatory phonetics.

Sahakian and Snyder [12] investigate automatically learned metrics for child language development. Measuring language learning at early ages is crucial, since these measures can help to diagnose early language disorders. The authors extract several features from CHILDES, a collection of corpora of child language based on episodic speech data, including histories of longitudinal studies of individual children. Beyond three standard metrics of language development, namely utterance length, syntactic complexity, and linguistic competence, they introduce additional measures: obligatory morpheme counts (articles and contracted auxiliary “be” verbs); preposition occurrences; and vocabulary-centric features (counts of function words, prepositions, pronouns, conjunctions, etc.). For evaluation on individual children, they use least-squares regression over the introduced features to predict the age of a held-out language sample. They find that, on average, existing single metrics of development are outperformed by a weighted combination of the proposed features.
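
The evaluation setup described above, least-squares regression over features with a held-out sample, can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the two features (a mean utterance length and a preposition count), the ages in months, and all numbers are invented, and the synthetic ages are exactly linear in the features so that the held-out errors come out near zero. Real data would of course be noisy.

```python
from typing import List, Sequence


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_least_squares(rows: Sequence[Sequence[float]],
                      targets: Sequence[float]) -> List[float]:
    """Return [bias, w1, ...] minimizing squared error (normal equations)."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(weights: Sequence[float], row: Sequence[float]) -> float:
    """Weighted combination of the features plus a bias term."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], row))


def leave_one_out_errors(rows: List[List[float]],
                         ages: List[float]) -> List[float]:
    """Hold out each sample in turn and predict its age from the rest."""
    errors = []
    for i in range(len(rows)):
        weights = fit_least_squares(rows[:i] + rows[i + 1:],
                                    ages[:i] + ages[i + 1:])
        errors.append(abs(predict(weights, rows[i]) - ages[i]))
    return errors


# Invented samples: [mean utterance length, preposition count] per transcript.
SAMPLES = [[1.0, 0.0], [1.5, 2.0], [2.0, 3.0], [2.5, 7.0], [3.0, 8.0]]
# Synthetic ages in months, generated as 12 + 6 * length + 0.5 * prepositions.
AGES = [18.0, 22.0, 25.5, 30.5, 34.0]

if __name__ == "__main__":
    print(leave_one_out_errors(SAMPLES, AGES))
```

A single-metric baseline corresponds to fitting the same regression with only one feature column, which is how a weighted combination can be compared against an individual measure.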

This article briefly summarized selected speech and related papers presented at ACL 2012 in three sections, namely speech recognition, language modeling, and lexical and phonetic acquisition. For further details on the summarized papers and others, please refer to the ACL's online proceedings [1].

Asli Celikyilmaz is Senior Speech Scientist at Microsoft Silicon Valley. Her interests are in Natural Language Processing, Conversational Understanding and Machine Learning.


[1] "ACL Anthology - ACL 2012 proceedings," [Online]. ACL Anthology - ACL 2012.

[2] C.-y. Lee and J. Glass, "A Nonparametric Bayesian Approach to Acoustic Model Discovery," In ACL, 2012.

[3] S. Dusan and L. Rabiner, "On the relation between maximum spectral transition positions and phone boundaries," In INTERSPEECH, 2006.

[4] H. Tang, J. Keshet and K. Livescu, "Discriminative Pronunciation Modeling: A Large-Margin, Feature-Rich Approach," In ACL, 2012.

[5] T. Holter and T. Svendsen, "Maximum likelihood modelling of pronunciation variation," In Speech Communication, 1999.

[6] M. Riley, W. Byrne, M. Finke, S. Khudanpur, A. Ljolje, J. McDonough, H. Nock, M. Saraclar, C. Wooters and G. Zavaliagkos, "Stochastic pronunciation modelling from hand-labeled phonetic corpora," In Speech Communication, 1999.

[7] B. Hutchinson and J. Droppo, "Learning nonparametric models of pronunciation," In Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2011.

[8] K. Filali and J. Bilmes, "A dynamic Bayesian framework to model context and memory in edit distance learning: An application to pronunciation classification," In Proc. Association for Computational Linguistics (ACL), 2005.

[9] G. Zweig, P. Nguyen, D. V. Compernolle, K. Demuynck, L. Atlas, P. Clark, G. Sell, M. Wang, F. Sha, H. Hermansky, D. Karakos, A. Jansen, S. Thomas, S.G.S.V.S, S. Bowman and J. Kao, "Speech Recognition with Segmental Conditional Random Fields," In Interspeech, 2011.

[10] A. Rastrow, M. Dredze and S. Khudanpur, "Fast Syntactic Analysis for Statistical Language Modeling via Substructure Sharing and Uptraining," In ACL, 2012.

[11] M. Elsner, S. Goldwater and J. Eisenstein, "Bootstrapping a Unified Model of Lexical and Phonetic Acquisition," In ACL, 2012.

[12] S. Sahakian and B. Snyder, "Automatically Learning Measures of Child Language Development," In ACL, 2012.

[13] X. Peng, D. Ke and B. Xu, "Automated Essay Scoring Based on Finite State Transducer: towards ASR Transcription of Oral English Speech," In ACL, 2012.

If you have comments, corrections, or additions to this article, please contact the author: Asli Celikyilmaz, asli [at] ieee [dot] org.