Conditional Random Fields for Speech Recognition

Tara N. Sainath

SLTC Newsletter, January 2010

The use of Conditional Random Fields (CRFs) in speech recognition has gained significant popularity over the past few years. In this article, we highlight some of the main research efforts utilizing CRFs.

What are Conditional Random Fields?

Hidden Markov Models (HMMs), a very popular tool in speech recognition, are generative models. In other words, an HMM provides the joint probability distribution over observation sequences and labels. In contrast, discriminative models, such as conditional random fields (CRFs) and maximum entropy Markov models (MEMMs), model the posterior probability of a label sequence given the observation sequence. In a CRF, the conditional probability of an entire label sequence given an observation sequence is modeled with an exponential distribution. CRFs are attractive for a few reasons. First, they allow for the combination of multiple features, perhaps representing different sources of knowledge, while still providing a discriminative framework. In addition, CRFs are better suited than generative models such as HMMs to the use of unconstrained optimization algorithms.
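As a concrete illustration of the exponential form described above, the following is a minimal Python sketch of a linear-chain CRF. It assumes that the per-frame state scores and the label-to-label transition scores have already been computed as weighted sums of features; the function names and the score layout are illustrative assumptions, not taken from any of the papers cited in this article.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def sequence_score(labels, state_scores, trans_scores):
    """Unnormalized log-score of one complete label sequence.

    state_scores[t][y] : score of label y at frame t (assumed precomputed)
    trans_scores[p][y] : score of moving from label p to label y
    """
    score = state_scores[0][labels[0]]
    for t in range(1, len(labels)):
        score += trans_scores[labels[t - 1]][labels[t]]
        score += state_scores[t][labels[t]]
    return score


def log_partition(state_scores, trans_scores):
    """log Z(x): log-sum of exp(score) over every possible label sequence,
    computed with the forward recursion rather than by enumeration."""
    num_labels = len(trans_scores)
    alpha = list(state_scores[0])
    for t in range(1, len(state_scores)):
        alpha = [
            state_scores[t][y]
            + log_sum_exp([alpha[p] + trans_scores[p][y] for p in range(num_labels)])
            for y in range(num_labels)
        ]
    return log_sum_exp(alpha)


def sequence_probability(labels, state_scores, trans_scores):
    """Conditional probability of the whole label sequence given the observations."""
    log_prob = sequence_score(labels, state_scores, trans_scores)
    log_prob -= log_partition(state_scores, trans_scores)
    return math.exp(log_prob)
```

Because the normalizer sums over whole label sequences, the probabilities of all possible label sequences for a given observation sequence add up to one.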

Phonetic Classification

There have been numerous efforts in using CRFs for phone classification. In [1], researchers explore the use of hidden CRFs (HCRFs), which are CRFs with hidden state sequences. The paper focuses on comparing the discriminative benefits of a CRF model to a generative HMM. The HCRF is trained using stochastic gradient descent. Its performance is compared to that of maximum-likelihood (ML) trained HMMs and maximum mutual information (MMI) trained HMMs on the TIMIT phonetic classification task. The HCRF offers roughly a 4% absolute improvement over the HMM-ML system and 3% over the HMM-MMI system, showing the benefits of a discriminative model compared to a generative model.
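The role of the hidden state sequence in an HCRF can be written out explicitly. The following is a sketch in assumed notation (w for the phone label, o for the observation sequence, s for a hidden state sequence, f for the feature vector and \lambda for its weights); the specific feature definitions used in [1] are not reproduced here.

```latex
p(w \mid o; \lambda)
  = \frac{1}{z(o; \lambda)} \sum_{s} \exp\bigl(\lambda \cdot f(w, s, o)\bigr),
\qquad
z(o; \lambda)
  = \sum_{w'} \sum_{s} \exp\bigl(\lambda \cdot f(w', s, o)\bigr)
```

Summing over s marginalizes out the hidden states, so the model is normalized over labels only. Training then adjusts \lambda to increase the log of this conditional probability on the training data.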

Phonetic Recognition

[3] explores how CRFs make it possible to combine multiple feature streams. Specifically, it looks at a combination of state features and transition features. Transition features model the dependency between successive sequence labels. More importantly, they provide a Markov framework which allows the selection of the current label to be influenced by the previous labels. State feature functions are used to describe various phonetic attributes for a specific frame, including knowledge of voicing, sonority, manner of articulation, place of articulation, etc. Another advantage of the framework presented in [3] is that, unlike HMMs, which assume that the observed data are independent given the labels, CRFs do not make any assumptions about the interdependence of the data.
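To make the split between state and transition features concrete, here is a minimal Python sketch of decoding with both kinds of score. It assumes each frame is a dictionary of phonetic attribute detector outputs; the attribute names, labels, weights, and frames are hypothetical toy values chosen for illustration and are not taken from [3].

```python
def state_score(label, frame, state_weights):
    """State features: weighted sum of attribute detector outputs for one frame."""
    weights = state_weights[label]
    return sum(weights.get(attr, 0.0) * value for attr, value in frame.items())


def viterbi(frames, labels, state_weights, trans_weights):
    """Return the highest-scoring label sequence and its unnormalized score."""
    best = {y: state_score(y, frames[0], state_weights) for y in labels}
    backpointers = []
    for frame in frames[1:]:
        new_best, pointers = {}, {}
        for y in labels:
            # Transition features let the previous label influence the current one.
            prev = max(labels, key=lambda p: best[p] + trans_weights[(p, y)])
            pointers[y] = prev
            new_best[y] = (
                best[prev]
                + trans_weights[(prev, y)]
                + state_score(y, frame, state_weights)
            )
        best = new_best
        backpointers.append(pointers)
    # Trace the best path backwards from the best final label.
    last = max(labels, key=lambda y: best[y])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]


# Hypothetical toy setup: two phone labels and three attribute detectors.
LABELS = ["s", "m"]
STATE_WEIGHTS = {
    "s": {"voiced": -1.0, "fricative": 2.0},
    "m": {"voiced": 1.0, "nasal": 2.0},
}
TRANS_WEIGHTS = {("s", "s"): 0.5, ("s", "m"): 0.0, ("m", "s"): 0.0, ("m", "m"): 0.5}
FRAMES = [
    {"voiced": 0.1, "nasal": 0.0, "fricative": 0.9},
    {"voiced": 0.2, "nasal": 0.1, "fricative": 0.8},
    {"voiced": 0.9, "nasal": 0.8, "fricative": 0.1},
]

path, score = viterbi(FRAMES, LABELS, STATE_WEIGHTS, TRANS_WEIGHTS)
```

In this toy setup the self-transition weights mildly favor staying in the same label, which is one simple way a transition feature can smooth frame-level decisions.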

The paper explores phone recognition on TIMIT, comparing a monophone CRF to a tandem system, which uses the output of a multi-layer perceptron (MLP) as input features to a Gaussian-based HMM. A baseline HMM system using MFCC features is also included for comparison. The CRF is found to offer a 3% absolute improvement over the triphone HMM and a 4% absolute improvement over a tandem monophone HMM system. The CRF system offers comparable performance to a tandem triphone HMM system, but with far fewer parameters.

Word Recognition

The use of CRFs for large vocabulary continuous speech recognition (LVCSR) is explored in [5]. This paper explores the use of CRFs at the segment level, rather than at the frame level, which is most common in speech recognition systems. The states in the segmental CRF represent words, and the model considers the various segmentations that are consistent with the word sequence. The features used within the segmental CRF framework are defined at the segment level, and thus span multiple observations. A set of multi-scale detector streams is used to generate the segment-level features. Each detector stream detects a specific feature set and operates at a different time scale, including the frame, phone, multi-phone, syllable and word levels. Having features defined at the word level allows the use of long-span features such as formant trajectories, duration, and syllable stress patterns. On the Bing Mobile database, the segmental CRF offers a 2% absolute improvement over an HMM baseline.
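The idea of considering every segmentation consistent with a word sequence can be sketched in a few lines of Python. This is a minimal illustration that assumes at least as many frames as words, computes only the unnormalized score of one word sequence, and leaves out language model and word-transition features; the function names and the duration feature are hypothetical and are not the features used in [5].

```python
import math
from itertools import combinations


def segmentations(num_frames, num_words):
    """Yield every split of frames [0, num_frames) into num_words
    contiguous, non-empty segments, as (start, end) pairs."""
    for cuts in combinations(range(1, num_frames), num_words - 1):
        bounds = (0,) + cuts + (num_frames,)
        yield [(bounds[k], bounds[k + 1]) for k in range(num_words)]


def log_score_of_words(words, num_frames, segment_score):
    """Log of the sum, over all consistent segmentations, of
    exp(sum of segment-level scores). Assumes len(words) <= num_frames."""
    totals = [
        sum(segment_score(word, start, end) for word, (start, end) in zip(words, segs))
        for segs in segmentations(num_frames, len(words))
    ]
    peak = max(totals)
    return peak + math.log(sum(math.exp(t - peak) for t in totals))


def duration_score(word, start, end):
    """Hypothetical segment-level feature: penalize segments whose length
    in frames is far from one frame per letter of the word."""
    return -abs((end - start) - len(word))
```

A segment-level score function such as `duration_score` sees the whole span assigned to a word, which is what allows long-span features that a frame-level model cannot express directly.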

Language Modeling

The use of discriminative language modeling using CRFs and the perceptron algorithm is explored in [4]. A typical language model used to estimate word probabilities is a generative model, trained to maximize the probability of the words. In this paper, the CRF model is encoded as a deterministic weighted finite state automaton. The automaton is intersected with word lattices generated from a baseline recognizer, which allows features to be estimated for model updates. Unlike generative methods, the CRF training attempts to directly optimize the error rate. Using the CRF method for parameter estimation allows for a 1.8% absolute gain over the baseline generative LM.
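The flavor of discriminative language model training can be illustrated with a structured perceptron update. The sketch below is a simplification under stated assumptions: it works on n-best lists rather than on word lattices intersected with a weighted automaton, uses plain unigram and bigram counts as features, and represents each hypothesis as a hypothetical (words, baseline_score, num_errors) tuple. It is not the implementation from [4].

```python
from collections import Counter


def ngram_features(words, order=2):
    """Count the n-grams (up to `order`) in a sentence with boundary markers."""
    padded = ["<s>"] + list(words) + ["</s>"]
    feats = Counter()
    for n in range(1, order + 1):
        for i in range(len(padded) - n + 1):
            feats[tuple(padded[i:i + n])] += 1
    return feats


def model_score(hyp, weights):
    """Baseline recognizer score plus the weighted n-gram feature counts."""
    words, baseline_score, _ = hyp
    feats = ngram_features(words)
    return baseline_score + sum(weights.get(f, 0.0) * c for f, c in feats.items())


def perceptron_epoch(nbest_lists, weights):
    """One pass of perceptron updates; returns how many lists were misranked."""
    mistakes = 0
    for nbest in nbest_lists:
        oracle = min(nbest, key=lambda h: h[2])  # fewest word errors
        predicted = max(nbest, key=lambda h: model_score(h, weights))
        if predicted[0] != oracle[0]:
            mistakes += 1
            # Move weights toward the oracle features, away from the prediction.
            for f, c in ngram_features(oracle[0]).items():
                weights[f] = weights.get(f, 0.0) + c
            for f, c in ngram_features(predicted[0]).items():
                weights[f] = weights.get(f, 0.0) - c
    return mistakes


# Hypothetical toy data: the baseline score prefers the hypothesis with an error.
NBEST = [[(["a", "cat"], -1.0, 1), (["the", "cat"], -1.5, 0)]]
weights = {}
first_pass_mistakes = perceptron_epoch(NBEST, weights)
```

The update only fires when the current model ranks a hypothesis with more errors above the oracle, which is how training is tied to recognition errors rather than to word likelihood.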

Part-of-Speech Tagging

Finally, [2] explores the use of CRFs for building probabilistic models to segment and label sequence data. An HMM, a popular choice for segmenting and labeling sequence data, is a generative model which assigns a joint probability to observation and label sequences. This requires enumerating all possible observation sequences, making it difficult to use multiple features or capture long-range dependencies of the observations. In contrast to generative models, discriminative models such as CRFs and MEMMs do not spend modeling effort on the observations. In MEMMs, the output label associated with a specific state is modeled with an exponential distribution. Modeling in this fashion leads to what is known as the label bias problem, where the transitions leaving a state compete only against each other, rather than against all other transitions in the model. CRFs address the label bias problem of MEMMs by using a single exponential model for the entire label sequence given the observation sequence. The authors compare CRFs to MEMMs and HMMs on a part-of-speech tagging task, and observe improvements using the CRF approach.
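The difference between the two normalizations can be made explicit. The following is a sketch in assumed notation (y for the label sequence, x for the observation sequence, f for the feature vector, \lambda for its weights), written for a linear chain of length T; it is meant to illustrate the contrast rather than to reproduce the exact formulation in [2].

```latex
% MEMM: one exponential model per step, normalized locally given the previous label
p_{\mathrm{MEMM}}(y \mid x)
  = \prod_{t=1}^{T}
    \frac{\exp\bigl(\lambda \cdot f(y_{t-1}, y_t, x, t)\bigr)}
         {\sum_{y'} \exp\bigl(\lambda \cdot f(y_{t-1}, y', x, t)\bigr)}

% CRF: a single exponential model, normalized once over whole label sequences
p_{\mathrm{CRF}}(y \mid x)
  = \frac{\exp\Bigl(\sum_{t=1}^{T} \lambda \cdot f(y_{t-1}, y_t, x, t)\Bigr)}
         {\sum_{y'} \exp\Bigl(\sum_{t=1}^{T} \lambda \cdot f(y'_{t-1}, y'_t, x, t)\Bigr)}
```

In the MEMM form, each factor sums to one over the successors of y_{t-1}, so a state with few outgoing transitions passes on its probability mass almost regardless of the observation. In the CRF form, scores from different states compete in a single global normalizer.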

For more information, please see:

[1] A. Gunawardana, M. Mahajan, A. Acero and J. C. Platt, "Hidden Conditional Random Fields for Phone Classification," in Proc. Interspeech, 2005.
[2] J. Lafferty, A. McCallum and F. Pereira, "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data," in Proc. ICML, 2001.
[3] J. Morris and E. Fosler-Lussier, "Combining Phonetic Attributes Using Conditional Random Fields," in Proc. Interspeech, 2006.
[4] B. Roark, M. Saraclar, M. Collins and M. Johnson, "Discriminative Language Modeling with Conditional Random Fields and the Perceptron Algorithm," in Proc. ACL, 2004.
[5] G. Zweig and P. Nguyen, "A Segmental CRF Approach to Large Vocabulary Continuous Speech Recognition," in Proc. ASRU, 2009.

If you have comments, corrections, or additions to this article, please contact the author: Tara Sainath, tsainath [at] us [dot] ibm [dot] com.

Tara Sainath is a Research Staff Member at IBM T.J. Watson Research Center in New York. Her research interests are mainly in acoustic modeling.