The Kaldi Speech Recognition Toolkit

Arnab Ghoshal and Daniel Povey

SLTC Newsletter, February 2012

Kaldi is a free open-source toolkit for speech recognition research. It is written in C++ and provides a speech recognition system based on finite-state transducers, using the freely available OpenFst, together with detailed documentation and scripts for building complete recognition systems. The tools compile on commonly used Unix-like systems and on Microsoft Windows. The goal of Kaldi is to have modern and flexible code that is easy to understand, modify, and extend. Kaldi is released under the Apache License v2.0, which is highly nonrestrictive, making it suitable for a wide community of users. Kaldi is available from SourceForge.

Why another speech toolkit?

The work on Kaldi [1] started during the 2009 Johns Hopkins University summer workshop project titled "Low Development Cost, High Quality Speech Recognition for New Languages and Domains," where we were working on acoustic modeling using subspace Gaussian mixture models (SGMMs) [2]. In order to develop and test a new acoustic modeling technique, we needed a toolkit that was simple to understand and extend; had extensive linear algebra support; and came with a nonrestrictive license that allowed us to share our work with other researchers in academia or industry. We also preferred to use a finite-state transducer (FST) based framework.

While there were several potential choices for open-source ASR toolkits -- for example, HTK, Julius (both written in C), Sphinx-4 (written in Java), and the RWTH ASR toolkit (written in C++, and closest to Kaldi in terms of design and features) -- our specific requirements meant that we had to write many of the components, including the decoder, ourselves. Given the amount of effort invested and the continued use of the tools following the JHU workshop, it was a logical choice to extend the codebase into a full-featured toolkit. We had two follow-up summer workshops at the Brno University of Technology, Czech Republic, in 2010 and 2011, and further development of Kaldi is ongoing.

Design of Kaldi

Important aspects of Kaldi include:

  • Integration with finite-state transducers: We compile against the OpenFst toolkit (using it as a library).
  • Extensive linear algebra support: We include a matrix library that wraps standard BLAS and LAPACK routines.
  • Extensible design: We attempt to provide our algorithms in the most generic form possible. For instance, our decoders work with an interface that provides a score for a particular frame and FST input symbol. Thus the decoder could work from any suitable source of scores.
  • Open license: The code is licensed under Apache v2.0, which is one of the least restrictive licenses available.
  • Complete recipes: We make available complete recipes for building state-of-the-art speech recognition systems that work from widely available databases such as those provided by the Linguistic Data Consortium (LDC).
  • Thorough testing: The goal is for all or nearly all the code to have corresponding test routines.
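
To illustrate the extensible-design point above, here is a minimal sketch of a decoder that depends only on a source of scores indexed by frame and input symbol. It is written in Python for brevity and is not Kaldi's C++ API: the names TableScorer, log_likelihood, and viterbi, the arc tuple layout, and the toy graph are all hypothetical, and the graph is assumed to contain no epsilon arcs.

```python
import math
from typing import Dict, List, Set, Tuple

# Hypothetical names throughout: this only illustrates a decoder that needs
# nothing from the acoustic model except num_frames() and log_likelihood().


class TableScorer:
    """Score source backed by a precomputed table (one dict per frame)."""

    def __init__(self, table: List[Dict[int, float]]) -> None:
        self.table = table

    def num_frames(self) -> int:
        return len(self.table)

    def log_likelihood(self, frame: int, symbol: int) -> float:
        return self.table[frame][symbol]


# An arc is (source state, destination state, input symbol, graph log-prob).
Arc = Tuple[int, int, int, float]


def viterbi(arcs: List[Arc], start: int, finals: Set[int], scorer) -> Tuple[float, List[int]]:
    """Return (best log score, input-symbol sequence) ending in a final state."""
    best = {start: (0.0, [])}  # state -> (score, symbols consumed so far)
    for t in range(scorer.num_frames()):
        nxt: Dict[int, Tuple[float, List[int]]] = {}
        for src, dst, sym, graph_lp in arcs:
            if src not in best:
                continue
            score = best[src][0] + graph_lp + scorer.log_likelihood(t, sym)
            if dst not in nxt or score > nxt[dst][0]:
                nxt[dst] = (score, best[src][1] + [sym])
        best = nxt
    ending = [entry for state, entry in best.items() if state in finals]
    if not ending:
        return (-math.inf, [])
    return max(ending, key=lambda entry: entry[0])


# Toy graph: state 0 loops on symbol 1, moves to final state 1 on symbol 2.
toy_arcs = [
    (0, 0, 1, math.log(0.5)),
    (0, 1, 2, math.log(0.5)),
    (1, 1, 2, 0.0),
]
toy_scores = TableScorer([{1: -1.0, 2: -3.0}, {1: -2.5, 2: -0.5}])
score, path = viterbi(toy_arcs, start=0, finals={1}, scorer=toy_scores)
```

Because viterbi calls only num_frames and log_likelihood, the table-backed scorer could be replaced by any other object offering those two methods without changing the search code.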

Kaldi has an open and distributed development model, with a growing community of users and contributors. The original authors moderate contributions to the project.

Features Supported in Kaldi

We intend Kaldi to support all commonly used techniques in speech recognition. The toolkit currently supports:

  • MFCC and PLP front-end, with cepstral mean and variance normalization, LDA, STC/MLLT, HLDA, VTLN, etc.
  • Modeling of context-dependent phones of arbitrary context lengths.
  • HMM/GMM acoustic models; phonetic decision trees.
  • Semi-continuous hidden Markov models [4].
  • Subspace Gaussian mixture models [2].
  • Speaker adaptation and adaptive training.
  • WFST-based decoders with lattice generation [3].
  • Lattice rescoring with acoustic and language models.
  • Discriminative training with MMI and boosted MMI (fMPE is under development).
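
As an illustration of one of the front-end steps listed above, the following is a minimal sketch of per-utterance cepstral mean and variance normalization. It is not Kaldi's implementation: the function name cmvn, the list-of-lists feature layout, and the variance floor eps are assumptions made for this example.

```python
import math
from typing import List


def cmvn(frames: List[List[float]], eps: float = 1e-10) -> List[List[float]]:
    """Normalize each feature dimension to zero mean and unit variance.

    Statistics are computed over all frames of one utterance. The floor
    `eps` (an assumption of this sketch) avoids division by zero when a
    dimension is constant.
    """
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    variances = [sum((f[d] - means[d]) ** 2 for f in frames) / n for d in range(dim)]
    stds = [math.sqrt(max(v, eps)) for v in variances]
    return [[(f[d] - means[d]) / stds[d] for d in range(dim)] for f in frames]


# Three frames of made-up two-dimensional features.
features = [[1.0, 10.0], [2.0, 14.0], [3.0, 18.0]]
normalized = cmvn(features)
```

After normalization, every dimension of the utterance has mean zero and variance one, which reduces the effect of channel and speaker differences on the cepstral features.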

There is currently no language modeling code, but we support converting ARPA-format LMs to FSTs. In the recipes released with Kaldi, we use the freely available IRSTLM toolkit. However, one could potentially use a more fully featured toolkit like SRILM. Current strands of development include discriminative training with MPE, an interface for transparent computation on CPUs and GPUs, hybrid ANN/HMM systems, etc.
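
To give a feel for what converting an ARPA-format LM to an FST involves, here is a minimal sketch of the common backoff construction for a bigram model. It is not Kaldi's conversion code: the function arpa_bigram_to_fst, the use of history words as state names, and the tiny LM are hypothetical, and only 1-gram and 2-gram sections are handled.

```python
import math
from typing import Dict, List, Tuple

LN10 = math.log(10.0)  # ARPA stores log10 probabilities
EPS = "<eps>"          # label used on backoff arcs


def arpa_bigram_to_fst(arpa_text: str):
    """Return (arcs, final_weights, start_state) for a unigram/bigram ARPA LM.

    Arcs are (src, dst, label, weight) with weight = -ln(probability).
    States are named by their one-word history; "" is the backoff state.
    """
    arcs: List[Tuple[str, str, str, float]] = []
    finals: Dict[str, float] = {}
    order = 0
    for line in arpa_text.splitlines():
        line = line.strip()
        if not line or line.startswith("ngram ") or line in ("\\data\\", "\\end\\"):
            continue
        if line.startswith("\\") and line.endswith("-grams:"):
            order = int(line[1:line.index("-")])
            continue
        fields = line.split()
        cost = -float(fields[0]) * LN10
        if order == 1:
            word = fields[1]
            backoff = -float(fields[2]) * LN10 if len(fields) > 2 else 0.0
            if word == "</s>":
                finals[""] = cost  # ending the sentence from the backoff state
            else:
                if word != "<s>":  # <s> is never predicted, only a history
                    arcs.append(("", word, word, cost))
                arcs.append((word, "", EPS, backoff))
        elif order == 2:
            history, word = fields[1], fields[2]
            if word == "</s>":
                finals[history] = cost
            else:
                arcs.append((history, word, word, cost))
        else:
            raise ValueError("this sketch handles only 1-grams and 2-grams")
    return arcs, finals, "<s>"


# A made-up LM with a single word, for illustration only.
tiny_arpa = r"""
\data\
ngram 1=3
ngram 2=2

\1-grams:
-1.0 </s>
-99 <s> -0.5
-0.3 hello -0.4

\2-grams:
-0.2 <s> hello
-0.1 hello </s>

\end\
"""
arcs, finals, start = arpa_bigram_to_fst(tiny_arpa)
```

Each explicit bigram becomes a word-labeled arc between history states, while unseen bigrams are reached by taking the epsilon backoff arc to the backoff state and then a unigram arc, so the path weight is the backoff weight plus the unigram cost.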

Acknowledgements

The contributors to the Kaldi project are: Gilles Boulianne, Lukas Burget, Arnab Ghoshal, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Navdeep Jaitly, Stefan Kombrink, Petr Motlicek, Daniel Povey, Yanmin Qian, Korbinian Riedhammer, Petr Schwarz, Jan Silovsky, Georg Stemmer, Karel Vesely, and Chao Weng.

We would like to thank Michael Riley, who visited us in Brno to deliver lectures on finite-state transducers and helped us understand OpenFst; Henrique (Rico) Malvar of Microsoft Research for allowing the use of his FFT code; and Patrick Nguyen for help with WSJ recipes and introducing the participants in the JHU workshop of 2009. We would like to acknowledge the help with coding and documentation from Sandeep Boda and Sandeep Reddy (sponsored by Go-Vivace Inc.) and Haihua Xu. We thank Pavel Matejka (and Phonexia s.r.o.) for allowing the use of feature processing code.

We would like to acknowledge the support of Geoffrey Zweig and Alex Acero at Microsoft Research, and Dietrich Klakow at Saarland University. We are grateful to Jan (Honza) Cernocky for helping us organize the workshop at the Brno University of Technology during August 2010 and 2011. Thanks to Tomas Kasparek for system support and Renata Kohlova for administrative support.

Finally, we would like to acknowledge participants and collaborators in the 2009 Johns Hopkins University Workshop, including Mohit Agarwal, Pinar Akyazi, Martin Karafiat, Feng Kai, Ariya Rastrow, Richard C. Rose and Samuel Thomas; and faculty and staff at JHU for their help during that workshop, including Sanjeev Khudanpur, Desiree Cleves, and the late Fred Jelinek.

References

[1] D. Povey, A. Ghoshal, et al., "The Kaldi Speech Recognition Toolkit," in IEEE ASRU, 2011.
[2] D. Povey, L. Burget, et al., "The subspace Gaussian mixture model--A structured model for speech recognition," Computer Speech & Language, 25(2), pp. 404-439, April 2011.
[3] D. Povey, M. Hannemann, et al., "Generating Exact Lattices in the WFST Framework," in IEEE ICASSP, 2012 (to appear).
[4] K. Riedhammer, T. Bocklet, A. Ghoshal and D. Povey, "Revisiting Semi-Continuous Hidden Markov Models," in IEEE ICASSP, 2012 (to appear).

Kaldi page on SourceForge

Arnab Ghoshal is a Research Associate at The University of Edinburgh. Email: a.ghoshal@ed.ac.uk

Daniel Povey is an Associate Research Scientist at The Johns Hopkins University Human Language Technology Center of Excellence. Email: dpovey@gmail.com