Automatic speech recognition: can we speak far from the microphones?
Maurizio Omologo
SLTC Newsletter, February 2011
Automatic Speech Recognition (ASR) technologies are often used under highly mismatched conditions, owing to environmental noise and room acoustics (in particular, reverberation), which combine with the speech input at the microphone. The degradation of the input signal, and the consequent drop in ASR performance, can be very significant as the distance between user and microphone increases [1].
Although this limitation is not a crucial problem for applications in which the user can hold a microphone or wear a head-mounted one (e.g., dictation), in many other cases it is today one of the main obstacles to a wide acceptance of voice technologies among consumers. During the last decade, ASR has become an increasingly popular core technology in several application domains in which no constraints on user-microphone distance should apply (e.g., home automation, support for impaired users, robot companions, gaming, etc.).
Over the years, significant progress has been made in microphone array processing and technologies. Very effective array solutions and devices have been built, and are also available on the market, for capturing talkers in conferencing applications and meetings. However, when the array processing has been designed with the primary goals of tracking the speaker, enhancing distant speech, and reducing acoustic echo in a communication, its use in ASR-based applications may often be limited to rather “controlled” and simple tasks. In other words, the current fragility of ASR technology in the face of all the variability introduced by distant-talking interaction cannot be overcome simply by replacing the close-talking microphone with a microphone array and its processing. For more complex recognition and understanding tasks, the characteristics of the array output signal are in general still far from the ideal signal that the ASR engine would expect.
Recently, under the European Union (EU)-funded project DICIT (Distant-talking Interfaces for Control of Interactive TV) [2], we observed that some benefit can be obtained by training the acoustic models of the ASR engine on signals that have been pre-processed by the multi-microphone front-end (including processing for automatic speaker localization, adaptive beamforming, and acoustic echo cancellation). The reason for this improvement is that the typical distortions introduced by the given front-end become knowledge that the system learns and then exploits. This is, however, a trivial and crude approach that does not give us a deep understanding of the problem.
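As a rough illustration of the kind of multi-microphone processing mentioned above, the following Python sketch shows a fixed delay-and-sum beamformer. This is a much simpler technique than the adaptive beamforming used in DICIT and is not the project's implementation. The function name, the integer steering delays (assumed to be known, for instance from a speaker localization step), and the toy signals are all assumptions made for the example.

```python
def delay_and_sum(channels, delays):
    """Align each channel by its integer sample delay, then average.

    channels: list of equal-length lists of samples, one per microphone.
    delays:   hypothetical per-microphone arrival delays in samples (>= 0),
              assumed to be known in advance.
    """
    if len(channels) != len(delays):
        raise ValueError("need exactly one delay per channel")
    n = len(channels[0])
    out = [0.0] * n
    for samples, d in zip(channels, delays):
        for t in range(n):
            src = t + d  # advance the channel to undo its arrival delay
            if src < n:
                out[t] += samples[src]
    # Averaging keeps the aligned speech at its original level, while
    # components that are not aligned across channels are attenuated.
    return [v / len(channels) for v in out]


# Toy example: the same pulse reaches three microphones 0, 1 and 2 samples late.
pulse = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
mics = [
    pulse,
    [0.0] + pulse[:-1],
    [0.0, 0.0] + pulse[:-2],
]
enhanced = delay_and_sum(mics, [0, 1, 2])   # steered towards the talker
unsteered = delay_and_sum(mics, [0, 0, 0])  # no alignment: the pulse is smeared
```

With the correct delays the pulse is recovered at full amplitude, whereas without alignment its energy is spread over several samples. A real front-end would additionally need fractional delays, adaptive weights, and echo cancellation, none of which is shown here.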
Our efforts so far at the Fondazione Bruno Kessler (FBK)-IRST labs in building showcases and prototypes of distant-talking interaction have also shown us that applying a speech enhancement technique that seems very good at the perceptual level, or that decreases the word error rate in relative terms for a given recognition task, often does not translate into any significant improvement in understanding capabilities.
Another observation worth recalling (many others are omitted here for lack of space) concerns the quality of the array device: in our experience, only a microphone array with very good performance (e.g., in terms of the quality of analog-to-digital conversion and of immunity to electrical noise) makes it possible to obtain experimental results that confirm the validity of a given theory and of the related practical solutions.
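To make the distinction between word error rate and understanding concrete, here is a minimal Python sketch that computes the word error rate (WER) as a word-level edit distance, together with the relative WER reduction between two systems. The function names and the example sentences are invented for illustration and are not taken from any FBK experiment.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer, improved_wer):
    """Fraction of the baseline errors removed by the improved system."""
    return (baseline_wer - improved_wer) / baseline_wer


# Hypothetical TV-control command and two recognizer outputs.
reference = "switch to channel five"
baseline = word_error_rate(reference, "switch the channel nine")
improved = word_error_rate(reference, "switch to channel nine")
gain = relative_wer_reduction(baseline, improved)
```

In this toy case the second output halves the WER, yet the word that carries the meaning of the command (the channel number) is still wrong, so the system would act incorrectly in both cases.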
Many innovation steps are needed to reach a target such as the human-computer interaction capability shown by HAL 9000 in 2001: A Space Odyssey, or in Star Wars, i.e., recognizing, understanding, and reacting intelligently to fluent spontaneous speech, with the human far from the microphones and free to move. Realizing this type of scenario, based for instance on several microphone arrays distributed in space, is extremely complex because of the diverse disciplines involved, from acoustics-oriented signal processing to spoken dialogue management and human-computer interaction design. The best results will probably come from approaches based both on a deep knowledge of all the basic technical problems and on a synergistic combination of the related techniques.
Different research communities are clearly interested in the above-mentioned technical problems. For a more effective process of innovation and progress (as very well highlighted in [1]), all of these communities of experts in different fields should interact and cooperate closely, sharing corpora and tasks, tackling together the same final objectives and targeted scenarios, and eventually benefiting from this complementarity.
In this direction, some actions have been taken worldwide in recent years. This article does not aim to survey the state of the art in this regard. However, let us mention the EU-funded projects CHIL [3], AMI and AMIDA [4], which addressed both microphone array processing and speech recognition technologies, with application scenarios involving lectures and meetings. More recently, the EU-funded DICIT project built a spoken dialogue system for voice-enabled control of TV and access to related information, which supported three languages (English, German, and Italian). Public documents as well as video clips showing the final DICIT prototype in use under real noisy conditions can be found on the project website [2].
In terms of international actions, it is also worth mentioning the HSCMA workshop [5], which will be held this year in Edinburgh and aims to continue a tradition originally initiated by two camps of specialists, one mainly comprising experts in acoustics-oriented signal processing and Microphone Arrays (MA), and the other composed largely of experts in Hands-free Speech Communication (HSC), most especially in automatic speech recognition.
Another forthcoming event that deserves mention is the PASCAL CHiME Speech Separation and Recognition Challenge [6], [7], which addresses the problem of separating and recognizing speech artificially mixed with other speech. Speech separation is, in fact, another frontier in making distant-talking interaction systems robust enough to handle multiple subjects speaking simultaneously, to track them in space, and, in general, to reduce the negative impact on ASR of other active sources that may interfere with the user.
For any questions or comments, please contact Maurizio Omologo at the following e-mail address: omologo@fbk.eu. Information about the research activities being conducted at FBK-irst on these topics is also available at http://shine.fbk.eu.
For more information, see:
References
- [1] Matthias Wölfel and John McDonough, Distant Speech Recognition, John Wiley & Sons, 2009.
- [2] http://dicit.fbk.eu
- [3] http://chil.server.de
- [4] http://www.amiproject.org/
- [5] http://www.hscma2011.org/
- [6] http://www.dcs.shef.ac.uk/spandh/chime/challenge.html
- [7] http://www.signalprocessingsociety.org/technical-committees/list/sl-tc/spl-nl/2010-11/pascal-chime/
Maurizio Omologo is a Senior Researcher and Project Leader at the Fondazione Bruno Kessler in Trento, Italy. Email: omologo@fbk.eu

