Updates in Voice Browsing and Multimodal Interaction
SLTC Newsletter, July 2009
The first decade of the 2000s has seen speech applications spread into many different fields. This was the result not only of mature core speech technologies for automatic speech recognition (ASR), text-to-speech (TTS), and speaker identification and verification (SIV), but also of new ways of crafting spoken dialog systems (SDS).
The first large-scale SDS were created at the end of the 1990s as a result of successful research projects, including DARPA's ATIS and COMMUNICATOR and the EU-funded ARISE project. At the same time, the most prominent research centers were bringing their speech technologies to market: in the USA, the original Nuance (a spin-off of SRI); in France, Telisma (from CNET); and in Italy, Loquendo (from CSELT/Telecom Italia).
SDS and speech technologies in general were deployed either through proprietary integrations or through proprietary SDKs for telephony platforms. It was very hard to find commonalities beyond the core technologies used for ASR, TTS and SIV.
This was the moment when standards bodies, primarily the W3C, started to play a determining role. I would like to give a brief account of how it happened, why, and what the major achievements have been so far.
1. The Birth of the W3C Voice Browser
Something new was in the air: many people in research and industry were thinking about how to change the paradigm of SDS implementation. Among them was a group (see Figure 1) from four leading companies: AT&T, Lucent, IBM and Motorola. They proposed a first specification draft and then, in early 2000, released a new XML-based language for implementing spoken dialogs: VoiceXML 1.0. The four companies founded an industry consortium, the VoiceXML Forum, to promote the new language.
Figure 1. Preparing to announce VoiceXML 1.0 - Friday Feb. 25th, 1999 - Lucent, Naperville, Illinois - Left to right: Linda Boyer (IBM), Gerald Karam (AT&T), Ken Rehor (Lucent), Pete Danielsen (Lucent), Bruce Lucas (IBM), Dave Ladd (Motorola), Jim Ferrans (Motorola).
At around the same time, Jim Larson and Dave Raggett launched a W3C Workshop on Voice Browsing. Its success led to the creation of a W3C working group devoted to realizing the Voice Browsing idea. In 2000, the W3C Voice Browser Working Group (VBWG) started, with Jim Larson and Scott McGlashan as co-chairs, and it took VoiceXML 1.0 as its first contribution toward a new standard.
The idea behind Voice Browsing was to find a convergence between SDS creation and Web techniques, which by that time were spreading the Web paradigm that still shapes the applications we use today. Concepts like the separation of content production (Web application) from content consumption (Web browser), HTTP as the interface protocol, and XML as the content format were at the core of Internet adoption. The same paradigm was expected to benefit SDS application creation: speech applications could be dynamically generated by Web application tools, and the speech platform would be transformed into a specialized browser bearing many similarities to Web browsers (HTTP, XML, document caching, etc.).
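To make the paradigm concrete, here is a minimal, purely illustrative VoiceXML 2.0 document of the kind a Web application might generate and a voice browser would fetch over HTTP (the grammar file and submission URI are invented for the example):

  <?xml version="1.0" encoding="UTF-8"?>
  <!-- Illustrative only: the voice browser plays the prompt with TTS, collects
       a spoken answer with ASR using the referenced grammar, and submits the
       result back to the web application, much as an HTML browser submits a form. -->
  <vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
    <form id="city">
      <field name="destination">
        <prompt>Which city would you like to fly to?</prompt>
        <grammar src="cities.grxml" type="application/srgs+xml"/>
      </field>
      <block>
        <submit next="http://example.com/booking" namelist="destination"/>
      </block>
    </form>
  </vxml>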
2. The W3C Speech Interface Framework
A seminal picture (see Figure 2), drawn by Jim Larson, was called the W3C Speech Interface Framework. It was an attempt to capture all the components inside a speech platform and to name the standards that might govern them. The idea was to break the VoiceXML language into a set of specialized but coordinated specifications and then bring each of them to W3C Recommendation status, that is, a real standard for the benefit of the speech industry.
Figure 2. Initial W3C Speech Interface Framework.
Anyone who attended the W3C VBWG conference calls at that time recalls the unfolding of these ideas in a large and collaborative group, which soon grew to around fifty leading companies and institutions.
The first outcomes were released in 2004; three years is a very short time for a standardization body that must take into account public reviews, implementation reports and public revisions. The first three Recommendations were VoiceXML 2.0, SRGS 1.0 and SSML 1.0: the first for dialog, the second for speech recognition grammars, and the third for speech synthesis. These were the basic building blocks of the new architecture, and the industry immediately adopted them to transform traditional platforms into voice browsing platforms. It was a radical change: the industry began reworking all of its platforms and the way speech applications were created. No more proprietary SDKs, but Web application techniques to create applications; no more proprietary control of the speech technologies, but standard control of them.
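As a small illustration of how standard control of a synthesizer looks, the SSML 1.0 fragment below is a simplified sketch of a prompt (the flight announcement text is invented); the markup drives voice selection, pauses and emphasis without any vendor-specific SDK:

  <?xml version="1.0" encoding="UTF-8"?>
  <!-- A simplified SSML 1.0 prompt: voice choice, a pause, and emphasis are
       all expressed in standard markup rather than proprietary API calls. -->
  <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <voice gender="female">
      Your flight leaves at <say-as interpret-as="time">8:45am</say-as>.
      <break time="500ms"/>
      Please be at the gate <emphasis>thirty minutes</emphasis> early.
    </voice>
  </speak>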
A second round of Recommendations came in 2007 with the release of VoiceXML 2.1 and SISR 1.0. The first adds eight features to VoiceXML 2.0 to strengthen the language, and the second complements ASR grammars with a powerful semantic interpretation language. At this point an ASR engine could be completely controlled by standard speech grammars.
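For instance, an SRGS 1.0 grammar can carry SISR 1.0 tags that map the recognized words onto application-level semantics. The sketch below is deliberately simple (a yes/no confirmation) and the vocabulary is only an example:

  <?xml version="1.0" encoding="UTF-8"?>
  <!-- An illustrative SRGS 1.0 grammar with SISR 1.0 tags (ECMAScript semantics):
       the recognizer returns a normalized boolean value, not just the words heard. -->
  <grammar version="1.0" xmlns="http://www.w3.org/2001/06/grammar"
           xml:lang="en-US" root="answer" tag-format="semantics/1.0">
    <rule id="answer" scope="public">
      <one-of>
        <item>yes <tag>out = true;</tag></item>
        <item>yeah <tag>out = true;</tag></item>
        <item>no <tag>out = false;</tag></item>
        <item>nope <tag>out = false;</tag></item>
      </one-of>
    </rule>
  </grammar>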
The following year, 2008, brought another Recommendation: PLS 1.0, a standard language for pronunciation lexicons that can be referenced for ASR by SRGS 1.0 and for TTS by SSML 1.0. Unfortunately, a standard for the representation of N-grams is still to be completed; currently only a Working Draft called “Stochastic Language Models (N-Gram) Specification” is available. Moreover, 2009 is likely to be the year in which the last piece, the call control language CCXML 1.0, becomes a W3C Recommendation.
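A pronunciation lexicon is itself a small XML document. The sketch below is a minimal PLS 1.0 lexicon with a single invented entry and an approximate transcription, shown only to illustrate that the same entry can tune both recognition (via SRGS) and synthesis (via SSML):

  <?xml version="1.0" encoding="UTF-8"?>
  <!-- A minimal, illustrative PLS 1.0 lexicon: one grapheme with an IPA
       pronunciation; the word and transcription are examples only. -->
  <lexicon version="1.0" xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
           alphabet="ipa" xml:lang="en-US">
    <lexeme>
      <grapheme>Loquendo</grapheme>
      <phoneme>loˈkwɛndo</phoneme>
    </lexeme>
  </lexicon>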
If you look at the updated picture (see Figure 3), you can see that many of the standards are a reality today and that the majority of the components can be completely controlled through standard interfaces. This is a lasting W3C contribution to the speech industry, and it shapes the applications in use today.
Figure 3. Current W3C Speech Interface Framework.
3. W3C Multimodal Interaction
Largely in parallel, and tightly intertwined with it, was the work done in the W3C on Multimodal Interaction. A new working group was created in 2002 to address applications with multiple input and output modalities. While in traditional telephony the modalities were restricted to speech and keypad, applications on multimedia PCs, tablet PCs, notebooks, and smaller devices such as PDAs and smartphones may benefit from multiple modalities that complement each other and combine to create a richer and more flexible user experience. This area also covers accessibility for users with disabilities and constraining environmental conditions, for instance the use of hands-free devices while driving a vehicle.
So far this appealing area has been less explored than Voice Browsing, but the time now seems right for the richness of multimodality.
As regards the standards produced by the W3C MMI Working Group (MMIWG), the first is EMMA 1.0, released early this year: a rich markup language for representing semantic content derived from different input modalities, such as speech, gesture and handwriting. EMMA is a candidate to become the content format of future protocols, web services, and applications. It is an exchange format generated by modality components and consumed by interaction managers. Voice Browsing will also benefit from its use, for instance through the introduction of word and semantic lattices in addition to the traditional N-best results from speech recognition and spoken language understanding components, or by having EMMA 1.0 as the standard result format produced by the IETF MRCPv2 protocol.
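The fragment below is a simplified sketch of an EMMA 1.0 result, of the kind a speech recognizer acting as a modality component might hand to an interaction manager; the application payload (origin and destination) and the confidence value are invented for illustration:

  <?xml version="1.0" encoding="UTF-8"?>
  <!-- An illustrative EMMA 1.0 document: the emma:* annotations describe how
       the input was captured, while the payload carries application semantics. -->
  <emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
    <emma:interpretation id="int1"
                         emma:medium="acoustic" emma:mode="voice"
                         emma:confidence="0.87"
                         emma:tokens="flights from boston to denver">
      <origin>Boston</origin>
      <destination>Denver</destination>
    </emma:interpretation>
  </emma:emma>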
Another standard that is almost mature enough to become a Recommendation is InkML 1.0, a markup language for representing digital ink in drawings, handwriting and annotations. Several industry players are interested in adopting InkML as their internal language for representing ink and in making it accessible to developers in the near future.
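At its simplest, an InkML document is a list of pen strokes, each given as a sequence of points. The sketch below is minimal and the coordinates are invented:

  <?xml version="1.0" encoding="UTF-8"?>
  <!-- A minimal InkML 1.0 sketch: each trace is one pen stroke,
       written as comma-separated x y points. -->
  <ink xmlns="http://www.w3.org/2003/InkML">
    <trace>10 0, 9 14, 8 28, 7 42, 6 56</trace>
    <trace>25 0, 30 20, 38 45, 49 70</trace>
  </ink>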
A longer-term goal of the MMIWG is the specification of a Multimodal Architecture that allows industrial players to build highly distributed systems out of extensible modality components. A Working Draft of the W3C Multimodal Architecture is publicly available.
The advent of voice search applications, such as Tellme/Microsoft mobile search, Google Mobile Search, and AT&T Speak4It, might be a driving factor pushing advanced multimodal interfaces onto mobile devices. Another interesting approach is based on speech mash-up applications like the one proposed by AT&T Research Labs (see the previous SLTC article here).
4. Next Generation of Voice Browsing Standards
While the MMI field is in its first phase of standard production, the W3C VBWG has completed its first round of standards and is already working on a second generation to complete and improve the current framework. Here is a brief summary of the main drivers.
4.1 Speaking in Tongues
One general goal of the W3C is to create the foundation for a multilingual Web, which it addresses and monitors through the W3C Internationalization (I18N) Activity. This activity has a direct impact on the speech technology arena, because speech standards should be usable with any language in the world.
All these standards are XML languages, so character encodings come from the XML framework itself, and the identification of languages is linked to IETF BCP 47, which defines language subtags, maintained in a registry, so that all the languages of the world can be identified in a standard way.
Nevertheless, specific language-dependent aspects pertain to the capabilities covered by the speech standards themselves. For example, SSML 1.0, the markup language for directing a TTS engine in a standard way, was found to lack flexibility for East Asian languages: support for tonal languages (e.g., Mandarin Chinese) was poor, and there was no way to add controls below the word level, which is important for such languages.
These motivations were at the core of three workshops organized by the W3C in Beijing, Crete and Hyderabad to assess the internationalization needs. The result is SSML 1.1, a specification that is already at an advanced stage of standardization.
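As a hedged sketch of what this enables, the fragment below uses a BCP 47 language tag for Mandarin Chinese, a token element to mark a word boundary in a script without spaces, and a phoneme element to override pronunciation below the word level; the IPA transcription is approximate and purely illustrative:

  <?xml version="1.0" encoding="UTF-8"?>
  <!-- A simplified SSML 1.1 sketch for Mandarin Chinese: xml:lang carries the
       BCP 47 tag, token marks word segmentation, and phoneme overrides the
       pronunciation of one word (transcription is illustrative only). -->
  <speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="zh-CN">
    <token>你好</token>
    <phoneme alphabet="ipa" ph="ʂaŋ˥˩ xaɪ˨˩˦">上海</phoneme>
  </speak>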
4.2 State-Charts Markup
Another interesting standardization area was driven by the need for a common language to define state charts that control the synchronization of different processors. This resulted in a new markup language called SCXML 1.0, which transposes Harel's statechart formalism (already adopted by UML) into XML.
This language will be used by voice browser platforms to add flexibility and portability to future voice applications, but it will also be available for other uses. Interest in the new language is very high, and several open source implementations are already available even though the standardization work is still in progress.
SCXML 1.0 will subsume CCXML 1.0, which can be seen as a specialization of state charts for call control; a small illustrative state machine is sketched below.
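The following minimal SCXML 1.0 document is written in the spirit of call handling; the state and event names are invented for illustration:

  <?xml version="1.0" encoding="UTF-8"?>
  <!-- A minimal SCXML 1.0 state machine: three states and the events that
       move the machine between them; names are illustrative only. -->
  <scxml xmlns="http://www.w3.org/2005/07/scxml" version="1.0" initial="idle">
    <state id="idle">
      <transition event="call.incoming" target="ringing"/>
    </state>
    <state id="ringing">
      <transition event="call.answered" target="connected"/>
      <transition event="call.rejected" target="idle"/>
    </state>
    <state id="connected">
      <transition event="call.hangup" target="idle"/>
    </state>
  </scxml>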
4.3 VoiceXML next version: 3.0
While VoiceXML 2.0/2.1 is widely used and present on virtually all voice and telephony platforms in the market, new functionality is being requested for the benefit of a new generation of voice applications. It includes Speaker Identification and Verification, a promising area in which speech biometrics complement other biometric techniques, as well as finer-grained control of audio playback to simplify the creation of high-density applications for message recording and prompt playback. The latter is needed for the “convergence” of traditional and IP telephony inside future telco networks, such as the IMS architecture, where VoiceXML is again the candidate common language for application development. A Working Draft of VoiceXML 3.0 is already available.
This update was meant to give a precise picture of the achievements reached so far in a fruitful area of speech-related standards. I admit I have been biased towards W3C activities; other standards bodies, for instance the IETF, are complementing the W3C work in other important areas, such as the protocols.
The story is a success that has shaped how speech technologies are applied in the market today. The most recent advances have been outlined above; I hope they will deliver benefits in the near future, even if the timing of standardization bodies is never entirely under anyone's control.
References : General
- W3C http://www.w3.org
- VoiceXML 1.0 http://www.w3.org/TR/2000/NOTE-voicexml-20000505/
- VoiceXML Forum http://www.voicexml.org
- VBWG http://www.w3.org/Voice/
References : W3C Speech Interface Framework
- VoiceXML 2.0 http://www.w3.org/TR/voicexml20/
- SRGS 1.0 http://www.w3.org/TR/speech-grammar/
- SSML 1.0 http://www.w3.org/TR/speech-synthesis/
- VoiceXML 2.1 http://www.w3.org/TR/voicexml21/
- SISR 1.0 http://www.w3.org/TR/semantic-interpretation/
- PLS 1.0 http://www.w3.org/TR/pronunciation-lexicon/
- Stochastic Language Models (N-Gram) Specification http://www.w3.org/TR/ngram-spec/
- CCXML 1.0 http://www.w3.org/TR/ccxml/
References : Others
- W3C MMI working group http://www.w3.org/2002/mmi/
- EMMA 1.0 http://www.w3.org/TR/emma/
- IETF MRCPv2 http://tools.ietf.org/html/draft-ietf-speechsc-mrcpv2-19
- InkML 1.0 http://www.w3.org/TR/InkML/
- Working Draft of the W3C Multimodal Architecture http://www.w3.org/TR/mmi-arch/
- Speech mash-ups http://www.research.att.com/viewProject.cfm?prjID=355
- I18N http://www.w3.org/International/
- IETF BCP47 http://www.rfc-editor.org/rfc/bcp/bcp47.txt
- SSML 1.1 http://www.w3.org/TR/speech-synthesis11/
- SCXML 1.0 http://www.w3.org/TR/scxml/
- VoiceXML 3.0 http://www.w3.org/TR/voicexml30/