IEEE

The WAMI Toolkit and Portal: Web-Accessible Multimodal Interfaces

Alex Gruenstein and Ian McGraw

SLTC Newsletter, April 2009

Multimodal interfaces which make use of speech and natural language technologies have long made for great technology demonstrations, but they are rarely studied outside of the lab in the hands of a large number of real users. The open-source WAMI Toolkit makes it easy for any developer to create a Web-Accessible Multimodal Interface, which can be accessed via a standard web browser or a mobile device. We have had success making applications developed in this way available to users around the world. Moreover, when paired with the WAMI Portal, which provides speech recognition as a network service, any developer can speech-enable a web application with a few lines of code and a grammar.

Motivation

For researchers interested in multimodal interfaces, it has long been difficult to design systems which can easily be put into the hands of a large number of users. Speech-only interfaces have benefitted greatly from the ease with which they can be deployed over the telephone, allowing researchers to collect data from large numbers of callers. Understanding how users interact with multimodal interfaces, however, has typically required bringing subjects into the lab -- making data collection expensive, time-consuming, and restricted to an artificial environment.

Over the last several years, the Spoken Language Systems group at MIT's Computer Science and Artificial Intelligence Lab has been experimenting with a new way to make multimodal interfaces available to users: via the web. Using AJAX techniques in a Web 2.0 framework, we've developed a number of highly interactive, speech-enabled multimodal interfaces which are available to anyone with a web browser, and -- soon -- a mobile device like an iPhone. This has made it possible for us to collect and analyze interaction data generated by users from around the world.

Toolkit and Portal

Now, we're making the same infrastructure available publicly in two ways: via the WAMI Portal and through the open-source WAMI Toolkit. Our goal is to make it easy for any developer to deploy rich speech-enabled applications via the web, whether he or she is a web developer with no experience with speech technologies, or a speech expert who wants to make a sophisticated natural language interface available to a wider audience. WAMI significantly lowers the barrier to entry: web developers of any skill level can easily build speech applications, and experts in speech and natural language processing can deploy applications without significant web-development skills.

WAMI Toolkit Configurations

Three ways to use the WAMI Toolkit. We provide the components indicated in grey boxes, while application developers provide the ones in white. You can use WAMI: (a) with your own speech services, (b) via a server-side interface to the WAMI Portal, or (c) via the Javascript API.

The WAMI Toolkit provides the "plumbing" necessary to develop dynamic web applications that access speech services "in the cloud". Speech researchers with access to a recognizer and synthesizer can simply hook them up through a standard interface, plug in natural language understanding components, and deploy a rich web-accessible multimodal interface.

For the millions of web developers who don't have experience with, or ready access to, speech technology, we are providing network speech services, in English and Mandarin Chinese, via the WAMI Portal. Developers can use the WAMI Portal in two different ways. If they'd like to interact with speech services via server-side application logic, they can deploy an application built with the WAMI Toolkit to their servers. Or, more simply, they can add speech capabilities to any web page by adding a single line of HTML which links their page to the WAMI Portal. Then, they can specify the speech recognizer's grammar, obtain recognition results, and request speech synthesis through a Javascript interface -- no expertise in server-side programming is required. Anyone who can make a web page can add speech capabilities to it with a grammar and just a few additional lines of HTML and Javascript, as the example below shows. Developers specify at runtime a JSGF grammar to use as the language model -- there's no need to compile it in advance -- so it's easy to dynamically personalize or customize the application grammar.

WAMI Parrot application screenshot

A screenshot of a simple "Parrot" WAMI application, and the entire Javascript application code. Copy and paste this code into your own HTML file, and you'll have a working WAMI application that listens to you, plays back what it heard, and then uses speech synthesis to parrot back what was recognized. Its grammar is created inline and is very small; you can say: "Hello WAMI", "I want a cracker", or "Feed me!". Click on the screenshot to try the WAMI Parrot for yourself.
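Because the grammar is supplied at runtime as JSGF text, an application can assemble it from ordinary data. The following is a minimal sketch of that idea, not the Parrot's actual code: `buildJsgfGrammar` is a hypothetical helper, the rule name `<top>` and the grammar name `parrot` are arbitrary choices, and lowercasing and stripping punctuation is an assumption about what the recognizer expects. It rebuilds a grammar covering the three Parrot phrases.

```javascript
// Hypothetical helper (not part of the WAMI API): builds a JSGF grammar
// string from a list of phrases, so the grammar can be regenerated at runtime.
function buildJsgfGrammar(name, phrases) {
  const alternatives = phrases
    .map((phrase) =>
      phrase
        .toLowerCase()
        // Assumption: drop punctuation, which JSGF could read as syntax.
        .replace(/[^a-z0-9' ]+/g, " ")
        .trim()
        .replace(/\s+/g, " ")
    )
    .filter((phrase) => phrase.length > 0);

  if (alternatives.length === 0) {
    throw new Error("A grammar needs at least one phrase");
  }

  return [
    "#JSGF V1.0;",
    `grammar ${name};`,
    `public <top> = ${alternatives.join(" | ")} ;`,
  ].join("\n");
}

// The three phrases the Parrot application accepts.
const parrotGrammar = buildJsgfGrammar("parrot", [
  "Hello WAMI",
  "I want a cracker",
  "Feed me!",
]);

console.log(parrotGrammar);
// #JSGF V1.0;
// grammar parrot;
// public <top> = hello wami | i want a cracker | feed me ;
```

A personalized grammar would come from calling the same helper with user-specific phrases, for example a contact list, and passing the resulting string to whichever grammar-setting call the Javascript interface provides. That call is not shown here because its exact form is not given in this article.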

Mobile Devices

People access the web not just from browsers on their computers, but increasingly via mobile devices as well. Moreover, given the small screens and keyboards of these devices, speech is particularly useful. We are currently beta testing a WAMI Browser for the iPhone, which allows iPhone users to speak to any web site developed using the WAMI Toolkit. Users simply install the WAMI Browser on their iPhone, and use it to navigate to any WAMI-enabled website. Similar browsers for other mobile platforms are in the pipeline.

Conclusion

The WAMI Toolkit and Portal make it easy for any web developer to create a sophisticated multimodal interface which can be accessed by users from any standard web browser, and -- soon -- from the most popular mobile devices as well. Indeed, during a short three-week course in January, MIT undergraduates with no previous speech technology experience used WAMI to develop a number of compelling speech-enabled web applications. You can download the toolkit, sign up for a WAMI Portal developer key, follow a tutorial, and see a number of WAMI applications by visiting http://wami.csail.mit.edu.

Examples

City Browser

City Browser is a conversational interface which provides access to urban information.


Word War Vocabulary Game

Word War is a single or multiplayer game for practicing vocabulary words in English or Mandarin.


Flight Browser

Flight Browser is a conversational flight-information interface.


Word War on the iPhone; Flight Browser on the iPhone; Calculator on the iPhone

Word War and Flight Browser accessed via the WAMI iPhone browser; plus, a voice calculator written entirely in Javascript.