Speech Mashups – Speech Processing on Demand for the Masses
Giuseppe (Pino) Di Fabbrizio
SLTC Newsletter, January 2009
AT&T™ is now granting access to its speech processing technology for use in building speech recognition prototype applications that run on iPhones, BlackBerrys, and other networked devices. This access is being offered via a new approach to speech services, the speech mashup, that merges speech and web services into one consistent application framework without the need to install, configure, or manage speech recognition software and equipment.
Traditional speech-enabled services rely on telephony platforms that combine media processing, media interaction, and network signaling into a single architecture driven by high-level programming languages such as VoiceXML. While this approach reduces latencies and increases channel density, many speech services require only minimal network interaction, can use simpler media processing capabilities, and can tolerate larger latencies.
Speech access to information search, for example, is a growing business in the mobile services domain, where broadband wireless data access is now a viable channel to transmit speech and create rich media interactions. Thanks to the proliferation of service-oriented architecture (SOA) and public web-based interfaces, producing new web services is now easy and within reach of ordinary developers. Major web industry players are opening up their walled gardens of proprietary content (Google™ Web API, Yahoo!® Developer API, Flickr® Services, YELLOWPAGES.COM™ API, etc.), allowing consumers and enterprises to access technology that would otherwise be unavailable. Mashups, or web application hybrids, are rapidly becoming the most popular approach to aggregating these web services and creating new ones.
AT&T speech mashup architecture.
AT&T Labs – Research extended this successful paradigm by adding speech processing capabilities, creating AT&T Speech Mashups: a new software framework that casts AT&T’s WATSON speech recognition and Natural Voices Text-to-Speech Synthesis as web services to economically bring speech processing technologies to the larger web and mobile developer community. This new capability provides network-hosted speech technologies for multimedia devices with broadband access (iPhone, BlackBerry®, IPTV set-top boxes, smartphones, etc.) without having to install, configure, and manage speech recognition software or equipment. Speech mashups enable easy and rapid development of new speech and multimodal mobile services as well as new web-based services. The software implementation is based on well-established web programming models and technologies, such as SOA, REST, AJAX, JavaScript, and JSON.
AT&T CTO, John Donovan, talking to the press at the 2008 AT&T Technology Showcase.
The concept behind the speech mashup technology is intuitive and similar to the familiar web application approach. The speech is first captured on the device (the client) through the microphone and compressed using one of the available speech coders (for example, the AMR coder at 12.2 kb/s). Then an HTTP connection is established with the speech mashup portal (the server), which delivers the bit stream to the AT&T WATSON speech recognition engine along with a set of parameters, including a reference to the grammar used to recognize the utterance. The recognition results are posted back to the client, which uses them to take the next action. Depending on the complexity of the task, a semantic interpretation can be added to the results, so that natural language variations of the same intent are interpreted properly. The speech mashup portal makes the AT&T WATSON speech engine accessible from any network as a web service and exposes it through a simple HTTP API. It takes care of uploading and compiling the user’s grammars, logs the service activities, and provides tools for utterance transcription. Full documentation and code samples are provided online as well.
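The client side of this exchange can be sketched in a few lines of Python. This is only an illustration of the request and response flow described above, not the actual speech mashup API: the portal URL, the `grammar` and `resultFormat` parameter names, and the JSON result layout (`hypotheses`, `text`, `confidence`) are all assumptions made for the example.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint; the real portal address and parameters may differ.
PORTAL_URL = "https://portal.example.com/recognize"


def build_recognition_request(audio: bytes, grammar: str,
                              content_type: str = "audio/amr") -> urllib.request.Request:
    """Wrap compressed audio in an HTTP POST that names the grammar to use."""
    # Assumed parameter names: "grammar" selects a previously uploaded grammar,
    # "resultFormat" asks for a JSON response.
    query = urllib.parse.urlencode({"grammar": grammar, "resultFormat": "json"})
    return urllib.request.Request(
        url=f"{PORTAL_URL}?{query}",
        data=audio,
        headers={"Content-Type": content_type},
        method="POST",
    )


def best_hypothesis(response_body: bytes) -> str:
    """Return the highest-confidence transcription from an assumed JSON layout."""
    result = json.loads(response_body)
    hypotheses = result.get("hypotheses", [])
    if not hypotheses:
        return ""
    return max(hypotheses, key=lambda h: h.get("confidence", 0.0))["text"]


def recognize(audio: bytes, grammar: str) -> str:
    """Send the audio to the portal and return the best transcription."""
    request = build_recognition_request(audio, grammar)
    with urllib.request.urlopen(request, timeout=10) as response:
        return best_hypothesis(response.read())
```

A client would call `recognize(amr_bytes, "pizza")` after recording and compressing an utterance, then act on the returned text. Keeping the request construction and result parsing in separate functions lets both be exercised without a network connection.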
iPizza - multimodal pizza ordering prototype.
Speech mashup technology was publicly demonstrated during the AT&T Technology Showcase held in New York City on September 15, 2008. AT&T showed several futuristic services that envision the combination of the iPhone (or iPhone-like devices) and speech recognition as the main mode of service interaction. Among many service concepts, the integration of U-Verse, the AT&T IP-based TV service, and the iPhone inspired several potential new multimodal prototypes. One example is iMOD (Movie On Demand), a multimodal interface that combines speech input with graphical interaction on the iPhone to enable users to rapidly find movies on demand using a mobile device. Users can speak queries like "Action movies with Bruce Willis" or "Movies directed by Woody Allen and starring Diane Keaton" and play video clips on the phone itself or start watching the movie on TV.
iMOD - multimodal movies on demand.
Another service prototype, iPizza, implements a multimodal interface for ordering pizza. It combines speech input with graphical interaction on the iPhone to enable users to rapidly select menu items on a mobile device. Users can speak naturally and request multiple items at the same time. The web interface lets users easily navigate and update the items in the shopping cart. Full ordering requests can be formulated in one sentence like: "I’d like to order a pizza with mushrooms and ham, two Diet Pepsi and baked cinnamon sticks."
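To illustrate how a client might act on a semantic interpretation of such a request, the sketch below folds an interpreted order into a shopping cart. The interpretation structure (`items`, `name`, `quantity`, `toppings`) is hypothetical and hand-written for the order quoted above; it is not the format produced by the iPizza prototype or the speech mashup portal.

```python
from collections import Counter


def update_cart(cart: Counter, interpretation: dict) -> Counter:
    """Add every item of an interpreted order to the cart."""
    for item in interpretation.get("items", []):
        toppings = item.get("toppings", [])
        # Toppings are sorted so the same pizza always maps to the same cart entry.
        key = item["name"]
        if toppings:
            key = f"{key} ({', '.join(sorted(toppings))})"
        cart[key] += item.get("quantity", 1)
    return cart


# Hypothetical interpretation of the one-sentence order quoted above.
interpretation = {
    "items": [
        {"name": "pizza", "quantity": 1, "toppings": ["mushrooms", "ham"]},
        {"name": "Diet Pepsi", "quantity": 2},
        {"name": "baked cinnamon sticks", "quantity": 1},
    ]
}

cart = update_cart(Counter(), interpretation)
for name, quantity in cart.items():
    print(f"{quantity} x {name}")
```

Because the cart is keyed by item description, a follow-up spoken request or a tap in the graphical interface can update the same entries, which is the kind of mixed interaction a multimodal interface relies on.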
Speak4it - multimodal local business search.
Finally, created in collaboration with YELLOWPAGES.COM and available at the Apple App Store for iPhone customers, Speak4it (http://www.speak4it.com) demonstrates how to access local business listings with natural language queries. Examples include "Italian restaurants in Florham Park New Jersey" or, relying on the phone's GPS, "Show me the nearest Bank of America offices."
AT&T is planning to make more tools available to the speech research community, including more code examples for the iPhone and more general-purpose precompiled grammars. Send an email to watsonadm [at] research.att.com to request a speech mashup account for non-commercial use.

