Cloudy with a chance of ... speech

Danilo Giulianelli and Giuseppe (Pino) Di Fabbrizio

SLTC Newsletter, February 2010

This article provides a brief introduction to speech processing in the cloud as a way to address the scalability challenges of speech-enabled applications running on mobile devices.

Cloud computing for the enterprise

In October 2009 in New York City, Amazon Web Services (AWS), a division of Amazon.com, hosted an event on "AWS Cloud for the Enterprise". The event was open to technology and business stakeholders interested in learning about the latest cloud computing technologies. The auditorium was crowded with more than 200 attendees representing a broad range of enterprises from finance, entertainment, telecommunications, insurance, pharmaceuticals, and more: an unusual mix of industries gathered in one place to hear the latest news about outsourcing computational and networking resources to the cloud. In his opening remarks, Dr. Werner Vogels, chief technology officer and vice president of Amazon.com, defined cloud computing as "a style of computing where massively scalable IT-related capabilities are provided 'as-a-service' across the Internet to multiple external customers." Right after the introduction, satisfied customers such as Netflix, Wired Magazine, Nasdaq OMX, Sony Music, and New York Life Insurance shared their experiences with the audience. The stage then passed to a series of technical presentations covering security practices, architectural configurations, and step-by-step procedures for moving enterprise applications into the AWS cloud.

What motivated the participants is a concrete and affordable way to obtain highly available and powerful hardware on demand, without the burden of hosting and managing the entire infrastructure needed to run a data center. Instead of physical servers, computing resources are 'dispersed' in the Internet cloud, where IT managers don't need to know where the machines are physically located and don't have to worry about hardware maintenance. However, virtualization of computing resources is not new to IT companies. Before cloud computing, similar approaches such as distributed computing, software as a service (SaaS), and service-oriented architectures (SOA) exploited the concept. But only recently have companies like Amazon created the infrastructure to improve the way hardware procurement is handled and streamlined the process of bringing a new service online.

The success of cloud computing is also due to the introduction of sophisticated management and monitoring capabilities (e.g., Amazon EC2 auto-scaling), where the platform monitors system resources and automatically scales them based on parameters such as CPU and memory utilization or network traffic. Cloud infrastructure as a service (IaaS) also enforces complete user separation by providing both CPU and network virtualization: not only are the processes of a given user separated from other users' processes, but network traffic is also kept separate, and each running virtual image receives and processes only the traffic destined for it.

Besides Amazon EC2, other examples of commercial cloud computing infrastructures include GoGrid, SunCloud, 3tera, Eucalyptus, and open source solutions such as Nimbus and the Xen Hypervisor. There are also cloud platforms, which deliver a computing platform with a complete solution stack (e.g., for running native cloud applications) to customers. Examples include Heroku (Ruby on Rails), Google App Engine (Java), Microsoft Azure (.NET), Rackspace Cloud (PHP), Salesforce (Java/C#-like), and, on the open source side, GridGain (Java).

One of the biggest advantages of cloud computing is that users can convert capital expenditures into operating expenses, since there is no hardware or software to buy and they pay only for the resources they use. The cloud computing model promotes acquiring resources on demand, releasing them when they are no longer needed, and paying for them based on usage. From the user's perspective, the cloud appears to have infinite capacity. Scalability is achieved through APIs that allow a new server instance to be started or stopped in a matter of minutes, instead of the weeks typically required by a traditional in-house procurement process. Besides scalability, other important properties addressed by cloud computing are cost-effectiveness, reliability, high availability, and security.
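
To make the pay-per-use argument concrete, the following minimal Python sketch compares the instance-hours consumed by a deployment sized for its peak hour with one that follows demand hour by hour. The load profile, the per-instance capacity, and all function names are hypothetical values chosen for illustration; they are not measurements from any real service.

```python
import math

# Hypothetical 24-hour load profile (requests per hour) and the number of
# requests a single instance is assumed to handle per hour.
HOURLY_LOAD = [100] * 8 + [900] * 4 + [2500] * 4 + [900] * 4 + [100] * 4
CAPACITY_PER_INSTANCE = 500


def instances_needed(requests_per_hour, capacity_per_instance):
    """Smallest number of instances able to serve the load (at least one)."""
    return max(1, math.ceil(requests_per_hour / capacity_per_instance))


def instance_hours_fixed(hourly_load, capacity_per_instance):
    """Size the deployment for the peak hour and keep it running all day."""
    peak = max(instances_needed(load, capacity_per_instance)
               for load in hourly_load)
    return peak * len(hourly_load)


def instance_hours_on_demand(hourly_load, capacity_per_instance):
    """Run only as many instances as each hour actually requires."""
    return sum(instances_needed(load, capacity_per_instance)
               for load in hourly_load)


if __name__ == "__main__":
    fixed = instance_hours_fixed(HOURLY_LOAD, CAPACITY_PER_INSTANCE)
    elastic = instance_hours_on_demand(HOURLY_LOAD, CAPACITY_PER_INSTANCE)
    print(f"peak-sized deployment: {fixed} instance-hours")
    print(f"on-demand deployment:  {elastic} instance-hours")
```

With this made-up profile, the peak-sized deployment uses 120 instance-hours per day and the on-demand one uses 48. The actual savings depend entirely on how bursty the real traffic is and on the provider's pricing.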

Speech in the cloud

Cloud computing seems ideal for today's speech processing computational needs. To date, this approach has been tested in two major areas: traditional speech processing applications such as IVR (Interactive Voice Response) systems, and the more recently explored web-based speech processing services, or speech mashups [1][2].

IVR services typically require dedicated hardware and closed software to support speech in real time, and are usually very CPU intensive. With CPU virtualization, processing resources can be added and removed based on channel utilization, rather than pre-allocating (often underutilized) servers based on traffic analysis. An example of a cloud-based communication approach that also includes IVR-like functionality is Tropo.com, where typical telecommunication functions are exposed on demand through common APIs.

The second class of services relies on network-enabled devices such as BlackBerry, iPhone, and Android-based mobiles, which can access the Internet at broadband speeds (e.g., 3G, WiFi) and capture and play the user's speech directly over the data channel. In this case, speech processing resources can be seen as network endpoints with a web-based interface, which can easily be replicated in the cloud model by creating new cloud instances. Mobile services following this model are usually created by aggregating different web services with the speech processing component (e.g., speech mashups).
In general, this class of services also applies to any device supporting HTTP, such as desktops, laptops, netbooks, and PDAs.

Considering that Apple recently announced 3 billion App Store downloads worldwide and that the number of applications available for the iPhone alone has passed the 140,000 mark, it is fair to expect that more and more of these applications will eventually be voice enabled. The uncertain success of such applications and the difficulty of predicting the traffic hitting the speech processing engine make this scenario an ideal candidate for putting the speech engine into the cloud. The service provider can then easily scale the processing power up or down according to traffic demand, without having to buy and manage dedicated hardware in advance.

Figure 1: AT&T Speech Mashup Cloud Architecture

To validate and measure the performance of this model, the AT&T speech mashup prototyping platform has been ported to the Amazon cloud (Figure 1). In this illustrative architecture, the Speech Mashup Manager runs on its own virtual instance and takes care of the user accounts and resource management databases. It also forwards all the speech mashup client requests to the appropriate speech service (either WATSON ASR or Natural Voices TTS). The ASR and TTS engines are bundled together in the same machine image, and at runtime an instance can be automatically configured to run either one or both services. To support auto-scaling, an elastic load balancer sits in front of the ASR/TTS instances and distributes requests among them in a round-robin fashion.
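
As an illustration of round-robin dispatching, here is a minimal Python sketch of a balancer that hands out backends in rotation and accepts new ones at runtime. It is not the implementation of Amazon's load balancer, which is a managed service; the class name, method names, and the "asr-N" instance labels are hypothetical.

```python
class RoundRobinBalancer:
    """Hands out backends in strict rotation (illustrative sketch only)."""

    def __init__(self, backends):
        self._backends = list(backends)
        self._index = 0  # position of the next backend to hand out

    def add(self, backend):
        """Register a new backend, e.g. an instance started by an auto-scaler."""
        self._backends.append(backend)

    def pick(self):
        """Return the backend that should receive the next request."""
        if not self._backends:
            raise RuntimeError("no backends registered")
        backend = self._backends[self._index]
        self._index = (self._index + 1) % len(self._backends)
        return backend


if __name__ == "__main__":
    balancer = RoundRobinBalancer(["asr-1", "asr-2"])
    print([balancer.pick() for _ in range(4)])
    balancer.add("asr-3")  # a newly started instance joins the pool
    print([balancer.pick() for _ in range(3)])
```

Round-robin ignores how busy each backend is, which is acceptable when requests have similar cost; otherwise a least-loaded strategy may be a better fit.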
In a minimal configuration, the system requires one instance for the Speech Mashup Manager (SMM), one load balancer, and one instance for the ASR/TTS engines. The system is currently configured so that when CPU utilization on the ASR/TTS engine instance reaches 80 percent, the Amazon auto-scaler starts another ASR instance and automatically adds it to the load balancer pool.
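
The scaling rule can be sketched as a small decision function. The following Python example is an assumption-laden illustration, not the Amazon auto-scaler's API: the 80% scale-out threshold comes from the text, the ten-minute window and the 40% scale-in threshold are taken from the test description later in this article, and the instance cap, the per-minute sampling, and the choice to apply the same window to scale-in are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    scale_out_cpu: float = 80.0  # percent, from the article
    scale_in_cpu: float = 40.0   # percent, from the test description
    sustain_minutes: int = 10    # window length, from the test description
    min_instances: int = 1
    max_instances: int = 4       # hypothetical cap, not from the article


def desired_instances(policy, cpu_samples, current_instances):
    """Return the instance count the policy asks for.

    cpu_samples holds one average CPU reading (percent) per minute,
    most recent last. Nothing changes until a full window is available.
    """
    window = cpu_samples[-policy.sustain_minutes:]
    if len(window) < policy.sustain_minutes:
        return current_instances
    if all(sample >= policy.scale_out_cpu for sample in window):
        return min(current_instances + 1, policy.max_instances)
    if all(sample < policy.scale_in_cpu for sample in window):
        return max(current_instances - 1, policy.min_instances)
    return current_instances


if __name__ == "__main__":
    policy = ScalingPolicy()
    print(desired_instances(policy, [85.0] * 10, current_instances=1))
    print(desired_instances(policy, [30.0] * 10, current_instances=2))
```

Keeping the scale-in threshold well below the scale-out threshold avoids instances being started and stopped repeatedly when the load hovers around a single value.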

The database containing the user accounts and application grammars is stored on an Elastic Block Store volume mounted on the SMM instance. To make the user and built-in grammars available to the instances running the WATSON ASR engine, a virtual tunnel over TCP/IP is set up between the SMM and the ASR/TTS instances. The file system containing the grammars is then NFS-mounted over the tunnel, allowing the WATSON ASR engines to see updated grammars in real time.

Figure 2: ASR CPU utilization during traffic load simulation

Testing the speech mashups in the cloud involved a traffic simulator generating a random traffic pattern based on a Poisson process [3]. The ASR engine was stressed with different traffic loads and grammar sizes. In all test conditions, the auto-scale trigger was activated when the CPU load reached 80% for more than ten minutes, and a new instance was created within a few minutes to absorb the extra load (Figure 2). Conversely, the extra instance was removed from the ASR pool when the load fell back below the 40% threshold.
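
A Poisson traffic pattern can be generated by drawing exponentially distributed gaps between consecutive requests. The Python sketch below shows one way to do this with the standard library; it is not the simulator used in the tests, and the function name, the request rate, and the duration are hypothetical.

```python
import random


def poisson_arrivals(rate_per_second, duration_seconds, seed=None):
    """Return sorted request times (seconds) for a Poisson process.

    In a Poisson process with the given average rate, the gaps between
    consecutive events are exponentially distributed with mean 1 / rate.
    """
    rng = random.Random(seed)
    now = 0.0
    arrivals = []
    while True:
        now += rng.expovariate(rate_per_second)
        if now >= duration_seconds:
            return arrivals
        arrivals.append(now)


if __name__ == "__main__":
    # Hypothetical load: 5 recognition requests per second for 10 minutes.
    times = poisson_arrivals(rate_per_second=5.0, duration_seconds=600.0, seed=42)
    print(f"{len(times)} requests generated, first at {times[0]:.3f} s")
```

A load generator would then send one recognition request at each returned time; raising the rate step by step reproduces the kind of load ramp needed to trigger the auto-scaler.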

Reliability and security

Compared to traditional carrier-grade telephony services, where uptime is typically a solid 99.999% (about 5 minutes of downtime per year), cloud computing offers fewer 'nines', with an uptime on the order of 99.95% (see, for example, the Amazon EC2 SLA). In fact, service instances running in a cloud infrastructure can sometimes crash or slow down when neighboring instances (e.g., processes running on the same physical server) are heavy CPU or network users. Under heavy load, load balancers can stop routing traffic, and network latencies can suddenly increase. A solution to this problem is to architect services for redundancy and to handle failures gracefully. For example, running multiple instances in different geographical locations (i.e., deploying services in physically distinct data centers) and taking frequent snapshots of the service data can help mitigate downtime in case of failures. In terms of user experience, however, failures usually translate into sluggish response times, which can be alleviated with visual feedback on the mobile screen.
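
The uptime figures can be converted into yearly downtime with simple arithmetic, and the same arithmetic shows why redundancy helps. The Python sketch below is a back-of-the-envelope illustration: the function names are hypothetical, a 365-day year is assumed, and the redundancy formula assumes that replicas fail independently and that a single healthy replica is enough to serve traffic, which real deployments only approximate.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # non-leap year


def downtime_seconds_per_year(availability_percent):
    """Expected yearly downtime implied by an availability percentage."""
    return (1.0 - availability_percent / 100.0) * SECONDS_PER_YEAR


def combined_availability(availability_percent, replicas):
    """Availability (percent) of several redundant replicas.

    Assumes independent failures: the service is down only when every
    replica is down at the same time.
    """
    unavailable = 1.0 - availability_percent / 100.0
    return (1.0 - unavailable ** replicas) * 100.0


if __name__ == "__main__":
    for percent in (99.999, 99.95):
        minutes = downtime_seconds_per_year(percent) / 60.0
        print(f"{percent}% uptime -> about {minutes:.1f} minutes down per year")
    print(f"two 99.95% replicas -> {combined_availability(99.95, 2):.6f}%")
```

Under these assumptions, 99.999% corresponds to roughly 5.3 minutes of downtime per year and 99.95% to roughly 4.4 hours, while two independent 99.95% deployments together would exceed five nines.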

Another frequently asked question about cloud computing concerns security. How secure is the cloud environment? The answer is: as secure as the service designer is willing to make it, which translates into spending more time architecting application security. This typically involves several techniques: 1) enforcing strict firewall policies; 2) opening only the minimum set of communication ports necessary to provide services; 3) using secure shell to connect to instances; 4) disabling privileged logins; 5) encrypting data for extra protection, etc. This process is also known as system hardening and can be supported by tools such as Bastille for Linux.

Conclusion

Speech processing in the cloud is a viable solution for applications running on mobile devices, desktops, laptops, and IVR systems. The infrastructure is highly available, economical, scalable, and reliable. Moreover, the cloud model ultimately translates capital expenses into operational costs, allowing service providers to focus on delivering new technologies and applications and thus reducing time to market. As an example, the AT&T Speech Mashup prototyping framework [1] was successfully ported to the Amazon cloud, and the existing speech mashup applications were migrated to the cloud framework.

Finally, it is easy to predict that the benefits of cloud computing will soon make traditional IT solutions obsolete and will attract more and more speech-enabled application providers in the near future.

References

[1] Giuseppe Di Fabbrizio, Jay G. Wilpon, Thomas Okken, A Speech Mashup Framework for Multimodal Mobile Services, The Eleventh International Conference on Multimodal Interfaces and Workshop on Machine Learning for Multi-modal Interaction (ICMI-MLMI 2009), Cambridge, MA, USA, November 2-6, 2009.

[2] A. Gruenstein, I. McGraw, I. Badr, The WAMI Toolkit for Developing, Deploying, and Evaluating Web-Accessible Multimodal Interfaces, ICMI '08: Proceedings of the 10th International Conference on Multimodal Interfaces, ACM, 2008, pp. 141-148.

[3] James F. Brady, Load Testing Virtualized Servers Issues and Guidelines.

[4] George Reese, Cloud Application Architectures - Building Applications and Infrastructure in the Cloud, O'Reilly, 2009.

If you have comments, corrections, or additions to this article, please contact the authors: Giuseppe (Pino) Di Fabbrizio, pino [at] research [dot] att [dot] com, Danilo Giulianelli, danilo [at] research [dot] att [dot] com.

Danilo Giulianelli is Principal Member of Technical Staff at AT&T Labs Research.

Giuseppe (Pino) Di Fabbrizio is Lead Member of Technical Staff at AT&T Labs Research.

