Living rooms getting smarter with multimodal and multichannel signal processing
Dimitrios B. Dimitriadis and Juergen Schroeter
SLTC Newsletter, July 2011
Nobody can predict with 100% accuracy what the future will bring, so we won't pretend we have all of the answers. However, we do know that technology use in the living room is closely tied to the future of the Internet, affecting how we consume media, how we communicate with friends, how we play games, and how we shop. In this context, it is a given that platforms bringing the Internet into the home will allow seamless consumption of multimodal data across PCs, mobile devices and TVs under a thin application layer that is completely transparent to the user, who may jump from one device to another within the same session. Users will have access to their multimodal content regardless of their location, in the house or outside, and regardless of which device is being used. However, given the enormous variety and sheer amount of current and future media that we are, and will be, able to access, one of the difficult challenges that will define the success or failure of our vision of a "smart" living room is how we will access, search and retrieve this content in an easy, intuitive way. Possible solutions to these and other challenges, mostly related to a proper user interface for the "smart living room of the future", are explored in this article. Our not-so-wild guess is that the battle for dominance among future content-related technologies will last for many years. It is obvious that, with the introduction of applications and devices like Google TV, Microsoft Kinect and other products, this battle has already started.
The rest of this article is not, by any means, meant to provide a deep dive into all the relevant topics; rather, it offers a short description of how emerging technologies, and more specifically multimodal and multichannel signal processing, can improve the interaction between users and machines on the living-room "technology battlefield".
Living room of the Present
Today, the most common home control interfaces are still based on buttons, touch screens, and keyboards, found on remote controls and computer-like devices such as PDAs, smartphones, and tablet PCs. However, the principal trend is towards a more ubiquitous use of handheld devices in home automation, largely promoted by the success of the iPhone/iPad and Android-based devices. Although more and more smartphone or tablet applications provide speech interfaces, they do so in a rather limited and strictly constrained way, mainly for command-and-control purposes, where users speak highly stylized phrases into a close-talking microphone. With these devices, the vision of a natural interaction platform using speech and/or gestures remains distant and seems rather exotic.
In any case, key players in the present media industry already offer some technological options for users who want to turn their living room into a hi-tech room with a futuristic flavor. Some of these options (but certainly not all) are:
- media centers: Media centers deliver content found either on the Web or offered by cable TV providers. They are connected to one or more televisions and to the Internet. Most current media centers can deliver different content to different rooms/TVs and also provide recording capabilities. Setting up and using a media center is usually very easy and straightforward. Among the pioneers in this market are AT&T, with its U-VERSE set-top box and service, and Google, with Google TV. Microsoft is also pushing its "Media Center" application, included in the more advanced editions of Windows 7, for this use, enhanced by Xbox 360 consoles acting as "clients" that stream content to the connected television sets. Sony's PlayStation serves in a similar role.
"Fig.1: AT&T U-VERSE GUI"
- television sets: TVs remain unchallenged in their role as the centerpiece of any media-related home decor. Modern televisions feature 3D displays, delivery of online video content, and video games. More recently, however, the concept of the television has been extended by offering connections to the in-home broadband network via Wi-Fi or Ethernet, and by offering built-in applications like Skype, Netflix, YouTube and others. Manufacturers like Sony have developed and support platforms, e.g., the BRAVIA Internet Video platform, enabling content to be streamed to their TVs. These new TVs also show some "intelligence", like Sony's newly developed Track ID app, powered by Gracenote, which analyzes any selected song playing on the TV, identifies it, and provides artist, album, and song information. In our opinion, although much effort has been spent on a simplified user interface and improved functionality, the interaction between users and the TV still lacks robustness, efficiency and, certainly, naturalness.
- wireless speakers: A true wireless home theater is considered a "Holy Grail" by many audio-video quality lovers whose efforts have been hampered by the need to run speaker wires across their living rooms but who do not want to spend a fortune on them. However, due to recent advances in wireless technology, things are changing fast; wireless audio systems have become better, cheaper, faster, and easier to use and install. Still, the term "wireless" for multimedia content delivery mostly stands for "fewer" wires! While such systems eliminate speaker cabling, they still need power hook-ups and use cables for audio and video interconnections, although we expect the latter to change over the next few years. Quality, security, and pricing are among the main issues that need to be addressed before this technology can reach a wider market. Companies like Creative have developed their Bluetooth-based Pure Wireless technology to deliver high-quality audio. Other, more high-end options include Aperion Audio's $2,500 Intimus 4T Summit Wireless 5.1 Home Theater Speaker System, which only demands line-of-sight between the speakers and the controller, plus a power outlet for each component.
"Fig.2: Depth representation and skeleton assignment by Kinect"
- Kinect: Finally, the most revolutionary user-interface device of recent years is the Kinect from Microsoft. It combines multiple sensors and provides a working platform for building natural Human-Computer Interaction (HCI) applications. This device is the first of its kind and is expected to become the gold standard for natural HCI-related hardware. Among the provided functionalities is control of movies and music with gestures and/or speech commands. In addition, Kinect uses motion sensors that track the entire body by "assigning" a digital skeleton to each user present in its operating range, which is defined by its depth-measuring sensor. Another functionality advertised by Microsoft is facial recognition that personalizes the interaction according to a list of recognized faces. Finally, Kinect uses four microphones to recognize and separate speech signals from ambient noise in the room, making interaction more natural even when users are talking to it from across the room.
Living room of the Future
A current trend is the exponential growth of multimodal content, making media search and retrieval quite challenging. It is imperative to develop a novel, friendly way to sift through huge volumes of content, given that the currently common solutions for browsing, based on alphabetical listings, are certainly inadequate. The current search experience is frustrating but unavoidable given, for example, the hundreds of TV channels present in any cable Electronic Program Guide (EPG). Speech and multimodal interfaces could provide a more natural and efficient means of addressing such challenges. It is far more natural for people to say what they wish to watch, for instance, in order to search existing recordings, or to point at some part of the TV screen for further probing. Microphone arrays, cameras or Kinect-like devices can provide the necessary hardware to implement such speech- and gesture-based interaction. This interaction, combined with "intelligent" recommendation systems, such as the Netflix recommendation system proposed by AT&T Labs, will further improve the usability and naturalness of data-searching systems.
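To make the idea concrete, here is a minimal sketch, in Python, of how a recognized spoken query might be matched against EPG listings once ASR has turned the utterance into text; the function name and data layout are ours, purely for illustration, and a real system would use proper language understanding and recommendation models rather than simple keyword overlap:

    def search_epg(query_text, epg_entries):
        """Rank EPG entries by word overlap with a recognized spoken query.
        epg_entries is assumed to be a list of (title, description) tuples."""
        query_words = set(query_text.lower().split())

        def score(entry):
            title, description = entry
            entry_words = set((title + " " + description).lower().split())
            return len(query_words & entry_words)

        # Best-matching programs first; ties keep their original EPG order.
        return sorted(epg_entries, key=score, reverse=True)

For example, search_epg("tennis finals tonight", guide)[0] would return the program whose title and description share the most words with the query.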
The goal of a more natural Human-Computer Interface (HCI) is to build a system that allows users to use gestures along with grammar-free speech to search TV listings, record or play TV shows, and listen to music. Such a system should not be limited to media-center-related applications but, instead, should be built as a more generic system component or service, supporting a wide variety of applications and devices. By allowing users to employ gestures and speech as input to the system, naturalness can certainly be improved, while the TV screen provides visual system feedback.
We feel that the right starting point for building such a natural HCI is multimodal data processing, i.e., speech- and gesture-based interaction with a computer that is capable of parsing and "translating" such user input into commands and actions. For example, gesture-related information can be extracted from the visual images recorded by one or more cameras in the room. In addition, microphone arrays can capture and deliver much cleaner audio to the speech recognizer, following the user around as he or she moves in the room and thus alleviating the need to remain almost motionless or to wear a close-talking microphone. In this approach, there are some major problems that must be addressed: efficient and robust real-time processing of the user-tracking video and audio components, and how well these two modalities can be parsed and combined to produce the "desired action". Until very recently, examples of easy-to-buy devices providing sensors that support this kind of natural HCI could be counted on the fingers of one hand. This is the primary reason why Microsoft's Kinect is such a huge success: it combines all the necessary sensors, i.e., audio and visual sensors to probe the world, into a single, inexpensive, easy-to-use package.
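As a toy illustration of the kind of parsing and "translation" we have in mind, the Python sketch below fuses a recognized utterance with the most recent pointing gesture; the event types, field names and time window are all hypothetical, and a deployed system would rely on much richer semantic and temporal models:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class SpeechEvent:
        text: str          # e.g. "play that one", as returned by the recognizer
        timestamp: float   # seconds

    @dataclass
    class PointingEvent:
        target_id: str     # on-screen item the gesture tracker says was pointed at
        timestamp: float

    def fuse(speech: SpeechEvent, pointing: Optional[PointingEvent], max_gap: float = 1.5):
        """Resolve a deictic command ("... that one") against a recent pointing
        gesture; otherwise fall back to the spoken words alone."""
        if "that" in speech.text and pointing is not None \
                and abs(speech.timestamp - pointing.timestamp) <= max_gap:
            return {"action": "play", "target": pointing.target_id}
        if speech.text.startswith("play "):
            return {"action": "play", "target": speech.text[len("play "):]}
        return None   # no confident interpretation; ask the user to rephrase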
Furthermore, there is some good news concerning the "heavy" processing of huge volumes of multimodal data. Current experience has shown that much of this processing does not have to be done on local machines in the home; it can instead be provided by application servers in the "cloud". The advantage of this approach is that the powerful and expensive machines can be centralized, with the "Living Room of the Future" connected to them over the Web. End-user machines then only have to provide thin (and inexpensive) client services that communicate with these application servers, offering an affordable and scalable processing solution.
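In such an architecture the client-side logic can indeed be very thin. The sketch below assumes a purely hypothetical cloud ASR endpoint and response format, just to show how little the in-home device would need to do beyond capturing the audio and shipping it off:

    import requests  # third-party HTTP library, assumed to be installed

    ASR_ENDPOINT = "https://example.com/asr"  # hypothetical application server

    def recognize_in_cloud(wav_bytes):
        """Send raw WAV audio to the (hypothetical) cloud recognizer and
        return the transcript it produces."""
        response = requests.post(ASR_ENDPOINT, data=wav_bytes,
                                 headers={"Content-Type": "audio/wav"})
        response.raise_for_status()
        return response.json().get("transcript", "")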
There are some discrete steps that should be considered when building a multimodal interface. A first step towards localized and personalized applications in the home is an accurate real-time face-tracking scheme. Accurate face tracking can be quite challenging when the ambient conditions are unconstrained and vary greatly. Besides the environmental conditions, that is, lighting, furniture, physical layout of the rooms, etc., human behavior also increases the complexity of this task. For example, natural gestures create occlusions, caused by hands in front of the speaker's face, by other people, by clothes, etc., which must be adequately and smartly addressed before such systems can be commercialized.
Any effective face detection and classification system has to provide certain basic functionalities. First, all the faces/users present in the room have to be detected, after the 3D scene has been translated onto a 2D image plane. Obviously, projecting the real world onto a 2D plane presents challenges; for example, the 3D faces have to be correctly projected onto a "normalized" view. At present, the problem of accurately detecting a multitude of faces present in a room remains unsolved in general, especially when the users are allowed to have any orientation relative to the camera instead of just facing it "head on", which is the working assumption of present systems. Removing such constraints makes the detection task extremely complex and challenging.
As soon as the "presence" of faces in a camera's field of view is detected, visual features have to be extracted and the faces have to be recognized. These features, as in the previous step, should be translation-invariant and robust to changes in lighting and other random environmental conditions. The output of this process is extremely valuable, since the HCI system will then know exactly who is in the room and where each person is located.
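For readers who want a concrete starting point, the Python snippet below sketches the detection step using OpenCV's stock Haar-cascade frontal-face detector; it only copes with roughly frontal faces and is meant as an illustration, not as the pose-robust detector the previous paragraphs call for:

    import cv2  # assumes the opencv-python package is installed

    # Pre-trained frontal-face Haar cascade shipped with OpenCV.
    cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    detector = cv2.CascadeClassifier(cascade_path)

    def detect_faces(frame):
        """Return bounding boxes (x, y, w, h) of frontal faces in one video frame.
        Each crop could then be normalized and passed on to a face recognizer."""
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        gray = cv2.equalizeHist(gray)  # crude compensation for lighting changes
        return detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)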
The system can then adapt and provide personalized services according to the recognized users. Personalized profiles can be maintained for the most frequent users, tremendously simplifying further communication with the system. Among the elements of a user profile that could be personalized are the multimedia content covered and, more practically, the speech-recognition acoustic and language models that are vital for top-notch recognition accuracy. Obviously, a personalized interaction significantly enhances the feeling of naturalness, which is our main interest here.
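Such a profile need not be complicated; the sketch below shows one possible record the system could load once a user has been recognized (all field names and model paths are illustrative only):

    from dataclasses import dataclass, field

    @dataclass
    class UserProfile:
        """Hypothetical per-user profile loaded after face/voice recognition."""
        name: str
        favorite_channels: list = field(default_factory=list)
        # Paths to user-adapted ASR models; generic models are the fallback.
        acoustic_model: str = "models/generic_am"
        language_model: str = "models/generic_lm"

    profiles = {
        "alice": UserProfile("alice", ["news", "tennis"],
                             "models/alice_am", "models/alice_lm"),
    }

    def personalize(recognized_user_id):
        # Unknown faces get a generic "guest" profile.
        return profiles.get(recognized_user_id, UserProfile("guest"))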
Finally, a user tracker that knows the users' positions can direct multimodal data to them, providing personalized content simultaneously to different users in the same room. This kind of personalized multimedia delivery to a particular point in the room shares the same physical principles as microphone-array audio processing. (Note that current video technology enables 3D projection without special glasses when the viewer's position is known.)
Natural, spontaneous speech interaction with distant microphones is an important step towards the development of hands-free human-computer interfaces. Such interaction has to provide noise immunity, robustness and flexibility despite the obvious variability of environmental conditions, such as ambient noise level, reverberation, or multiple simultaneous conversations.
Voice recognition has been progressively introduced into this demanding application field. However, so far its use appears to be heavily constrained, demanding a headset or a close-talking mobile device. Far-field ASR, at this moment, achieves quite unsatisfactory performance, prohibiting any immediate commercial deployment. One more interesting point is that the devices most commonly used for such tasks are microphone arrays built primarily for videoconferencing applications. Therefore, they are quite expensive and still far from wide public availability. The lack of commercial devices employing more than two microphones is noticeable and indicative of the technological gap that remains to be addressed.
One of the biggest challenges for a large-scale introduction of ASR technologies into home automation systems is increasing robustness against spontaneous speech, noise interference and uncontrolled environmental acoustic conditions. The variability of the input speech related to microphone location is one of the most critical issues, often causing degradation in ASR performance.
However, the combination of visual information with (acoustic) microphone arrays can certainly provide accurate directional beams aimed at very narrow spots in the room. In this respect, we are not that far from systems that will enable users to efficiently interact with devices in a crowded room, with music playing in the background and other people chatting, even when the microphones are mounted at some distance. For the time being, accurate localization requires more than two microphones, which increases the overall computational load and the respective cost. However, new algorithms, better array designs and more powerful machines are being introduced and made available every day.
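The core of such spatial filtering can be surprisingly compact. Below is a minimal delay-and-sum beamformer in Python/NumPy, assuming far-field (plane-wave) propagation and known microphone positions; practical systems would add adaptive beamforming, dereverberation and speaker tracking on top of this:

    import numpy as np

    def delay_and_sum(signals, mic_positions, look_direction, fs, c=343.0):
        """Steer the array toward look_direction (a unit vector pointing at the
        talker) and average the time-aligned channels.
        signals: (num_mics, num_samples), mic_positions: (num_mics, 3)."""
        num_mics, num_samples = signals.shape
        # Arrival-time advance of each mic relative to the array origin
        # for a plane wave arriving from look_direction.
        advances = mic_positions @ look_direction / c          # seconds
        freqs = np.fft.rfftfreq(num_samples, d=1.0 / fs)
        spectra = np.fft.rfft(signals, axis=1)
        # Delay each channel by its advance so all channels line up in time.
        aligned = spectra * np.exp(-2j * np.pi * freqs[None, :] * advances[:, None])
        return np.fft.irfft(aligned.mean(axis=0), n=num_samples)

With the look direction supplied by the visual user tracker, the same few lines can simply be re-steered whenever the talker moves.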
The ultimate living room of the future is expected to provide distant speech interaction based on ad-hoc microphone networks with hundreds of randomly distributed microphones. Although many cheap microphones will be installed, only a small subset of them will be active at any given time, selected according to the changing conditions, to provide the necessary speech-based human-computer interaction. This will allow users on the move to utter commands or describe tasks that today can only be carried out by touch or by pressing buttons, e.g., to search for some video clip and stream it to the media center.
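How that active subset is chosen is itself an open research question; as a crude placeholder, the sketch below simply keeps the few channels with the highest short-term energy, standing in for a real criterion such as estimated SNR, speech likelihood, or proximity to the tracked speaker:

    import numpy as np

    def pick_active_mics(frames, num_active=4):
        """frames: (num_mics, frame_len) array holding the most recent audio
        frame from every microphone in the ad-hoc network.  Returns the
        indices of the num_active loudest channels."""
        energy = np.mean(frames ** 2, axis=1)
        return np.argsort(energy)[::-1][:num_active]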
Since HCI, even when constrained to the "living room", is quite a broad problem involving many research disciplines, including speech processing, computer vision, psychology, artificial intelligence, and many others, an exhaustive listing of all the challenges that still have to be addressed is simply not realistic. Therefore, the main goal of this article was only to highlight some of them and to start a fruitful discussion about how close we are to building natural-feeling human-computer interfaces that are based on, or can track and understand, users' behavior and preferences. Several research challenges have to be addressed before any real victory can be claimed in this effort.
Solutions will come from multimodal and multichannel signal processing schemes that provide an accurate description of the extremely complicated acoustic and visual scene of a real-life living room, separating speech coming from multiple concurrent users and deciding which of them the system should "listen" to. In this working scenario, possible interfering sources (e.g., a ringing phone, the radio, background noise from outside the house, other people's speech, a baby's cry, etc.) have to be suppressed, improving speech-signal quality and, consequently, the far-field performance of any practical speech recognizer.
Dimitrios Dimitriadis (S'99, M'06) Dipl.-Eng. (ECE), 1999, and Ph.D.-Eng. (ECE), 2005, National Technical University of Athens, Athens, Greece.
From 2001 to 2002 he was an intern at the Multimedia Communications Lab at Bell Labs, Lucent Technologies, Murray Hill, NJ. From 2005 to 2009 he was a postdoctoral Research Associate at the National Technical University of Athens, where he also taught courses in Signal Processing. He is now a Principal Member of Technical Staff with the Networking and Services Research Laboratory, AT&T Labs, Florham Park, NJ. His current research interests include speech processing, analysis, synthesis and recognition, multimodal systems, and nonlinear and multi-sensor signal processing.
Dr. Dimitriadis has authored or co-authored over fifteen papers in professional journals and conferences. He has been a member of the IEEE Signal Processing Society (SPS) since 1999 and also serves as a reviewer.
Juergen Schroeter: Dipl.-Ing. (EE), 1976, and Dr.-Ing. (EE), 1983, Ruhr-Universitaet Bochum, W. Germany; AT&T Bell Laboratories, 1986-1995, AT&T Labs - Research, 1996-
From 1976 to 1985, Dr. Schroeter was with the Institute for Communication Acoustics, Ruhr-University Bochum, Germany, where he taught courses in acoustics and fundamentals of electrical engineering, and did research in binaural hearing, hearing protection, and signal processing.
At AT&T Bell Laboratories he worked on speech coding and synthesis methods employing models of the vocal tract and vocal cords. At AT&T Labs, he heads research on Speech Algorithms and Engines.
Dr. Schroeter is a Fellow of IEEE and a Fellow of ASA.