The AT&T Statistical Dialog Toolkit V1.0
Jason D. Williams
SLTC Newsletter, April 2010
The AT&T Statistical Dialog Toolkit V1.0 is now available to the research community. This toolkit simplifies building statistical dialog systems, which maintain a distribution over multiple dialog states.
Background
Recently, statistical approaches to spoken dialog systems have received increasing attention in the research community. Whereas conventional systems track a single hypothesis for the current dialog state, one of the main benefits of statistical dialog systems is that they maintain a distribution over a set of possible dialog states. This distribution can incorporate information and models not available to conventional dialog systems, including all of the entries on the ASR N-Best list, context-dependent models of user behavior, and prior personalized expectations of the user’s goals. Maintaining this distribution has been shown to yield gains in robustness to ASR errors when compared to conventional dialog systems [1,2,3].
Despite their promise, building statistical dialog systems remains challenging. A major problem is that there are no toolkits or frameworks currently available, so application developers must implement their own infrastructure from scratch. To move the field forward -- and to understand how to approach application development for statistical systems -- a toolkit is needed.
Toolkit description
The AT&T Statistical Dialog Toolkit (ASDT) V1.0 aims to provide an accessible means of building dialog systems that maintain a distribution over multiple dialog states. The core of the toolkit is a real-time belief update engine which maintains the distribution over a set of application-specific programmatic objects. The application developer creates these objects to suit their application; each object implements a small set of straightforward methods which the engine calls to update the distribution.
Each programmatic object represents a partition of one or more user goals. For example, in a tourist information domain, one partition object might represent all venues which are "NOT hotels"; another partition object might represent all venues which are "hotels near midtown"; another might represent all "hotels which are NOT near midtown and which are inexpensive". Partitions are sub-divided at run-time according to the ASR output. If the number of partitions grows too large, low-belief partitions are recombined to ensure that the update runs in real-time. The optimized update method employed by the engine is based on recent work described in [4], which draws on earlier work in [2] and [5].
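The splitting and recombination idea can be illustrated with a minimal sketch. It is not the toolkit's API and not the optimized method of [4]: the class CityPartition, the functions split and recombine, the single "city" slot, and the prior values are all hypothetical, chosen only to show how belief mass moves when a partition is sub-divided and later folded back.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class CityPartition:
    """Hypothetical partition over one slot.

    value=None is the catch-all partition: "any city not in `excluded`".
    """
    value: Optional[str]
    belief: float = 0.0
    excluded: set = field(default_factory=set)


def split(partitions, city, prior):
    """Split `city` out of the catch-all, moving its prior share of the belief."""
    root = next(p for p in partitions if p.value is None)
    if city in root.excluded:
        return  # this city already has its own partition
    remaining_prior = 1.0 - sum(prior[c] for c in root.excluded)
    share = prior[city] / remaining_prior
    child = CityPartition(value=city, belief=root.belief * share)
    root.belief -= child.belief
    root.excluded.add(city)
    partitions.append(child)


def recombine(partitions, max_partitions):
    """Fold the lowest-belief children back into the catch-all partition."""
    root = next(p for p in partitions if p.value is None)
    while len(partitions) > max_partitions:
        children = [p for p in partitions if p.value is not None]
        weakest = min(children, key=lambda p: p.belief)
        root.belief += weakest.belief
        root.excluded.discard(weakest.value)
        partitions.remove(weakest)


# Assumed prior over cities, for illustration only.
prior = {"Boston": 0.5, "Austin": 0.3, "Dallas": 0.2}
partitions = [CityPartition(value=None, belief=1.0)]
for hypothesis in ["Boston", "Austin"]:  # cities appearing in the ASR output
    split(partitions, hypothesis, prior)
recombine(partitions, max_partitions=2)
for p in partitions:
    print(p.value or "*", round(p.belief, 3))
```

In this sketch the total belief is conserved by both operations, so capping the number of partitions trades detail about unlikely goals for a bounded update cost.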
In addition to the update engine, ASDT v1.0 includes a large set of examples illustrating off-line and simulation usage. The examples also provide an end-to-end voice dialer system that draws on a database of 100,000 (fictitious) listings.
The toolkit and examples are written in standard Python. The provided end-to-end dialog system uses the in-the-cloud "AT&T Speech Mash-up Platform" for speech recognition and text-to-speech, obviating the need to install and configure ASR and TTS locally – only standard Python and common Python libraries are required. ASDT v1.0 is extensively documented, and uses standard mechanisms for configuration and logging.
Examples

Figure 1: Example of tracking a dialog off line. The user says "Boston" twice. In each recognition, the ASR produces an N-Best list with the wrong hypothesis in the top position, and "Boston" further down the N-Best list. After the second recognition, "Boston" has the highest belief. This example shows how the statistical approach synthesizes all the information on the N-Best list to arrive at the correct answer at the end of the dialog.
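The effect described in Figure 1 can be reproduced with a simplified Bayesian update. The sketch below is not the toolkit's engine: the function update_belief, the city list, the N-Best scores, and the assumptions that the user's goal is fixed and that the user always says the goal are all invented for illustration. Probability mass not accounted for by the N-Best list is spread evenly over the unlisted cities.

```python
def update_belief(belief, nbest):
    """One update: new belief(g) is proportional to belief(g) * P(ASR result | g).

    `nbest` maps each recognized city to its (assumed) ASR probability.
    """
    unlisted = [g for g in belief if g not in nbest]
    leftover = max(0.0, 1.0 - sum(nbest.values()))
    off_list = leftover / len(unlisted) if unlisted else 0.0
    unnormalized = {g: b * nbest.get(g, off_list) for g, b in belief.items()}
    total = sum(unnormalized.values())
    return {g: v / total for g, v in unnormalized.items()}


cities = ["Boston", "Austin", "Houston", "Dallas", "Denver"]
belief = {c: 1.0 / len(cities) for c in cities}  # uniform prior (assumed)

# Two recognitions; "Boston" is never the top ASR hypothesis.
turns = [
    {"Austin": 0.5, "Boston": 0.3},
    {"Houston": 0.5, "Boston": 0.3},
]
for nbest in turns:
    belief = update_belief(belief, nbest)
    print(max(belief, key=belief.get))
```

With these made-up scores, the top belief after the first recognition is "Austin", but after the second it is "Boston", because "Boston" is the only city supported by both N-Best lists.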

Figure 2: Screenshot of the end-to-end dialog system. The left column shows the ASR result. The center column shows the distribution over partitions – each partition corresponds to a Python object. Each partition contains four fields – first name, last name, city, and state. Values which are not known are shown with '*'. The right column shows marginal belief in the most likely value for each field.
Download and discussion list
The source code of ASDT v1.0 is available for download under a non-commercial source code license from the following URL:
      http://www.research.att.com/people/Williams_Jason_D/
This page also includes information on how to join an ASDT discussion list.
References
[1] R Higashinaka, M Nakano, and K Aikawa, "Corpus-based discourse understanding in spoken dialogue systems," in Proc ACL, Sapporo, 2003.
[2] SJ Young, M Gasic, S Keizer, F Mairesse, J Schatzmann, B Thomson, and K Yu, "The hidden information state model: a practical framework for POMDP-based spoken dialogue management," Computer Speech and Language, vol. 24, no. 2, pp. 150–174, April 2010.
[3] J Henderson and O Lemon, "Mixture model POMDPs for efficient handling of uncertainty in dialogue management," in Proc ACL-HLT, Columbus, Ohio, 2008.
[4] JD Williams, "Incremental partition recombination for efficient tracking of multiple dialog states," in Proc ICASSP, Dallas, Texas, USA, 2010.
[5] SJ Young, JD Williams, J Schatzmann, M Stuttle, and K Weilhammer, "The hidden information state approach to dialogue management," Cambridge University Engineering Department Technical Report CUED/F-INFENG/TR.544, 2006.
Jason D. Williams is Principal Member of Technical Staff at AT&T Labs - Research. His main research interests are spoken dialog systems, user modeling, and planning under uncertainty. Email: jdw@research.att.com.

