Island-Driven Search
Tara N. Sainath
SLTC Newsletter, February 2010
Many speech scientists believe that human speech processing is done first by identifying regions of reliability in the speech signal and then filling in unreliable regions using a combination of contextual and stored phonological information [1]. However, most current decoding paradigms in speech recognition consist of a left-to-right (and optional right-to-left) scoring and search component without utilizing knowledge of reliable speech regions. This is particularly a problem when the search space is pruned, as pruning algorithms generally do not make use of the reliable portions of the speech signal, and hence may prune away too many hypotheses in unreliable regions of the speech signal and keep too many hypotheses in reliable regions. Island-driven search is an alternative method to better deal with noisy and unintelligible speech. This strategy works by first hypothesizing islands from regions in the signal which are reliable. Further recognition works outwards from these anchor points to hypothesize unreliable regions. In this article, we discuss past research in island-driven search and propose some future research directions.
Early Work in Island-Driven Search
The BBN Hear What I Mean (HWIM) speech understanding system [2] was one of the first systems to utilize island-driven search for sentence parsing. In this system, parsing does not start at the beginning of the word network, but rather can start at confident regions within the network, at places known as islands. Parsing works outwards left and right from these island regions to parse a set of gap regions. While this type of approach showed promise for small grammars, it has not been explored in large vocabulary speech recognition systems due to the computational complexities of the island parser.
In addition, [3] explores using island information for sentence parsing. This paper looks at computing language model probabilities when the sentence is generated by a stochastic context-free grammar (SCFG). Specifically, the language model is computed by first having the parser focus on islands, defined as words with high semantic relevance. The parser then proceeds outwards towards gaps to compute the total language model score. While this paper presented solid theory behind an island-driven parser, subsequent use of this theory in modern day ASR systems has not been heavily explored.
Furthermore, [4] explores island-driven search for isolated-word handwriting recognition. The authors identify reliable islands in isolated words to obtain a small filtered vocabulary, after which a more detailed second-pass recognition is performed. This technique is similar to that explored by Tang et al. in [5] for improving lexical access using broad classes. However, these solutions cannot be directly applied to continuous speech recognition.
Current Work in Island-Driven Search
More recent work in island-driven search has focused on extending theories from the above papers into modern day continuous ASR. For example, Kumaran et al. have explored an island-driven search strategy for continuous ASR [6]. In [6], the authors perform a first-pass recognition to generate an N-best list of hypotheses. A word-confidence score is assigned to each word from the 1-best hypothesis, and islands are identified as words in the 1-best hypothesis which have high confidence. Next, the words in the island regions are held constant, while words in the gap regions are re-sorted using the N-best list of hypotheses. This technique was shown to offer a 0.4% absolute improvement in word error rate on a large vocabulary conversational telephone speech task. However, if a motivation behind island-driven search is to identify reliable regions to influence effective pruning, identifying these regions from an N-best list generated from a pruned search space may not be an appropriate choice.
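The island/gap idea described above can be sketched in a few lines. This is only an illustration and not the implementation from [6]: the function names, the confidence threshold of 0.9, and the rule that an N-best hypothesis is "island-consistent" when it contains every island as a contiguous word run, in order, are all assumptions made for the example.

```python
from itertools import groupby


def segment_islands(words, confidences, threshold=0.9):
    """Split a 1-best hypothesis into alternating island/gap runs.

    The threshold is an assumed value, chosen only for illustration.
    """
    labeled = [(w, c >= threshold) for w, c in zip(words, confidences)]
    return [
        ("island" if is_island else "gap", [w for w, _ in run])
        for is_island, run in groupby(labeled, key=lambda item: item[1])
    ]


def keeps_islands(hypothesis, segments):
    """Return True if every island occurs in order as a contiguous word run."""
    pos = 0
    for kind, run in segments:
        if kind != "island":
            continue
        n = len(run)
        for start in range(pos, len(hypothesis) - n + 1):
            if hypothesis[start:start + n] == run:
                pos = start + n  # later islands must come after this one
                break
        else:
            return False
    return True


def rerank(nbest, segments):
    """nbest is a list of (score, words); higher scores are better.

    Hypotheses that change an island are dropped, so only the gap
    words vary among the survivors, which are returned best first.
    """
    consistent = [(s, h) for s, h in nbest if keeps_islands(h, segments)]
    return sorted(consistent, key=lambda item: item[0], reverse=True)
```

Because the islands are taken from a first-pass 1-best hypothesis, this sketch inherits the limitation noted above: any island that was mis-recognized after pruning stays fixed during re-ranking.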
In [7], broad phonetic class (BPC) knowledge was utilized for island-driven search. First, islands were detected from knowledge of hypothesized BPCs. Using this island/gap knowledge, a method was explored to prune the search space to limit computational effort in unreliable areas. In addition, the author investigated scoring less detailed BPC models in gap regions and more detailed phonetic models in islands. Experiments on small vocabulary tasks indicate that the proposed island-driven search strategy results in an improvement in recognition accuracy and computation time. However, the performance of the island-driven search methods on large vocabulary tasks is worse than that of baseline methods. One possible reason for this performance degradation is that the island-driven search remains left-to-right in the proposed technique, opening up the idea of doing a bi-directional search.
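One way to picture region-dependent pruning is a beam whose width depends on whether the current frame lies in an island or a gap. The sketch below is a generic beam-pruning illustration, not the method of [7]: the function name, the log-score dictionary, and both beam widths are assumptions, and which region should receive the wider beam is a design choice left open here.

```python
def prune_frame(scores, in_island, island_beam=5.0, gap_beam=10.0):
    """Keep hypotheses whose log score is within the region's beam of the best.

    scores maps a hypothesis label to its log score at the current frame.
    island_beam and gap_beam are assumed values, not taken from any paper.
    """
    if not scores:
        return {}
    beam = island_beam if in_island else gap_beam
    best = max(scores.values())
    return {hyp: s for hyp, s in scores.items() if s >= best - beam}
```

With the assumed defaults, fewer competitors survive inside an island than inside a gap, reflecting the concern that standard pruning keeps too many hypotheses in reliable regions and too few in unreliable ones.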
Open Research Directions
While some of the current work in island-driven search has shown promise, there are plenty of interesting research directions and unanswered questions. As the work in [6] indicated, one major question is how to define an island. For example, as [1] discusses, when humans process speech, they first identify distinct acoustic landmarks to segment the speech signal into articulator-free broad classes. Acoustic cues are extracted at each segment to come up with a set of features for each segment, which make use of both articulator-bound and articulator-free features. Finally, knowledge of syllable structure is incorporated to impose constraints on the context and articulation of the underlying phonemes. Perhaps one idea is to look at defining islands through a combination of articulator-free and articulator-bound cues, in conjunction with syllable knowledge.
A second topic of interest is how to perform the island-driven search. The nature of speech recognition poses some constraints on the type of island-driven search strategy preferred. While island searches have been explored both unidirectionally and bi-directionally, the computational and on-line benefits of unidirectional search in speech recognition make this approach more attractive. Furthermore, if reliable regions are identified as sub-word units and not words, a bi-directional search requires a very complex vocabulary and language model. However, unidirectional island-driven search might not always address the problem of good hypotheses being pruned away. Therefore, an interesting future research direction might be to explore bi-directional search strategies which first start in the reliable island regions and work outwards to the gaps.
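The "start in the islands and work outwards" idea can be illustrated by the order in which positions would be visited. This is a toy sketch under stated assumptions, not a decoder: the function name is hypothetical, islands are given as half-open (start, end) index ranges, and gap positions are simply ranked by distance to the nearest island position.

```python
def outward_order(n, islands):
    """Return positions 0..n-1 with islands first, then gaps moving outward.

    islands is a list of half-open (start, end) ranges (an assumed format).
    Ties in distance are broken by position, so the left neighbor of an
    island is visited before the right neighbor at the same distance.
    """
    island_pos = [i for start, end in islands for i in range(start, end)]
    if not island_pos:
        return list(range(n))  # no islands: fall back to left-to-right

    def dist(i):
        return min(abs(i - p) for p in island_pos)

    return sorted(range(n), key=lambda i: (dist(i), i))
```

A real bi-directional search would also have to combine acoustic and language model scores across the junctions between expanding regions, which is where the vocabulary and language model complexity mentioned above comes in.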
For more information, please see:
[1] K. N. Stevens. Toward A Model for Lexical Access Based on Acoustic Landmarks and Distinctive Features. Journal of the Acoustical Society of America, 111(4):1872–1891, 2002.
[2] W. A. Lea. Trends in Speech Recognition. Englewood Cliffs, NJ: Prentice Hall, 1980.
[3] A. Corazza, R. De Mori, R. Gretter, and G. Satta. Computation of Probabilities for an Island-Driven Parser. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(9):936–950, 1991.
[4] J. F. Pitrelli, J. Subrahmonia, and B. Maison. Toward Island-of-Reliability-Driven Very-Large-Vocabulary On-Line Handwriting Recognition Using Character Confidence Scoring. In Proc. ICASSP, pages 1525–1528, 1991.
[5] M. Tang, S. Seneff, and V. W. Zue. Modeling Linguistic Features in Speech Recognition. In Proc. Eurospeech, pages 2585–2588, 2003.
[6] R. Kumaran, J. Bilmes, and K. Kirchhoff. Attention Shift Decoding for Conversational Speech Recognition. In Proc. Interspeech, 2007.
[7] T. N. Sainath. Island-Driven Search Using Broad Phonetic Classes. In Proc. ASRU, Merano, Italy, December 2009.
If you have comments, corrections, or additions to this article, please leave a comment below or contact the author: Tara Sainath, tsainath [at] us [dot] ibm [dot] com.
Tara Sainath is a Research Staff Member at IBM T.J. Watson Research Center in New York. Her research interests are in acoustic modeling. Email: tnsainat@us.ibm.com



