Normalizing txts into text for speech synthesis
Matthew Marge
SLTC Newsletter, April 2010
Over the last five years, text messaging (also called SMS, short for "short message service") has become an increasingly popular form of communication in many parts of the world. Although state-wide bans on driving while talking on a mobile phone exist in the US, a growing distraction facing drivers is texting while driving. A recent study showed that people sending text messages in simulated driving tasks were six times more likely to be in car accidents [1]. Those with visual disabilities can also find text messaging (especially reading others' messages) to be a challenge.
Researchers at the University of Texas at Dallas are addressing this problem by developing a procedure to give text messages a "voice" with speech synthesis systems. Synthesizing text messages, however, is not trivial - abbreviations are a common property of text messages, both due to the lack of a full-size keyboard and the linguistic habits of many SMS users. Text messaging "lingo" associated with SMS is often considered a language of its own, with properties such as numbers replacing letters (e.g., "2day" for "today") or the removal of letters in a word (e.g., "sth" for "something"). As is common in most speech synthesis methods, these messages must be normalized into a format that can easily be interpreted by speech synthesizers.
Deana Pennell and others at the Human Language Technology Institute at UT Dallas are developing heuristic and statistical methods for normalizing text messages for synthesis. Asked about the key motivation for this work, Pennell said that "abbreviations are becoming more and more common, and text containing these abbreviations may contain useful information." She added, "This will only become more true as younger generations age, so it is important to begin working on the problem now."
Pennell and her colleagues first developed a set of rules for transforming normal English text to text messaging lingo [2]. These rules included removing characters that only weakly determine the overall phonetic properties of a word, such as the trailing "g" in words ending with "ing" and the silent "e" in words like "cable". Other rules exist at the syllabic level, such as the removal of vowels that are in the interior of a word's syllables (e.g., transforming "disk" to "dsk").
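To make the flavor of these rules concrete, here is a minimal sketch of three of them in Python. It is an illustration, not the rule set from [2]: the function names are hypothetical, the "silent e" test is a crude proxy (a final "e" after a consonant), and vowel removal is applied to the interior of the whole word rather than to each syllable, since no syllabifier is assumed.

```python
VOWELS = set("aeiou")

def drop_trailing_g(word):
    """'going' -> 'goin': drop the final 'g' of an '-ing' ending."""
    if word.endswith("ing") and len(word) > 3:
        return word[:-1]
    return word

def drop_silent_e(word):
    """'cable' -> 'cabl': drop a final 'e' that follows a consonant.

    This is only an approximation of "silent e" (assumption).
    """
    if len(word) > 2 and word.endswith("e") and word[-2] not in VOWELS:
        return word[:-1]
    return word

def drop_interior_vowels(word):
    """'disk' -> 'dsk': drop vowels that are neither the first nor the last letter.

    Works on the whole word, a simplification of the syllable-level rule.
    """
    if len(word) < 3:
        return word
    middle = "".join(c for c in word[1:-1] if c not in VOWELS)
    return word[0] + middle + word[-1]

def candidate_abbreviations(word):
    """Apply each rule independently and return the distinct abbreviations."""
    word = word.lower()
    rules = (drop_trailing_g, drop_silent_e, drop_interior_vowels)
    candidates = {rule(word) for rule in rules}
    candidates.discard(word)  # a rule that did not fire returns the word unchanged
    return sorted(candidates)

print(candidate_abbreviations("disk"))
print(candidate_abbreviations("cable"))
```

Applying each rule on its own keeps the sketch easy to follow; a fuller generator would presumably also combine rules to produce abbreviations that delete several characters at once.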
They are also developing a binary classifier to create abbreviations from English words. For each character in a word, the classifier decides whether that character could be removed to create an abbreviation. It is currently trained on a set of 1000 English words with 2-3 abbreviations for each word. Features in the maximum entropy-based model include neighboring character context, whether each character is a vowel or a consonant, and how each character is positioned in the syllables of a word. The abbreviations this model produces for each word are scored, ranked, and mapped to the corresponding word in a lookup table. They hope that this method can be a reasonable approach to predicting the correct word from an abbreviation without any sentence-level context.
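The per-character features described above might be extracted roughly as follows. This is a sketch under stated assumptions, not the feature set from [2]: the word is assumed to arrive already split into syllables (for example `["ca", "ble"]`), and the function name, feature names, and the `<s>` / `</s>` boundary markers are all hypothetical.

```python
VOWELS = set("aeiou")

def char_features(syllables):
    """Return one feature dict per character of the word formed by `syllables`.

    `syllables` is a list of strings; syllabification is assumed to be done elsewhere.
    """
    word = "".join(syllables)
    features = []
    index = 0  # position of the current character in the whole word
    for syllable in syllables:
        for pos, ch in enumerate(syllable):
            if len(syllable) == 1:
                place = "only"
            elif pos == 0:
                place = "first"
            elif pos == len(syllable) - 1:
                place = "last"
            else:
                place = "interior"
            features.append({
                "char": ch,
                # neighboring character context, with boundary markers at the word edges
                "prev": word[index - 1] if index > 0 else "<s>",
                "next": word[index + 1] if index + 1 < len(word) else "</s>",
                "is_vowel": ch in VOWELS,
                "syllable_position": place,
            })
            index += 1
    return features

for row in char_features(["ca", "ble"]):
    print(row)
```

Each dictionary would then be paired with a keep-or-delete label taken from a known abbreviation and passed to a maximum entropy (multinomial logistic regression) learner.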
Pennell evaluated the quality of the model by reversing the problem: determining the rate at which abbreviations from a test set mapped to the correct English words in the generated lookup table. Accuracy was 58% when only the best abbreviation-word match was selected from the lookup table. When the correct word needed only to appear among the three best candidates, accuracy rose to 75%. The researchers expect that incorporating more context, such as word-level n-grams, will improve the model.
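The difference between the best-match and three-best figures can be illustrated with a small top-k evaluation. The sketch below assumes the lookup table maps each abbreviation to candidate words ranked by score; the function names, the scores, and the toy data are invented for illustration and are not taken from [2].

```python
def build_lookup(scored_abbreviations):
    """Invert (word, abbreviation, score) triples into abbreviation -> ranked words."""
    table = {}
    for word, abbreviation, score in scored_abbreviations:
        table.setdefault(abbreviation, []).append((score, word))
    # Highest score first, so position 0 is the best candidate.
    return {
        abbreviation: [word for _, word in sorted(pairs, key=lambda p: -p[0])]
        for abbreviation, pairs in table.items()
    }

def top_k_accuracy(lookup, test_pairs, k):
    """Fraction of (abbreviation, word) pairs whose word is among the top k candidates."""
    hits = sum(
        1 for abbreviation, word in test_pairs
        if word in lookup.get(abbreviation, [])[:k]
    )
    return hits / len(test_pairs)

# Toy data with made-up scores.
scored = [
    ("because", "bcuz", 0.9),
    ("something", "sth", 0.8),
    ("south", "sth", 0.6),
    ("today", "2day", 0.95),
]
lookup = build_lookup(scored)
test_pairs = [("bcuz", "because"), ("sth", "south"), ("2day", "today"), ("gr8", "great")]

print(top_k_accuracy(lookup, test_pairs, k=1))
print(top_k_accuracy(lookup, test_pairs, k=3))
```

In this toy set, "sth" is ambiguous between "something" and "south", so the pair ("sth", "south") counts as a miss at k=1 and a hit at k=3. An abbreviation that is absent from the table, such as "gr8" here, is a miss at any k.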
Pennell finds the most challenging part of her work to be that the "lingo" associated with text messages is "constantly changing". "It would be impossible to just provide a dictionary of all abbreviations and use that for translation," she says.
One complicating trend is the rapid growth of text messaging from smartphones, which often have full keyboard layouts. Although this may decrease the use of numbers associated with text messages (e.g., "any1" for "anyone"), Pennell believes this problem "won't go away completely". "One look at internet forums and chat rooms will tell you why; people use these abbreviations frequently," she says. She adds that "we will continue to see deletion-based abbreviations and phonemic substitution-based abbreviations such as 'bcuz' for 'because' - not only because they are also common in chat rooms...but until unlimited texting plans are the norm".
Despite these challenges, there is a great deal of promise from this and related work. Not only will drivers be able to communicate more effectively with SMS when text messages are played back, but text messages will become accessible for a broader audience of users. "This technology can also be used for screen readers for the blind to enable them to better read forums, blogs and chat rooms," said Pennell.
For more information on this work, please see Pennell's 2010 ICASSP paper [2]. Researchers at Microsoft Research have published related work on composing text messages in the car.
- [1] F. A. Drews, H. Yazdani, C. N. Godfrey, J. M. Cooper, and D. L. Strayer, "Text Messaging During Simulated Driving," Human Factors, 2009.
- [2] D. Pennell and Y. Liu, "Normalization of text messages for text-to-speech," ICASSP, 2010.
Related work from MSR:
- Y. Ju and T. Paek, "A voice search approach to replying to SMS messages in automobiles," Interspeech, 2009.
- W. Wu, Y. Ju, X. Li, and Y. Wang, "Paraphrase detection on SMS message in automobiles," ICASSP, 2010.
If you have comments, corrections, or additions to this article, please contact the author: Matthew Marge, mrma...@cs.cmu.edu.



