Frederick Jelinek 1932–2010: The Pioneer of Speech Recognition Technology

Steve Young

SLTC Newsletter, November 2010

The speech recognition problem is fascinating because it is so simple to describe yet so astonishingly difficult to do. Humans do it effortlessly yet machine capability is still at an infant stage – and a young infant at that. Indeed, some question whether it can ever be achieved at all. In 1969, John Pierce from Bell Labs famously wrote in the Journal of the Acoustical Society of America that speech recognition was dominated by "mad scientists and untrustworthy engineers" and that "speech recognition will not be possible until the intelligence and linguistic competence of a human speaker can be built into the machine".

Despite this attack from such a leading figure as Pierce, in 1971 ARPA launched a major 5-year, $15m program to solve the "speech understanding problem". ARPA saw it as an Artificial Intelligence problem and therefore funded multidisciplinary teams of computer scientists and linguists. The goal was to develop machines which could recognise continuous speech from a 1000-word vocabulary. But the ARPA program failed and was terminated in 1976: the problem appeared to be just too difficult.

Meanwhile, a young post-graduate called Fred Jelinek at Cornell was working to extend the PhD work that he had completed in 1962 at MIT on Channel Coding. His goal was to extend our understanding of information theory and explore its application to practical problems. This work bore fruit and he helped to establish Cornell as a major centre for information theory. IBM meanwhile had been working on speech recognition since the late 1960s and, in the early 1970s, in an apparent change of tack, Fred moved from Cornell to IBM to take over the management of their speech and language processing activities.

Not surprisingly, but to the bewilderment of many others in the field, Fred’s approach to speech recognition was to view it as a channel coding problem. He asked us to imagine that the "human brain" sends a message (i.e. what it has in mind to say) down a noisy channel encoded as an acoustic waveform. The receiver, i.e. the machine, then has to use statistics to find the most probable message given the noisy acoustics. This is a simple enough concept today, but in 1970 casting speech recognition as a channel coding problem seemed pretty far-fetched. After all, Chomsky had written in his seminal 1957 work "Syntactic Structures" that "we are forced to conclude that grammar is autonomous and independent of meaning, and that probabilistic models give no insight into the basic problems of syntactic structure". By 1970, computational linguists regarded Chomsky’s position as axiomatic, and so perhaps Pierce was right: perhaps Fred and his followers really were "mad scientists and untrustworthy engineers".

Despite all this, Fred began his attempts to solve the speech recognition problem with an open mind and he did have linguists in his team. However, the story goes that one day one of his linguists resigned, and Fred decided to replace him not by another linguist but by an engineer. A little while later, Fred noticed that the performance of his system improved significantly. So he encouraged another linguist to find alternative employment, and sure enough performance improved again. The rest, as they say, is history: eventually all the linguists were replaced by engineers (and not just in Fred’s lab) and then speech recognition really started to make progress.

The beauty of Fred’s approach was simply that it reduced the speech recognition problem to one of producing two statistical models: one to describe the prior probability of any given message, and a second to describe the likelihood of an observed speech waveform given some assumed message. No intuition or introspection into the human mind was required, just real speech data to train the models.
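In symbols, the way these two models combine can be sketched with Bayes' rule. The notation here is an assumption made for illustration and is not taken from this article: W stands for a candidate message (a word sequence), A for the observed acoustic waveform, and W-hat for the message the machine chooses.

```latex
\hat{W} \;=\; \arg\max_{W} P(W \mid A)
        \;=\; \arg\max_{W} \frac{P(A \mid W)\,P(W)}{P(A)}
        \;=\; \arg\max_{W} P(A \mid W)\,P(W)
```

The denominator P(A) does not depend on W, so it can be dropped from the maximisation. What remains is the product of the likelihood of the waveform given the message, P(A | W), and the prior probability of the message, P(W), which are the two models that have to be trained from data.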

In 1976 Fred published a now famous paper in the Proceedings of the IEEE called "Continuous Speech Recognition by Statistical Methods", and eventually his view of the problem entirely reshaped the landscape and set the foundations for virtually all modern speech recognition systems. But Fred did not just set us all in the right direction. In his 20 years at IBM, he and his team invented nearly all of the key components that you need to build a high-performance speech recognition system, including phone-based acoustic models, N-gram language models, decision tree clustering, and many more.

The story does not end there of course. In 1993, Fred joined Johns Hopkins University and rapidly propelled the Center for Language and Speech Processing into being one of the top research groups in the world. He engaged with the major research programs, and he started the now famous Johns Hopkins Summer Workshops on Language and Speech.

And he focused more of his own time on his own research. In particular, the statistical model of language with which he had so successfully replaced linguistic rules was a simple word trigram – i.e. a very crude model of three-word sequences. Whilst it was obvious to everyone that this model was hopelessly impoverished, in practice it had proved almost impossible to improve on. However, in the year 2000, Fred published a paper with one of his students called "Structured language modeling for speech recognition". It sets out a principled way to incorporate linguistics into a statistical framework. As well as representing a significant step forward in language modeling, it has helped bridge the gap between speech engineers and the hitherto diverging computational linguistics community. In 2002, it received a "Best Paper" award and the citation read "for work leading to significant advances in the representation and automatic learning of syntactic structure in statistical language models". It seemed somehow fitting that 25 years after starting the movement towards statistical approaches, Fred sought to re-engage with aspects of more traditional linguistics. I hope Chomsky read the paper and enjoyed it as much as we speech technologists did.

In nearly five decades of outstanding research, Fred Jelinek made a truly enormous contribution. He was not a pioneer of speech recognition; he was the pioneer of speech recognition. His contribution has been recognized by many awards. He received the IEEE Signal Processing Society award in 1998 and the IEEE Flanagan Award in 2005. He received the ISCA Medal for Scientific Achievement in 1999 and he was made an inaugural ISCA Fellow in 2008. He was awarded an Honorary Doctorate at Charles University of Prague in 2001 and he was elected to the National Academy of Engineering in 2006.

Fred Jelinek was an inspiration to our community. He will be sorely missed by all who knew him.

Steve Young

October 2010

Steve Young is Chair, Speech and Language Technical Committee.