IEEE

Xuedong Huang, Joseph S. Perkell, Hiroya Fujisaki, and Christian Wellekens talk to Saras Institute

Chi Zhang, Annie Gilbert, Jin Kyu-Park, Marcel Waeltermann

SLTC Newsletter, April 2009

We continue the series of excerpts of interviews from the History of Speech and Language Technology Project. In these segments Xuedong Huang, Joseph S. Perkell, Hiroya Fujisaki, and Christian Wellekens discuss how they became involved with the field of speech and language technology.

These interviews were conducted by Dr. Janet Baker in 2005 and are being transcribed by members of ISCA-SAC, as described in earlier installments of this series. Sylvie Saget (Telecom Bretagne), Sunayana Sitaram (National Institute of Technology, Surat), and Antonio Roque (University of Southern California) coordinated transcription efforts and edited the transcripts.

Xuedong Huang

Transcribed by Chi Zhang (University of Texas at Dallas)

Q: Well it's a great pleasure, and we are honored to have Xuedong Huang here today. I appreciate your giving us an interview for this project... Basically, we'd like to start off by asking you... how did you get into speech and language?

A: After I just got my bachelor, I was in China. I graduated from Hunan University in my hometown. I went to Tsinghua University in Beijing for my master's degree. So at that time we had just got the Apple II. It was very hard to create any Chinese document. Then we received the IBM PC XT in 82.

Q: In 82. So that was very new.

A: That was 82. That was very new. And so for the Chinese computing industry at that time, also the research community, how to make computing easy for Chinese users was a big deal. And I remember the Prime Minister at that time made this a high priority for the Chinese research community. So speech was really a dream... so that's how I got into speech. So speech was my master thesis, PhD thesis. Since 1982, I have been doing speech from China, then Scotland, US, from university to the company.

Q: I guess you've... done [work] in Asia and Europe and US, you've done from academia to industry. But you haven't done to the government piece yet!

A: I was funded by government! (laughs) Both in the UK, in Edinburgh, also in the CMU, I was kind of paid by the government. By three governments.

Q: Yeah, three governments. (laughs) Why don't we go back closer to the beginning of this process. So now we've spanned basically the last thirty three years in one fell swoop. And so, starting back in Beijing, so you got a XT in 82.

A: Actually the first computer I had used in speech was Apple II.

Q: What speech were you doing on the Apple II?

A: We did command control, using Apple II with about a hundred voice command control, isolated. And there was 16K memory.

Q: And so what resources did you use with the Apple II to do that?

A: It didn't have a floppy disk, so the program was stored on cassette tape. And I was using assembly to write everything. We couldn't even afford to use DTW [Dynamic Time Warping] at that time. So we did the pre-warping of the linear, the frequency spectrum. So we built [special] hardware, put that into the Apple II, and that was basically a filter bank, and then we did the pre-warping of the signal given the template. Then we did a very simplified DTW-like speech recognition.

Q: So I mean, this was not a real time system...

A: Was real time.

Q: Oh, was real time...

A: Right... So when I got an IBM PC XT, I was so happy! I got floppy disk, not just floppy also hard disk! (laughs) I don't need to go rewind the cassette to find the program! And I was also able to use the UCSD Pascal. You know in the school, data structure class was taught using Pascal. So that was almost like a godsend. Then we developed a thousand word vocabulary system, and we used the TMS 320, and I programmed on the front-end and did MFCC, LPC analysis on that.
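For readers unfamiliar with the DTW [Dynamic Time Warping] template matching mentioned above, here is a minimal sketch of the textbook algorithm for isolated-word recognition. It is not a reconstruction of the Apple II system, which by Huang's account ran a simplified, pre-warped variant written in assembly. The function names, the two-dimensional feature frames, and the toy templates are all assumptions made for illustration.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two feature frames (e.g. filter-bank energies)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(seq, template):
    """Accumulated cost of the best time alignment between two frame sequences."""
    n, m = len(seq), len(template)
    inf = float("inf")
    # cost[i][j] = best cost of aligning the first i frames with the first j frames
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq[i - 1], template[j - 1])
            # Allowed moves: advance one sequence, the other, or both
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def recognize(utterance, templates):
    """Return the vocabulary word whose stored template aligns most cheaply."""
    return min(templates, key=lambda word: dtw_distance(utterance, templates[word]))


# Hypothetical two-word vocabulary with made-up 2-dimensional frames
templates = {
    "yes": [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]],
    "no": [[0.0, 1.0], [0.0, 2.0], [0.0, 3.0]],
}
# A slower, slightly noisy rendition of the "yes" template
utterance = [[1.0, 0.1], [1.1, 0.0], [2.0, 0.1], [3.0, 0.0]]
print(recognize(utterance, templates))
```

The full cost table grows with the product of the two sequence lengths, which gives some sense of why a machine with 16K of memory would call for a cheaper approximation.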

Joseph S. Perkell

Transcribed by Annie Gilbert (University of Montreal)

Q: How did you get into this field?

A: Well, it's a slightly complicated story. I'd been an undergraduate here in mechanical engineering but that didn't take too well. And for a couple of odd reasons I ended up going to dental school after I finished my undergraduate degree. And I wanted to take a year off in the middle of dental school but the only way I could do that was to do research. And I had arranged a research project that fell through so at the last minute I was looking for a summer job. And I got the names of four people at MIT and one of them was Ken Stevens and I called him and he said "well, you're a dental student so you know something about reading X-rays right?" and I said yeah. And he said "well, I have this X-ray motion picture that I'd like to have traced on a frame by frame basis. So why don't you come over and we'll talk about it."

And he showed me this film which he had made in Stockholm the year before... this was the summer of 64. And I looked at the film and then he offered me a pretty good salary so I started making these frame by frame tracings. And after about 3 weeks of looking at this film I became fascinated by these complicated movements of the tongue and lips and jaw and how they could convey information. And this was something that we did all the time and never thought about it. And it just became a topic of fascination for me.

So I continued making these tracings and then when it came time to go back to school I said I'd like to stay and I took the year off from school, kept going. And then I started looking at the results, plotting out movements and overlaying them and doing things like that and [it] just continued to be more and more fascinating. So I worked for a total of nine months on this by the end of which I had kind of assembled a number of plots and observations. And with Ken's help we turned that into a research monograph which was published by the MIT Press. And I was off and running, so when I had to go back to dental school I said I'd really like to keep doing this research so he said well when you get out of the Army, which I had to do after dental school, why don't you come back and get a doctorate and so that's what I did. I came back in 69, I got my doctorate in 74 and I've been doing this ever since then.

Hiroya Fujisaki

Transcribed by Jin Kyu-Park (University College London)

Q: [How did you get into this field]?

A: ...My supervisor [the late] Professor Sakamoto was a man of many talents, many fields, actually he was a pioneer in medical electronics... But he discouraged me: speech communication may be interesting but there are many many new fields that require your activity. But, I somehow stick to speech communication and had a chance to study, while still at the graduate school, to get a Fulbright scholarship to study at MIT. At MIT, of course, I looked for a supervisor and Ken Stevens was looking... so I immediately started working with Ken Stevens.

Q: What year was that?

A: 1958.

Q: 1958, ok.

A: First as a full-time graduate student but then he allowed me to continue with some Fulbright support as a part-time research assistant. And Ken Stevens also had a very good collaboration with linguist Morris Halle, and most likely more indirectly Roman Jakobson and a younger [Noam] Chomsky, and then also Gunnar Fant came to MIT so these are really fun that in order to study speech very deeply you just can not be satisfied working on acoustic aspects or engineering aspects. MIT was very good place to get to open my eyes to inter-disciplinary studies of human speech and language. That's how I got into this field.

And MIT, of course, had very easy access to the most up-to-date digital technology. The computers at the Research Laboratory of Electronics were really the first research tools to do digital signal processing, there was not even the term called digital signal processing, but my dissertation or Master's Thesis was the first use of digital computer for pitch extraction based on short-time autocorrelation. Of course the work was not complete and so it was followed and continued by Ben Gold at Lincoln Laboratory and later by Larry Rabiner...
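As a rough illustration of pitch extraction based on short-time autocorrelation, the technique named above, the sketch below picks the lag at which a short frame best matches a shifted copy of itself and reports the corresponding frequency. It is a generic textbook version, not a reconstruction of Fujisaki's program. The function names, the 50-500 Hz search range, the 8 kHz sample rate, and the synthetic sine-wave frame are all assumptions.

```python
import math


def autocorrelation(frame, lag):
    """Unnormalized short-time autocorrelation of one frame at a given lag."""
    return sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))


def estimate_pitch(frame, sample_rate, f_min=50.0, f_max=500.0):
    """Estimate the fundamental frequency (Hz) from the strongest autocorrelation peak."""
    # Only search lags that correspond to plausible pitch periods
    min_lag = max(1, int(sample_rate / f_max))
    max_lag = min(int(sample_rate / f_min), len(frame) - 1)
    best_lag = max(range(min_lag, max_lag + 1),
                   key=lambda lag: autocorrelation(frame, lag))
    return sample_rate / best_lag


# Synthetic test signal: a 40 ms frame of a pure 200 Hz tone (assumed values)
sample_rate = 8000
f0 = 200.0
frame = [math.sin(2 * math.pi * f0 * n / sample_rate) for n in range(320)]
print(estimate_pitch(frame, sample_rate))
```

For this synthetic tone the strongest peak falls at a lag of 40 samples, one full period, so the estimate comes out at 200 Hz. Real speech is much less tidy, and practical pitch trackers add steps such as voicing decisions and smoothing across frames.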

Christian Wellekens

Transcribed by Marcel Waeltermann (Deutsche Telekom Laboratories)

Q: How did you get into speech?

A: Oh, that was a long time ago when I was working for Philips Research. And, in the past, my activity was circuit theory. And, from circuit theory - it means analogue circuits, filter design and so - I evolved progressively into digital signal processing. And, from digital signal processing, in my lab somebody asked me if I wasn't interested in starting an activity in speech.

So, Philips was at that time interested in having somebody working in speech processing, at least a very small team. And at that time I started alone at the Philips Research Lab in Brussels, and very soon we decided to increase the team and we had some trainees, young students working with me. And then, Prof. Boite from Mons contacted me and one of my colleagues, to attend the presentation of the master theses of some of his students. And among them, one was Hervé Bourlard, so I decided to hire him and with my colleagues. So Hervé and myself were colleagues for, well, let's say for seven or eight years in the same office, so we shared the same office for several years. And, we started making research for Philips together for seven or eight years. So that was the beginning, because it was a demand from Philips.

At that time, [there was] another lab in Hamburg where Hermann Ney was working. And at that time, Hermann was just starting trying to use Hidden Markov Models. And he had at that time no experience at all in HMM. So, we had none, at that time either, that we learned and we made the first test on HMM in our lab. And, with a very small database, it was just a sentence that we succeeded to cut into three pieces. And that was the recognition of a lexicon of three words concatenated to a sentence. So we took two words by pair and then making three sentences trying to recognize that. And it worked quite well, of course it was quite easy stuff. And then, Hermann asked us to join his lab in Hamburg for a while, and Hervé and myself, we spent several weeks working in his lab, and collaborating with him. That was really the first time Philips was using Hidden Markov Models for speech processing.

Q: So, when might that have been, what year?

A: I think it was in 82 or something like that, around that.