The INTERSPEECH 2009 Emotion Challenge: Results and Lessons Learnt
Bjoern Schuller, Stefan Steidl, Anton Batliner, and Filip Jurcicek
SLTC Newsletter, October 2009
The INTERSPEECH 2009 Emotion Challenge, organised by Bjoern Schuller (TUM, Germany), Stefan Steidl (FAU, Germany), and Anton Batliner (FAU, Germany), was held in conjunction with INTERSPEECH 2009 in Brighton, UK, on September 6-10. This challenge was the first open public evaluation of speech-based emotion recognition systems with strict comparability, in which all participants used the same corpus. The German FAU Aibo Emotion Corpus of spontaneous, emotionally coloured speech from 51 children served as the basis. The corpus came with clearly defined test and training partitions that incorporate speaker independence and different room acoustics, as needed in most real-life settings.
Three sub-challenges (Open Performance, Classifier, and Feature) addressed classification of five non-prototypical emotion classes (anger, emphatic, neutral, positive, remainder) or two emotion classes (negative, idle). The Open Performance Sub-Challenge allowed contributors to use their own features with their own classification algorithms. However, they had to stick to the definition of the test and training sets. In the Classifier Sub-Challenge, participants designed their own classifiers but had to use a large set of standard acoustic features, computed with the openSMILE toolkit and provided by the organisers. Participants had the option to subsample, alter, and combine features (e.g. by standardisation or analytical functions). The training could be bootstrapped, and several classifiers could be combined by methods such as ensemble learning, or side tasks such as gender could be learned. However, the audio files could not be used for additional feature extraction in this task. In the Feature Sub-Challenge, participants were encouraged to design their 100 best features for emotion classification, to be tested by the organisers in an equivalent setting. In particular, novel, high-level, or perceptually adequate features were sought after.
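As an illustration of the kind of feature alteration permitted in the Classifier Sub-Challenge, the following is a minimal sketch of standardisation (z-scoring) in plain Python. It is not the procedure of any participant: the function names and the toy feature values are hypothetical, and it assumes that the statistics are estimated on the training partition only, in keeping with the rule that the test labels and partitions had to be respected.

```python
import math


def fit_standardiser(train_rows):
    """Estimate per-feature mean and standard deviation on training data only."""
    n = len(train_rows)
    dims = len(train_rows[0])
    means = [sum(row[j] for row in train_rows) / n for j in range(dims)]
    stds = []
    for j in range(dims):
        variance = sum((row[j] - means[j]) ** 2 for row in train_rows) / n
        # Assumption: a constant feature keeps a divisor of 1.0 to avoid division by zero.
        stds.append(math.sqrt(variance) or 1.0)
    return means, stds


def standardise(rows, means, stds):
    """Apply training-set statistics to any partition (training or test)."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


if __name__ == "__main__":
    # Hypothetical two-dimensional feature vectors, not taken from the corpus.
    train = [[1.0, 10.0], [3.0, 10.0]]
    test = [[5.0, 12.0]]
    means, stds = fit_standardiser(train)
    print(standardise(train, means, stds))
    print(standardise(test, means, stds))
```

Reusing the training statistics for the test partition keeps the test data out of every estimation step.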
Participants did not have access to the labels of the test data, and all learning and optimisation was based only on the training data. However, each participant could upload instance predictions up to 25 times to receive the confusion matrix and results on the test data set. The upload format contained instance and prediction, and optionally additional probabilities per class. This later allowed a final fusion by majority vote over the predicted classes of all participants' results to demonstrate the best possible performance of the combined efforts. As classes were unbalanced, the primary measure to optimise was unweighted average (UA) recall, with accuracy as the secondary measure. The choice of unweighted average recall was a necessary step to better reflect the imbalance of instances among classes in real-world emotion recognition, where an emotionally "idle" state usually dominates. Other well-suited and interesting measures, such as the area under the receiver operating characteristic (ROC) curve, were also considered; however, they were not used as they were not yet common measures in the field at the time.
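To make the difference between the two measures concrete, here is a minimal sketch that computes unweighted average recall (the plain mean of the per-class recalls) and accuracy for a list of predictions. The function names and the toy labels are hypothetical and are not taken from the challenge's evaluation scripts.

```python
from collections import Counter


def recall_per_class(true_labels, predicted_labels):
    """Recall of each class: correctly predicted instances / instances of that class."""
    totals = Counter(true_labels)
    hits = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


def unweighted_average_recall(true_labels, predicted_labels):
    """Mean of the per-class recalls; every class counts equally, whatever its size."""
    recalls = recall_per_class(true_labels, predicted_labels)
    return sum(recalls.values()) / len(recalls)


def accuracy(true_labels, predicted_labels):
    """Fraction of instances classified correctly; dominated by the largest class."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


if __name__ == "__main__":
    # Toy two-class example in which "idle" dominates (hypothetical data).
    true = ["idle"] * 8 + ["negative"] * 2
    pred = ["idle"] * 10  # a system that always predicts the majority class
    print("accuracy:", accuracy(true, pred))
    print("UA recall:", unweighted_average_recall(true, pred))
```

In this toy case the always-"idle" system reaches an accuracy of 0.8 but a UA recall of only 0.5, which is chance level for two classes.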
The organisers did not take part in the sub-challenges but provided baselines using the two most popular approaches. First, the WEKA toolkit was used with Support Vector Machines. Second, the HTK toolkit was used to train Hidden Markov Models. The organisers intentionally used standard tools so that the results would be reproducible.
All participants were encouraged to compete in multiple sub-challenges, and each participant submitted a paper to the INTERSPEECH 2009 Emotion Challenge Special Session. The results of the challenge were presented at a Special Session of INTERSPEECH 2009, and the organisers presented the awards to the winners at the closing ceremony. Three prizes (each 125 GBP, sponsored by the HUMAINE Association and the Deutsche Telekom Laboratories) could be awarded subject to the following preconditions: 1) the awarded paper was accepted to the special session after the INTERSPEECH 2009 general peer review, 2) the provided baseline (67.7% and 38.2% UA recall for the two- and five-class tasks) was exceeded, and 3) the best result in the sub-challenge and the task was achieved.
The Open Performance Sub-Challenge Prize was awarded to Pierre Dumouchel et al. (University de Quebec, Canada), who obtained the best result (70.29% UA recall) in the two-class task, significantly ahead of their eight competitors. The best result in the five-class task (41.65% UA recall) was achieved by Marcel Kockmann et al. (Brno University of Technology, Czech Republic), who surpassed six further results and were awarded the Best Special Session's Paper Prize, as their paper had also received the highest reviewers' score.
The Classifier Sub-Challenge Prize was given to Chi-Chun Lee et al. (University of Southern California, USA) for the best result in the five-class task, ahead of three further participants. In the two-class task, neither of the two participants exceeded the baseline.
Regrettably, no award could be given in the Feature Sub-Challenge. None of the feature sets provided by the three participants in this sub-challenge exceeded the baseline feature set provided by the organisers.
Overall, the results of all 17 participating sites were often very close to each other, and significant differences were as rare as one might expect in such a close competition. However, the "democratic" fusion of all participants' results exceeded every individual result: 71.16% and 44.01% UA recall for the two- and five-class tasks. The general lesson learned is thus "together we are best": apparently the different feature representations and learning architectures are strongest in combination. In addition, the challenge clearly demonstrated the difficulty of dealing with a real-life, non-prototypical emotion recognition scenario - this challenge remains.
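The "democratic" fusion can be pictured as a per-instance majority vote over the class predictions of all systems. The sketch below is only an illustration under stated assumptions: the function name and the toy predictions are hypothetical, and the tie-breaking rule (the earliest-listed system among the tied classes wins) is an assumption, since the report does not say how ties were resolved.

```python
from collections import Counter


def majority_vote(predictions_per_system):
    """Fuse class predictions by majority vote, instance by instance.

    predictions_per_system: one list of predicted labels per system,
    all in the same instance order and of the same length.
    """
    length = len(predictions_per_system[0])
    if any(len(p) != length for p in predictions_per_system):
        raise ValueError("all systems must predict the same instances")

    fused = []
    for votes in zip(*predictions_per_system):
        counts = Counter(votes)
        top = max(counts.values())
        # Assumed tie-break: take the first vote (in system order) with the top count.
        fused.append(next(v for v in votes if counts[v] == top))
    return fused


if __name__ == "__main__":
    # Hypothetical predictions of three systems for three instances
    # ("N" = negative, "I" = idle).
    system_a = ["N", "I", "I"]
    system_b = ["N", "N", "I"]
    system_c = ["I", "N", "I"]
    print(majority_vote([system_a, system_b, system_c]))  # ['N', 'N', 'I']
```

A vote of this kind needs only the predicted classes, which every upload contained; the optional per-class probabilities are not used here.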
The organisers plan to make the corpus used for the challenge publicly available. They also suggest that future challenges should consider evaluations across age groups, languages, and cultures; noisy, reverberated, or transmission-corrupted speech; multimodal sources; and, naturally, many further aspects and related topics such as non-linguistic vocalisations or automatic speech recognition of emotional speech.

