Spoofing and Countermeasures for Speaker Verification: a Need for Standard Corpora, Protocols and Metrics
Nicholas Evans, Junichi Yamagishi and Tomi Kinnunen
SLTC Newsletter, May 2013
Over the last decade biometric person authentication has revolutionised our approach to personal identification and has come to play an essential role in safeguarding personal, national and global security. It is well known, however, that biometric systems can be "spoofed", i.e. intentionally fooled by impostors [1].
Efforts to develop spoofing countermeasures are under way across the various biometrics communities (http://www.tabularasa-euproject.org/). However, research in automatic speaker verification (ASV) is far less advanced in this respect than research in physical biometrics such as fingerprint, face and iris recognition. Given that ASV is often applied as a low-cost, remote authentication technique over uncontrolled communications channels, without human supervision or face-to-face contact, speech is arguably more prone to malicious interference or manipulation than other biometric traits. Indeed, it has become clear that ASV systems can be spoofed through impersonation [2], replay attacks [3], voice conversion [4, 5] and speaker-adapted speech synthesis [6], though the bulk of the wider literature involves only text-independent ASV. While the research community has focused most of its efforts on tackling session, channel and environment variation, even the most sophisticated of today's recognisers can be circumvented. When subjected to spoofing, it is not uncommon for ASV performance to fall below that expected by chance.
Previous efforts to develop countermeasures for ASV generally utilise prior knowledge of specific spoofing attacks. This approach stems from the lack of standard datasets, which necessitates the collection or generation of purpose-made, non-standard datasets using specific spoofing algorithms. Familiarity with particular attacks and the availability of large quantities of example data then influence countermeasure development, since telltale indicators of spoofing can be identified with relative ease. For instance, synthetic speech generated according to the specific algorithm reported in [7] provokes lower variation in frame-level log-likelihood values than natural speech. Such characteristics are readily exploited to distinguish genuine accesses from spoofing attacks generated with the same or a similar approach to speech synthesis; other approaches may overcome this countermeasure. Given the difficulty of reliable prosody modelling in both unit-selection and statistical parametric speech synthesis, other researchers have explored the use of F0 statistics to detect spoofing attacks [8, 9]; even these countermeasures, however, may be overcome by alternative spoofing attacks. Vulnerabilities to voice conversion have also attracted considerable interest in recent years. One approach to detecting voice conversion, proposed in [10], exploits the absence of natural speech phase in voices converted according to a specific approach based on joint-density Gaussian mixture models. This countermeasure will likely be overcome by the approach to voice conversion proposed in [4], which retains real speech phase but which has been shown to reduce the short-term variability of the generated speech [11]. Once again, this countermeasure will not be universally reliable in detecting different forms of voice conversion, nor different forms of spoofing.
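As a purely illustrative sketch (not code from any of the cited works), the log-likelihood-variance cue described above can be expressed as follows: score each frame under a GMM of natural speech and use the variance of those per-frame scores as a countermeasure statistic, since over-smoothed synthetic speech tends to yield lower variance. The GMM parameters, function names and toy data below are all hypothetical; a real detector would train the model and tune the decision threshold on labelled corpora.

```python
import numpy as np

def gmm_frame_loglik(frames, means, variances, weights):
    """Per-frame log-likelihood under a diagonal-covariance GMM.

    frames: (T, D) feature matrix; means/variances: (M, D); weights: (M,).
    Returns a length-T vector of frame log-likelihoods.
    """
    T, D = frames.shape
    diff = frames[:, None, :] - means[None, :, :]                # (T, M, D)
    log_norm = -0.5 * (D * np.log(2 * np.pi)
                       + np.sum(np.log(variances), axis=1))      # (M,)
    log_comp = log_norm[None, :] - 0.5 * np.sum(
        diff ** 2 / variances[None, :, :], axis=2)               # (T, M)
    return np.logaddexp.reduce(np.log(weights)[None, :] + log_comp, axis=1)

def loglik_variance_score(frames, gmm):
    """Countermeasure statistic: variance of per-frame log-likelihoods.
    Low values suggest over-smoothed (possibly synthetic) speech."""
    return np.var(gmm_frame_loglik(frames, *gmm))

# Toy demonstration with random "features" standing in for real cepstra.
rng = np.random.default_rng(0)
gmm = (np.zeros((2, 13)), np.ones((2, 13)), np.array([0.5, 0.5]))
natural = rng.normal(0.0, 1.5, size=(200, 13))    # dispersed -> higher variance
synthetic = rng.normal(0.0, 0.5, size=(200, 13))  # over-smoothed -> lower variance
assert loglik_variance_score(synthetic, gmm) < loglik_variance_score(natural, gmm)
```

The point of the sketch is the fragility the article describes: the statistic keys on one artefact of one synthesis pipeline, so a generator that restores natural frame-to-frame variability would defeat it.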
The above work demonstrates the vulnerability of ASV to what are undeniably high-cost, high-technology attacks, but also the strong potential for spoofing countermeasures. However, the use of prior knowledge is clearly unrepresentative of the practical scenario and detrimental to the pursuit of general countermeasures with the potential to detect unforeseen spoofing attacks, whose nature can never be known in advance. There is thus a need to develop new countermeasures which generalise to previously unseen spoofing techniques. Unfortunately, the lack of standard corpora, protocols and metrics presents a fundamental barrier to the study of spoofing and generalised anti-spoofing countermeasures. While most state-of-the-art ASV systems have been developed using standard NIST speaker recognition evaluation (SRE) corpora and associated tools, there is no equivalent, publicly available corpus of spoofed speech signals and no standard evaluation or assessment procedures to encourage the development of countermeasures and their integration with ASV systems. Further work is also needed to analyse spoofing through risk assessment and to consider more practical use cases, including text-dependent ASV.
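To make the need for standard metrics concrete, here is a minimal, hypothetical sketch (not from the article) of the kind of evaluation a shared protocol would standardise: a conventional equal error rate (EER) computed once against zero-effort impostors and once against spoofed trials. The score distributions are invented toy data; only the qualitative effect, a sharp rise in error under attack, is the point.

```python
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    """EER: the operating point where false-acceptance and false-rejection
    rates coincide, found by sweeping a threshold over all observed scores."""
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    far = np.array([(impostor_scores >= t).mean() for t in thresholds])
    frr = np.array([(genuine_scores < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2

# Toy verification scores: higher means "more likely the claimed speaker".
rng = np.random.default_rng(1)
genuine = rng.normal(2.0, 1.0, 1000)       # target trials
zero_effort = rng.normal(-2.0, 1.0, 1000)  # casual impostors
spoofed = rng.normal(1.5, 1.0, 1000)       # trials after a successful attack

baseline = equal_error_rate(genuine, zero_effort)
under_attack = equal_error_rate(genuine, spoofed)
assert under_attack > baseline  # spoofing inflates the apparent error rate
```

Without agreed corpora of spoofed trials and agreed metrics of this kind, numbers like `baseline` and `under_attack` cannot be compared across studies, which is precisely the gap the proposed initiative aims to close.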
The authors of this newsletter have organised a special session for the Interspeech 2013 conference on Spoofing and Countermeasures for Automatic Speaker Verification. It is intended to stimulate the discussion and collaboration needed to organize the collection of standard datasets of both licit and spoofed speaker verification transactions and the definition of standard metrics and evaluation protocols for future research in spoofing and generalised countermeasures. Ultimately, this initiative will require the expertise of different speech and language processing communities, e.g. those in voice conversion and speech synthesis, in addition to ASV.
[1] N. K. Ratha, J. H. Connell, and R. M. Bolle, "Enhancing security and privacy in biometrics-based authentication systems," IBM Systems Journal, vol. 40, no. 3, pp. 614-634, 2001.
[2] Y. Lau, D. Tran, and M. Wagner, "Testing voice mimicry with the YOHO speaker verification corpus," in Knowledge-Based Intelligent Information and Engineering Systems. Springer, 2005, pp. 907-907.
[3] J. Villalba and E. Lleida, "Preventing replay attacks on speaker verification systems," in Proc. IEEE International Carnahan Conference on Security Technology (ICCST), 2011, pp. 1-8.
[4] D. Matrouf, J.-F. Bonastre, and C. Fredouille, "Effect of speech transformation on impostor acceptance," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 1, 2006, pp. I-I.
[5] T. Kinnunen, Z.-Z. Wu, K. A. Lee, F. Sedlak, E. S. Chng, and H. Li, "Vulnerability of speaker verification systems against voice conversion spoofing attacks: The case of telephone speech," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012, pp. 4401-4404.
[6] P. L. De Leon, M. Pucher, and J. Yamagishi, "Evaluation of the vulnerability of speaker verification to synthetic speech," in Proc. IEEE Speaker and Language Recognition Workshop (Odyssey), 2010, pp. 151-158.
[7] T. Satoh, T. Masuko, T. Kobayashi, and K. Tokuda, "A robust speaker verification system against imposture using an HMM-based speech synthesis system," in Proc. Eurospeech, 2001.
[8] A. Ogihara, H. Unno, and A. Shiozaki, "Discrimination method of synthetic speech using pitch frequency against synthetic speech falsification," IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. 88, no. 1, pp. 280-286, Jan. 2005.
[9] P. L. De Leon, B. Stewart, and J. Yamagishi, "Synthetic speech discrimination using pitch pattern statistics derived from image analysis," in Proc. Interspeech, Portland, Oregon, USA, Sep. 2012.
[10] Z. Wu, T. Kinnunen, E. S. Chng, H. Li, and E. Ambikairajah, "A study on spoofing attack in state-of-the-art speaker verification: the telephone speech case," in Proc. Asia-Pacific Signal & Information Processing Association Annual Summit and Conference (APSIPA ASC), 2012, pp. 1-5.
[11] F. Alegre, A. Amehraye, and N. Evans, "Spoofing countermeasures to protect automatic speaker verification from voice conversion," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013.
Nicholas Evans is with EURECOM, Sophia Antipolis, France. His interests include speaker diarization, speaker recognition and multimodal biometrics. Email: firstname.lastname@example.org.
Junichi Yamagishi is with the National Institute of Informatics, Japan and with the University of Edinburgh, UK. His interests include speech synthesis and speaker adaptation. Email: email@example.com
Tomi H. Kinnunen is with the University of Eastern Finland (UEF). His current research interests include speaker verification, robust feature extraction and voice conversion. Email: firstname.lastname@example.org.