Computer Speech & Language

Volume 63, September 2020, 101096

Generalized end-to-end detection of spoofing attacks to automatic speaker recognizers

https://doi.org/10.1016/j.csl.2020.101096

Abstract

As automatic speaker recognition systems become mainstream, voice spoofing attacks are on the rise. Common attack strategies include replay and the use of text-to-speech synthesis and voice conversion systems. While previously proposed end-to-end detection frameworks have been shown to be effective in spotting attacks from one particular spoofing strategy, they have relied on different models, architectures, and speech representations, depending on the strategy. In practice, however, one has no a priori information regarding the strategy an attacker might employ to fool a speaker recognizer, so it is necessary to devise approaches able to detect attacks regardless of the strategy used to generate them. In this work, we introduce an end-to-end ensemble-based approach in which two models, previously shown to perform well on each considered attack strategy, are trained jointly, while a third model learns how to mix their outputs into a single score. Experimental results with replay and text-to-speech/voice conversion attacks show the proposed ensemble method achieving similar or superior performance when compared to systems specialized on each spoofing strategy separately.

Introduction

The increasing presence of smart portable devices in our daily lives, along with the fact that microphones are available on most of these devices, has rendered voice an appealing alternative for biometrics applications, attracting attention from both industry and academia. Under the text-independent setting, speaker verification is performed on unconstrained phrases of arbitrary phonetic content and length, while specific pass phrases are considered in the text-dependent case (Kinnunen and Li, 2010). As automatic speaker verification systems burgeon, fraud, in the form of so-called spoofing attacks, is also on the rise, with fraudsters attempting to fool speech-based biometric systems in order to gain unauthorized access or to perform fraudulent financial transactions.

Recently, the use of artificial neural networks has allowed automatic speaker verification to achieve state-of-the-art results, bypassing the need for classical features such as i-vectors (Dehak et al., 2011). Representative examples range from the use of neural networks to generate alternative embeddings (Bhattacharya et al., 2016; Snyder et al., 2018a) to systems trained in an end-to-end fashion (Rohdin et al., 2017; Li et al., 2017; Snyder et al., 2016). Despite such major breakthroughs, recent literature has shown that artificial neural networks (Goodfellow et al., 2016) can be vulnerable to imperceptible perturbations added to input examples, leading to incorrect predictions with high confidence (Goodfellow et al., 2014). Such perturbed examples are usually referred to as adversarial attacks, i.e., carefully crafted variations of genuine samples intentionally modified so as to confuse or fool undefended models.

Besides adversarial attacks, which may target neural networks in general regardless of their application context, other attack strategies exist in the specific case of speaker verification and voice biometric systems. These are often termed “spoofing attacks”: attempts by a person or computer program to overcome an authentication system by forging the data of a legitimate user. Spoofing attacks can be broadly classified into two strategies: (i) replay, also referred to as presentation or physical access (PA) attacks; and (ii) synthetic attacks, also referred to as logical access (LA) attacks (Todisco et al., 2019; Korshunov et al., 2016). The latest LA attacks have taken advantage of recent advances in speech synthesis and voice conversion based on auto-regressive waveform modeling or generative adversarial networks (van den Oord et al., 2016; Wang et al., 2017; Tamamori et al., 2017; Kaneko et al., 2017).

Given the serious consequences that spoofing attacks can have on speaker verification systems, recent research has focused on the development of new attack detection algorithms, and several challenges have been organized (e.g., Todisco et al., 2019; Wu et al., 2015; Kinnunen et al., 2017). Fig. 1 presents block diagrams of two possible settings where spoofing detectors are used in tandem with speaker verification systems. In both cases, the input corresponds to an audio signal along with a claimed identity. The spoofing detector can be applied after the claimed identity is verified as the target (i.e., true and claimed identities match), or the order can be reversed so that only samples classified as genuine by the spoofing detector are verified against the claimed identity. Examples of spoofing detection challenges include ASVspoof 2015 (Wu et al., 2015), which focused on speech synthesis and voice conversion spoofing, and the ASVspoof 2017 Challenge (Kinnunen et al., 2017), which was concerned with playback (replay) attacks. The recent 2019 edition of ASVspoof (Todisco et al., 2019), in turn, consisted of two sub-challenges, each involving only PA or only LA attacks.
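As a minimal sketch (not part of the paper's implementation), the two tandem configurations depicted in Fig. 1 can be expressed as follows; `is_genuine` and `verify_speaker` are hypothetical placeholder callables standing in for the spoofing detector and the speaker verifier, respectively:

```python
# Sketch of the two tandem settings in Fig. 1. Both `is_genuine` (spoofing
# detector) and `verify_speaker` (ASV system) are hypothetical callables.

def detector_first(audio, claimed_id, is_genuine, verify_speaker):
    """Only samples the detector classifies as genuine reach the verifier."""
    if not is_genuine(audio):
        return False  # rejected as a spoofing attack
    return verify_speaker(audio, claimed_id)

def verifier_first(audio, claimed_id, is_genuine, verify_speaker):
    """The detector is consulted only for trials verified as target."""
    if not verify_speaker(audio, claimed_id):
        return False  # rejected: claimed identity does not match
    return is_genuine(audio)
```

In either ordering, a trial is accepted only if the sample both matches the claimed identity and is classified as genuine speech.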

Even though the data released for such challenges are generated so that speakers and attack types vary across the train, development, and evaluation sets, the resulting detection systems rely on the strong (and unrealistic) assumption that train and test data are identically distributed, so that the same general attack strategy (LA or PA) appears in both train and test data. As a consequence, different models, architectures, and input features are used for PA and LA attacks. This strategy-specific configuration, however, is not aligned with practical, real-life scenarios where the attack strategy is not known a priori.

Settings in which train and test data distributions differ have been discussed in depth by the machine learning community. A notable example corresponds to the generalization results introduced in Ben-David et al. (2007), which bound the performance gap between train and test data distributions depending on how close those distributions are. These results motivated approaches such as that of Ganin et al. (2016), where an encoder is trained to map raw data to a space where features relevant to the task of interest are kept, while domain-specific cues are filtered away so that train and test data look alike in that space. One limitation of such an approach, however, is that it targets a specific test distribution and requires it to be somewhat close to the training one. An alternative is to condition predictions on domain-discriminating factors, i.e., rather than removing domain information from the data when encoding it, one keeps and conditions on domain-dependent factors during inference, such that the model accounts for variations in the data conditions.

In this paper, we aim to tackle the detection of spoofing attacks in a fashion that is closer to the scenarios detectors would face if deployed in real applications. More specifically, we address the following general problem: can we go beyond the common i.i.d. assumption in supervised learning (i.e., that both train and test data are independently sampled from the same distribution) and tackle a more general setting where a set of data distributions is available? Translated to the detection problem, the research question we pose is: can a single system be used to detect both LA and PA attacks? To answer that question, we assume two data sources are available: LA and PA attacks, each along with its genuine samples. We then follow the domain-conditional approach described above, so that a dedicated model (referred to as the mixture model) learns to discriminate which source each input is more likely to belong to, and how to best combine the outputs of two other models: one tailored for PA and another for LA spoofing strategies. The architectures, as well as the speech representations, for the LA and PA models are chosen taking into account previously reported best practices for each type of attack. Our assumption is that, by doing so, each such model will be able to detect attacks in its own distribution, while the mixture model will be able to assign importance to each output.

The 3-model ensemble is jointly trained in a single step using training data created by pooling together genuine, LA, and PA examples. Test examples are scored either by directly taking the outputs of the specialist models, by their combined output, or by the deviation from 0.5 of the output of the mixture model. We show that the proposed ensemble is able to outperform specialized models, i.e., those trained and evaluated on either PA or LA examples only, as well as individual ensemble component models trained on the same pooled training data. Our contributions can be summarized as follows:

  1. We introduce an end-to-end framework effective in detecting both LA and PA spoofing attacks;

  2. We evaluate which speech representation is most suitable for performing detection under each of the considered attack strategies; and

  3. We evaluate which speech representation is most suitable for identifying which type of attack was presented to the model.

The remainder of this paper is organized as follows: Section 2 discusses recent literature on the detection of spoofing attacks to speaker recognizers, domain-conditional strategies in other application domains, and strategies to summarize global information across speech data into compact representations. Section 3 details the approach we designed to generalize across attack strategies. The evaluation protocol is presented in Section 4 along with experiments results and discussion. Conclusions are finally drawn in Section 5.


Detection of spoofing attacks

A generative classifier was introduced in Wu et al. (2012b) following a similar approach to that of linear discriminant analysis (LDA), such that one generative model is trained per class. However, in this case, Gaussian Mixture Models (GMM) are employed for modelling the class-conditional features rather than simple Gaussians with shared covariances as in the LDA case. Given a sequence of feature vectors denoted by O corresponding to a speech signal, the genuine versus spoofed speech decision
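As a rough illustration of this class of decision rule (a sketch, not the system of Wu et al., 2012b), one can score a sequence of frames by the average log-likelihood ratio between a genuine-speech GMM and a spoofed-speech GMM. The toy single-feature mixtures below are illustrative placeholders, not fitted parameters:

```python
import math

# Minimal sketch of a GMM-based decision rule: one diagonal-covariance GMM
# per class, decision via average log-likelihood ratio over the frames of O.
# All mixture parameters below are toy values for illustration only.

def gmm_logpdf(x, weights, means, variances):
    """log p(x) under a 1-D Gaussian mixture (log-sum-exp over components)."""
    terms = [
        math.log(w) - 0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
        for w, m, v in zip(weights, means, variances)
    ]
    hi = max(terms)
    return hi + math.log(sum(math.exp(t - hi) for t in terms))

def llr_score(frames, gmm_genuine, gmm_spoof):
    """Average per-frame log-likelihood ratio; > 0 favours genuine speech."""
    ll_gen = sum(gmm_logpdf(x, *gmm_genuine) for x in frames)
    ll_spf = sum(gmm_logpdf(x, *gmm_spoof) for x in frames)
    return (ll_gen - ll_spf) / len(frames)

# Toy example: genuine frames centred near 0, spoofed frames near 3.
gmm_gen = ([0.5, 0.5], [0.0, 0.5], [1.0, 1.0])
gmm_spf = ([0.5, 0.5], [3.0, 3.5], [1.0, 1.0])
score = llr_score([0.1, -0.2, 0.3], gmm_gen, gmm_spf)  # positive: genuine-like
```

In practice each GMM is trained on multivariate acoustic features of its class; the one-dimensional case above only illustrates the scoring mechanics.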

Proposed model

Fig. 2 illustrates the strategy we propose in order to detect attacks generated using different strategies. Inputs x_LA, x_MIX, and x_PA correspond to features obtained from a given audio sample. We discuss the selection of the feature space in each case in Section 4. Each of M_LA, M_MIX, and M_PA is a mapping M: X → [0, 1], i.e., speech features are mapped into a score in [0, 1]. The output λ = M_MIX(x_MIX) is then used to compute a convex combination of the scores y_LA and y_PA:

y = λ·y_LA + (1 − λ)·y_PA,
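The fusion step above can be sketched as follows, assuming the three component models are available as callables returning scores in [0, 1] (model internals and feature extraction are omitted; the callable names are placeholders, not the paper's code). The scoring modes mirror the variants described earlier: specialist outputs taken directly, the combined output, or the deviation of the mixture output from 0.5:

```python
# Sketch of the ensemble's score fusion: lam = M_MIX(x_MIX) weighs the
# outputs of the LA and PA specialist models.

def fuse(y_la, y_pa, lam):
    """Convex combination y = lam * y_LA + (1 - lam) * y_PA."""
    assert 0.0 <= lam <= 1.0
    return lam * y_la + (1.0 - lam) * y_pa

def ensemble_score(x_la, x_mix, x_pa, m_la, m_mix, m_pa, mode="fused"):
    y_la, y_pa = m_la(x_la), m_pa(x_pa)
    lam = m_mix(x_mix)
    if mode == "fused":           # combined output of the two specialists
        return fuse(y_la, y_pa, lam)
    if mode == "la":              # LA specialist output taken directly
        return y_la
    if mode == "pa":              # PA specialist output taken directly
        return y_pa
    if mode == "mix_deviation":   # deviation from 0.5 of the mixture output
        return abs(lam - 0.5)
    raise ValueError(f"unknown scoring mode: {mode}")
```

A mixture output λ close to 1 hands the decision to the LA specialist, λ close to 0 to the PA specialist, and intermediate values blend the two.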

Experimental setup and evaluation

The approach discussed herein is evaluated under the conditions introduced for the ASVspoof 2019 challenge. Two types of attacks are considered: logical and physical access. While the former corresponds to synthetic speech created using both voice conversion and text-to-speech systems, the latter consists of playbacks simulated from genuine recordings considering exhaustive combinations of 3 room sizes, 3 distances to the microphone, and 3 levels of reverberation. Dataset details are reported
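The simulated replay conditions described above form an exhaustive 3 × 3 × 3 grid; as a small illustration (the condition labels below are placeholders, not the challenge's actual parameter values), the 27 combinations can be enumerated as:

```python
from itertools import product

# Exhaustive grid of simulated physical-access conditions: 3 room sizes x
# 3 microphone distances x 3 reverberation levels (labels are illustrative).
room_sizes = ("small", "medium", "large")
mic_distances = ("near", "mid", "far")
reverb_levels = ("low", "medium", "high")

pa_conditions = list(product(room_sizes, mic_distances, reverb_levels))
print(len(pa_conditions))  # 27 simulated replay environments
```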

Conclusion

In this contribution, we introduced an ensemble-based approach with the goal of enabling detectors to be effective across varying types of spoofing attacks to speaker verification systems. We proposed a setting with three components, two of which are known to perform well individually on each of the two considered attack strategies. The third is then trained to decide how to combine the decisions of the other two, depending on the input it is presented with.

Acknowledgement

The authors wish to acknowledge funding from the Natural Sciences and Engineering Research Council of Canada (NSERC) through grant RGPIN-2019-05381. We also wish to acknowledge the support of NVIDIA Corporation with the donation of a Titan Xp GPU used for this research. The first author was funded by the Bourse du CRIM pour Études Supérieures.

References (64)

  • T. Kinnunen et al., The ASVspoof 2017 challenge: assessing the limits of replay spoofing attack detection, Eighteenth Annual Conference of the International Speech Communication Association (2017).
  • H. Muckenhirn et al., End-to-end convolutional neural network-based voice presentation attack detection, 2017 IEEE International Joint Conference on Biometrics (IJCB) (2017).
  • Rohdin, J., Silnova, A., Diez, M., Plchot, O., Matejka, P., Burget, L., 2017. End-to-end DNN based speaker recognition...
  • J. Alam et al., Spoofing detection employing infinite impulse response constant Q transform-based feature representations, 2017 25th European Signal Processing Conference (EUSIPCO) (2017).
  • M.J. Alam et al., Boosting the performance of spoofing detection systems on replay attacks using q-logarithm domain feature normalization, Proc. Odyssey 2018 The Speaker and Language Recognition Workshop (2018).
  • M.J. Alam et al., Development of CRIM system for the automatic speaker verification spoofing and countermeasures challenge 2015, Sixteenth Annual Conference of the International Speech Communication Association (2015).
  • M.J. Alam et al., Spoofing detection on the ASVspoof2015 challenge corpus employing deep neural networks, Proc. Odyssey (2016).
  • Bai, S., Kolter, J. Z., Koltun, V., 2018. An empirical evaluation of generic convolutional and recurrent networks for...
  • Bartlett, P. L., Helmbold, D. P., Long, P. M., 2018. Gradient descent with identity initialization efficiently learns...
  • S. Ben-David et al., Analysis of representations for domain adaptation, Advances in Neural Information Processing Systems (2007).
  • G. Bhattacharya et al., Modelling speaker and channel variability using deep neural networks for robust speaker verification, Spoken Language Technology Workshop (SLT) (2016).
  • N. Chen et al., Robust deep feature for spoofing detection: the SJTU system for ASVspoof 2015 challenge, Sixteenth Annual Conference of the International Speech Communication Association (2015).
  • N. Dehak et al., Front-end factor analysis for speaker verification, IEEE Trans. Audio Speech Lang. Process. (2011).
  • M. Delcroix et al., Auxiliary feature based adaptation of end-to-end ASR systems, Interspeech (2018).
  • Y. Ganin et al., Domain-adversarial training of neural networks, J. Mach. Learn. Res. (2016).
  • I. Goodfellow et al., Deep Learning (2016).
  • Goodfellow, I. J., Shlens, J., Szegedy, C., 2014. Explaining and harnessing adversarial examples....
  • Hardt, M., Ma, T., 2016. Identity matters in deep learning....
  • H. He et al., Learning from imbalanced data, IEEE Trans. Knowl. Data Eng. (2009).
  • K. He et al., Deep residual learning for image recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016).
  • T. Kaneko et al., Generative adversarial network-based postfilter for statistical parametric speech synthesis, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2017).
  • M. Karafiát et al., i-vector-based discriminative adaptation for automatic speech recognition, 2011 IEEE Workshop on Automatic Speech Recognition & Understanding (2011).
  • Keskar, N. S., McCann, B., Varshney, L. R., Xiong, C., Socher, R., 2019. CTRL: a conditional transformer language model...
  • Kingma, D. P., Ba, J., 2014. Adam: a method for stochastic optimization....
  • Kinnunen, T., Lee, K. A., Delgado, H., Evans, N., Todisco, M., Sahidullah, M., Yamagishi, J., Reynolds, D. A., 2018....
  • T. Kinnunen et al., An overview of text-independent speaker recognition: from features to supervectors, Speech Commun. (2010).
  • P. Korshunov et al., Overview of BTAS 2016 speaker anti-spoofing competition, 2016 IEEE 8th International Conference on Biometrics Theory, Applications and Systems (BTAS) (2016).
  • C.-I. Lai et al., Attentive filtering networks for audio replay attack detection, ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2019).
  • Li, C., Ma, X., Jiang, B., Li, X., Zhang, X., Liu, X., Cao, Y., Kannan, A., Zhu, Z., 2017. Deep speaker: an end-to-end...
  • Mirza, M., Osindero, S., 2014. Conditional generative adversarial nets....
  • J. Monteiro et al., Combining speaker recognition and metric learning for speaker-dependent representation learning, Twentieth Annual Conference of the International Speech Communication Association (2019).
  • J. Monteiro et al., End-to-end detection of attacks to automatic speaker recognizers with time-attentive light convolutional neural networks, IEEE International Workshop on Machine Learning for Signal Processing (2019).