Generalized end-to-end detection of spoofing attacks to automatic speaker recognizers
Introduction
The increasing presence of smart portable devices in our daily lives, along with the fact that microphones are available on most of these devices, has rendered voice an appealing modality for biometrics applications, attracting attention from both industry and academia. Under the text-independent setting, speaker verification is performed on unconstrained phrases of arbitrary phonetic content and length, while specific pass phrases are used in the text-dependent case (Kinnunen and Li, 2010). As automatic speaker verification systems proliferate, so does fraud: so-called spoofing attacks are on the rise, as fraudsters attempt to fool speech-based biometric systems in order to gain unauthorized access or to perform fraudulent financial transactions.
Recently, the use of artificial neural networks has allowed automatic speaker verification to achieve state-of-the-art results, bypassing the need for classical features such as i-vectors (Dehak et al., 2011). Representative examples range from the use of neural networks to generate alternative embeddings (Bhattacharya et al., 2016; Snyder et al., 2018a) to systems trained in an end-to-end fashion (Rohdin et al., 2017; Li et al., 2017; Snyder et al., 2016). Despite such major breakthroughs, recent literature has shown that artificial neural networks (Goodfellow et al., 2016) can be vulnerable to imperceptible perturbations added to input examples, leading to incorrect predictions with high confidence (Goodfellow et al., 2014). Such perturbed examples are usually referred to as adversarial attacks, i.e., carefully crafted variations of genuine samples intentionally modified so as to confuse or fool undefended models.
Besides adversarial attacks, which may target neural networks in general regardless of their application context, in the specific case of speaker verification and voice biometric systems other attack strategies exist, often termed “spoofing attacks”: attempts by a person or computer program to overcome an authentication system by forging the data of a legitimate user. Spoofing attacks can be broadly classified into two strategies: (i) replay attacks, also referred to as presentation or physical access (PA) attacks; and (ii) synthetic attacks, also referred to as logical access (LA) attacks (Todisco et al., 2019; Korshunov et al., 2016). The latest LA attacks take advantage of recent advances in speech synthesis and voice conversion based on auto-regressive waveform modeling or generative adversarial networks (van den Oord et al., 2016; Wang et al., 2017; Tamamori et al., 2017; Kaneko et al., 2017).
Given the serious consequences that spoofing attacks can have on speaker verification systems, recent research has focused on the development of new attack detection algorithms, and several challenges have been organized (e.g., Todisco et al., 2019; Wu et al., 2015; Kinnunen et al., 2017). Fig. 1 presents block diagrams of two possible settings where spoofing detectors are used in tandem with speaker verification systems. In both cases, the input corresponds to an audio signal along with a claimed identity. The spoofing detector can be applied only after the claimed identity is verified as target, i.e., true and claimed identities match, or the order can be reversed so that only samples classified as genuine by the spoofing detector are verified against the claimed identity. Examples of such challenges include ASVspoof 2015 (Wu et al., 2015), which focused on speech synthesis and voice conversion spoofing, and the ASVspoof 2017 challenge (Kinnunen et al., 2017), which was concerned with playback (replay) attacks. The recent 2019 edition of ASVspoof (Todisco et al., 2019), in turn, consisted of two sub-challenges, each involving only PA or only LA attacks.
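The two tandem arrangements in Fig. 1 can be sketched as simple cascades. The snippet below is an illustrative sketch only: `asv_score`, `cm_score`, and the 0.5 thresholds are hypothetical placeholders, not the systems or operating points used in the paper.

```python
def asv_accepts(audio, claimed_id, asv_score, threshold=0.5):
    """Speaker verification step: accept if the ASV score for the
    claimed identity exceeds the threshold (hypothetical scorer)."""
    return asv_score(audio, claimed_id) >= threshold

def cm_accepts(audio, cm_score, threshold=0.5):
    """Countermeasure step: accept if the spoofing detector scores
    the sample as genuine speech (hypothetical scorer)."""
    return cm_score(audio) >= threshold

def asv_then_cm(audio, claimed_id, asv_score, cm_score):
    # Setting (a): run the spoofing detector only on samples whose
    # claimed identity was verified as target.
    return asv_accepts(audio, claimed_id, asv_score) and \
           cm_accepts(audio, cm_score)

def cm_then_asv(audio, claimed_id, asv_score, cm_score):
    # Setting (b): verify the claimed identity only for samples the
    # spoofing detector classified as genuine.
    return cm_accepts(audio, cm_score) and \
           asv_accepts(audio, claimed_id, asv_score)
```

Both orders accept only samples that pass both checks; they differ in which subsystem sees every input, which matters for computational cost and for where errors are made.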
Even though the data released for such challenges are generated so that speakers and attack types vary across train, development, and evaluation sets, the detection systems developed for them rely on the strong (and unrealistic) assumption that train and test data are identically distributed, i.e., that the same general attack strategy (LA or PA) will appear in both train and test data. As a consequence, different models, architectures, and input features are used for PA and for LA attacks. This strategy-specific configuration, however, is not aligned with practical, real-life scenarios, where the attack strategy is not known a priori.
Settings where train and test data distributions differ have been discussed in depth by the machine learning community. A notable example corresponds to the generalization results introduced in Ben-David et al. (2007), which bound the performance gap between train and test distributions as a function of how close those distributions are. These results motivated approaches such as the one discussed in Ganin et al. (2016), where an encoder is trained to map raw data to a space where features relevant to the task of interest are kept while domain-specific cues are filtered away, so that train and test data look alike in that space. One limitation of such an approach, however, is that it targets a specific test distribution and requires it to be somewhat close to the training one. An alternative approach corresponds to conditioning predictions on domain-discriminating factors: rather than removing domain information from the data when encoding it, one keeps and conditions on domain-dependent factors during inference, so that the model accounts for variations in the data conditions.
In this paper, we aim to tackle the detection of spoofing attacks in a fashion that is closer to the scenarios detectors would face if deployed in real applications. More specifically, we address the following general problem: can we go beyond the common i.i.d. assumption in supervised learning (i.e., that both train and test data are independently sampled from the same distribution) and tackle a more general setting where a set of data distributions is available? Translated to the detection problem, the research question we pose is: can a single system be used to detect both LA and PA attacks? To answer that question, we assume two data sources are available: LA and PA attacks, each along with its genuine samples. We then adopt the domain-conditional approach described above, so that a dedicated model (referred to as the mixture model) learns to discriminate to which source each input is more likely to belong, and how to best combine the outputs of two other models: one tailored for PA and another for LA spoofing strategies. The architectures, as well as the speech representations used by the LA and PA models, are chosen based on previously reported best practices for each type of attack. Our assumption is that, by doing so, each specialist model will be able to detect attacks in its own distribution, while the mixture model will learn to assign importance to each output.
The 3-model ensemble is jointly trained in a single step using train data created by pooling together genuine, LA, and PA examples. Scoring of test examples is performed by directly taking the outputs of the specialist models, their combined output, or the deviation from 0.5 of the output of the mixture model. We show that the proposed ensemble outperforms specialized models, i.e., those trained and evaluated on either PA or LA examples only, as well as individual ensemble components trained on the same pooled training data. Our contributions can be summarized as follows:
1. We introduce an end-to-end framework effective in detecting both LA and PA spoofing attacks;
2. We evaluate which speech representation is more suitable for performing detection under each of the considered attack strategies; and
3. We evaluate which speech representation is more suitable for detecting which type of attack was presented to the model.
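For a single test example, the three scoring options described above can be written out directly. The numbers below are illustrative stand-ins for model outputs, and the orientation of the mixture weight (the mixture output weighting the LA score) is an assumption made for this sketch.

```python
def combined_score(y_la, y_pa, y_mix):
    # Convex combination of the specialist scores, weighted by the
    # mixture model's output (assumed here to weight the LA score).
    return y_mix * y_la + (1.0 - y_mix) * y_pa

def mixture_deviation_score(y_mix):
    # Deviation from 0.5 of the mixture output: a confident source
    # assignment is itself informative at test time.
    return abs(y_mix - 0.5)

# Hypothetical outputs of the LA, PA, and mixture models for one sample.
y_la, y_pa, y_mix = 0.9, 0.2, 0.8

scores = {
    "specialist_la": y_la,
    "specialist_pa": y_pa,
    "combined": combined_score(y_la, y_pa, y_mix),
    "mixture_deviation": mixture_deviation_score(y_mix),
}
```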
The remainder of this paper is organized as follows: Section 2 discusses recent literature on the detection of spoofing attacks to speaker recognizers, domain-conditional strategies in other application domains, and strategies to summarize global information across speech data into compact representations. Section 3 details the approach we designed to generalize across attack strategies. The evaluation protocol is presented in Section 4, along with experimental results and discussion. Conclusions are drawn in Section 5.
Detection of spoofing attacks
A generative classifier was introduced in Wu et al. (2012b) following a similar approach to that of linear discriminant analysis (LDA), such that one generative model is trained per class. However, in this case, Gaussian Mixture Models (GMM) are employed for modelling the class-conditional features rather than simple Gaussians with shared covariances as in the LDA case. Given a sequence of feature vectors denoted by O corresponding to a speech signal, the genuine versus spoofed speech decision
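The per-class generative scoring described above typically reduces to an average per-frame log-likelihood ratio between the genuine and spoofed GMMs. Below is a minimal sketch of the scoring step only, with hand-set diagonal-covariance GMM parameters; the EM fitting used to obtain class-conditional GMMs in practice is omitted.

```python
import numpy as np

def gmm_loglik(frames, weights, means, variances):
    """Frame-wise log-likelihood under a diagonal-covariance GMM.
    frames: (T, D); weights: (K,); means, variances: (K, D)."""
    diff = frames[:, None, :] - means[None, :, :]          # (T, K, D)
    log_comp = -0.5 * (np.sum(diff**2 / variances, axis=2)
                       + np.sum(np.log(2 * np.pi * variances), axis=1))
    # Stable log-sum-exp over components, including mixture weights.
    log_w = np.log(weights)
    m = np.max(log_comp + log_w, axis=1, keepdims=True)
    return (m.squeeze(1)
            + np.log(np.sum(np.exp(log_comp + log_w - m), axis=1)))

def llr_score(frames, gmm_genuine, gmm_spoof):
    """Average per-frame log-likelihood ratio; positive scores favour
    the genuine-speech hypothesis."""
    return float(np.mean(gmm_loglik(frames, *gmm_genuine)
                         - gmm_loglik(frames, *gmm_spoof)))
```

In this sketch each GMM is a `(weights, means, variances)` tuple; in practice both models would be fitted on training features of the respective class and the score thresholded for the final decision.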
Proposed model
Fig. 2 illustrates the strategy we propose in order to detect attacks generated using different strategies. Inputs xLA, xMIX, and xPA correspond to features obtained from a given audio sample; the selection of the feature space in each case is discussed in Section 4. Each of MLA, MMIX, and MPA is a mapping M: X → [0, 1], i.e., speech features are mapped into a score in [0, 1]. The output yMIX = MMIX(xMIX) is then used to compute a convex combination of the scores yLA and yPA: y = yMIX yLA + (1 − yMIX) yPA.
Experimental setup and evaluation
The approach discussed herein is evaluated under the conditions introduced for the ASVspoof 2019 challenge. Two types of attacks are considered: logical and physical access. While the former corresponds to synthetic speech created using both voice conversion and text-to-speech systems, the latter consists of playbacks simulated from genuine recordings considering exhaustive combinations of 3 room sizes, 3 distances to the microphone, and 3 levels of reverberation. Dataset details are reported
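The exhaustive 3 × 3 × 3 grid of simulated replay conditions can be enumerated directly. The level names below are illustrative placeholders; ASVspoof 2019 defines its own categories for each factor.

```python
from itertools import product

# Three simulated factors, three levels each (placeholder labels).
room_sizes = ["small", "medium", "large"]
mic_distances = ["short", "medium", "long"]
reverb_levels = ["low", "medium", "high"]

# Every simulated physical-access condition is one cell of this grid:
# 27 distinct acoustic configurations in total.
pa_conditions = list(product(room_sizes, mic_distances, reverb_levels))
```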
Conclusion
In this contribution, we introduced an ensemble-based approach with the goal of enabling detectors to be effective across varying types of spoofing attacks to speaker verification systems. We proposed a setting containing three components, two of which are known to perform well individually on each of the two considered attack strategies. The third is then trained to decide how to combine the decisions of the other two systems depending on the input it is presented with.
Acknowledgement
The authors wish to acknowledge funding from the Natural Sciences and Engineering Research Council of Canada (NSERC) through grant RGPIN-2019-05381. We also wish to acknowledge the support of NVIDIA Corporation with the donation of a Titan Xp GPU used for this research. The first author was funded by the Bourse du CRIM pour Études Supérieures.
References (64)
- et al., The ASVspoof 2017 challenge: assessing the limits of replay spoofing attack detection. Eighteenth Annual Conference of the International Speech Communication Association (2017)
- et al., End-to-end convolutional neural network-based voice presentation attack detection. 2017 IEEE International Joint Conference on Biometrics (IJCB) (2017)
- Rohdin, J., Silnova, A., Diez, M., Plchot, O., Matejka, P., Burget, L., 2017. End-to-end DNN based speaker recognition...
- et al., Spoofing detection employing infinite impulse response constant Q transform-based feature representations. 2017 25th European Signal Processing Conference (EUSIPCO) (2017)
- et al., Boosting the performance of spoofing detection systems on replay attacks using q-logarithm domain feature normalization. Proc. Odyssey 2018: The Speaker and Language Recognition Workshop (2018)
- et al., Development of CRIM system for the automatic speaker verification spoofing and countermeasures challenge 2015. Sixteenth Annual Conference of the International Speech Communication Association (2015)
- et al., Spoofing detection on the ASVspoof2015 challenge corpus employing deep neural networks. Proc. Odyssey (2016)
- Bai, S., Kolter, J. Z., Koltun, V., 2018. An empirical evaluation of generic convolutional and recurrent networks for...
- Bartlett, P. L., Helmbold, D. P., Long, P. M., 2018. Gradient descent with identity initialization efficiently learns...
- et al., Analysis of representations for domain adaptation. Advances in Neural Information Processing Systems (2007)
- Modelling speaker and channel variability using deep neural networks for robust speaker verification. Spoken Language Technology Workshop (SLT)
- Robust deep feature for spoofing detection: the SJTU system for ASVspoof 2015 challenge. Sixteenth Annual Conference of the International Speech Communication Association
- Front-end factor analysis for speaker verification. IEEE Trans. Audio Speech Lang. Process.
- Auxiliary feature based adaptation of end-to-end ASR systems. Interspeech
- Domain-adversarial training of neural networks. J. Mach. Learn. Res.
- Deep Learning
- Learning from imbalanced data. IEEE Trans. Knowl. Data Eng.
- Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- Generative adversarial network-based postfilter for statistical parametric speech synthesis. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- iVector-based discriminative adaptation for automatic speech recognition. 2011 IEEE Workshop on Automatic Speech Recognition & Understanding
- An overview of text-independent speaker recognition: from features to supervectors. Speech Commun.
- Overview of BTAS 2016 speaker anti-spoofing competition. 2016 IEEE 8th International Conference on Biometrics Theory, Applications and Systems (BTAS)
- Attentive filtering networks for audio replay attack detection. ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- Combining speaker recognition and metric learning for speaker-dependent representation learning. Twentieth Annual Conference of the International Speech Communication Association
- End-to-end detection of attacks to automatic speaker recognizers with time-attentive light convolutional neural networks. IEEE International Workshop on Machine Learning for Signal Processing