On the use of blind channel response estimation and a residual neural network to detect physical access attacks to speaker verification systems

https://doi.org/10.1016/j.csl.2020.101163

Highlights

  • The use of blind channel response estimation as a new approach for replay attack detection.

  • The proposed method outperformed the baseline systems on two spoofing datasets.

  • Further improvement achieved after combining recent deep learning models.

  • Front- and back-end based on single feature extraction and single neural network classifier.

Abstract

Spoofing attacks have been acknowledged as a serious threat to automatic speaker verification (ASV) systems. In this paper, we are specifically concerned with replay attack scenarios. As a countermeasure, we propose a front-end based on the blind estimation of the channel response magnitude and, as a back-end, a residual neural network. Our hypothesis is that the magnitude response of the channel, obtained by subtracting the predicted average log-magnitude spectrum of the observed signal's clean counterpart from the log-magnitude spectrum of the observed signal, will capture the nuances of the room ambience and of the recording and playback devices. The performance of these features is investigated with a benchmark back-end based on a Gaussian mixture model and with a deep neural network classifier. Our experiments are performed on the 2017 and 2019 Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof) datasets. The benchmark systems are the same as those used in the challenges and are based on constant-Q cepstral coefficients (CQCC) and linear-frequency cepstral coefficients (LFCC). Experimental results on the 2017 dataset show that the proposed method outperforms the two benchmarks, providing equal error rates (EER) as low as 7.57% and 11.64% on the development and evaluation sets, respectively. On the ASVspoof 2019 dataset, in turn, the proposed method with a residual neural network back-end outperformed the benchmark, yielding a tandem detection cost function (t-DCF) and an EER as low as 0.1086 and 4.26%, respectively, on the evaluation set. Lastly, an instrumental (objective) quality assessment is performed on the two datasets, and the impact of quality variability on spoofing detection accuracy is discussed.

Introduction

Automatic speaker verification (ASV) has matured significantly over the last few years (Wu et al., 2015b). Advances in channel compensation techniques (Rahman et al., 2018; Misra and Hansen, 2018) and the use of deep learning embeddings, such as x-vectors (Snyder et al., 2018; Chung et al., 2018), have further improved ASV performance. Commercial mobile voice recognition products are already being deployed (Aware.com, 2017). To enhance password-based authentication mechanisms, for instance, a number of financial institutions are investing in voice authentication solutions (Biometricupdate.com, 2018). This is driven mainly by the increased use of mobile devices, as well as by the convenience and non-intrusiveness of such technologies. In fact, recent reports predict continued growth of the mobile biometrics sector due to increased consumer demand for safety, especially when using mobile devices for banking transactions and e-commerce (Biometricupdate.com, 2018).

Despite all these advances, malicious spoofing attacks have been recognized as a serious threat to ASV (Wu et al., 2015b). A spoofing attack is an attempt by a person or a program to illegitimately bypass security by masquerading as someone else, and there are growing concerns about the vulnerability of ASV to such attacks, including impersonation, replay attacks, speech synthesis, and voice conversion (Wu et al., 2015b). Accordingly, several initiatives to develop spoofing countermeasures have been launched recently (Wu et al., 2015; Kinnunen et al.). Many of these efforts have focused on anti-spoofing techniques that protect ASV systems against speech synthesis (SS) and voice conversion (VC) (Wu et al., 2015a). In this study, we are particularly interested in countermeasures to replay attacks, which consist of attempts to fool an ASV system by playing back a pre-recorded speech sample. In such circumstances, detecting the replay attack beforehand is crucial to maintaining ASV reliability.

Given the emerging interest in the topic, the Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof) was created, with the latest editions held in 2017 (Kinnunen et al., 2017) and 2019 (ASVspoof 2019: Automatic speaker verification spoofing and countermeasures challenge evaluation plan, 2019). Henceforth referred to as ASVspoof 2017/2019, these challenges provided common databases, protocols, and metrics to evaluate different countermeasure solutions. Both competitions contained datasets with diverse forms of replay attacks, with the 2019 set providing a larger number of replay attack configurations. Until then, replay attacks, also known as physical attacks, had received little attention from the research community compared to other spoofing modalities (e.g., speech synthesis).

In Dinkel et al. (2017), for example, the authors propose an end-to-end countermeasure solution using the raw waveform. While many countermeasure methods combine feature extraction with a back-end classifier, this method removes the need for any pre-processing of the speech waveform. Although feature engineering can make the extracted features easier to interpret, it may discard rich speech information that an end-to-end approach could find in the raw waveform. In Tom et al. (2018), the authors proposed a deep residual neural network (ResNet-18) architecture, with a visual attention mechanism on time-frequency representations based on group-delay features, as a countermeasure for replay attacks. The reported equal error rates (EER) were quite low, but results were given only for the ASVspoof 2017 dataset. In Suthokumar et al. (2018), the authors proposed to complement short-term spectral features with two novel features based on the modulation spectrum. These features capture static and dynamic characteristics of the speech signal from the modulation spectrum, which complement short-term spectral features for use in replay detection. The authors in Gałka et al. (2015), in turn, relied on spectral bitmaps, or spectral peaks, which are time-frequency points higher than a pre-defined threshold. The similarity score was obtained by computing an element-wise product between the spectral bitmap of the verification sample and stored spectral bitmap templates. More recently, the performance of several features and classifiers was described in Hanilçi (2017). That study reported results from six magnitude-spectrum and three phase-spectrum-based features on the ASVspoof 2017 replay attack detection challenge, with experiments revealing the superiority of the magnitude-spectrum features over phase-based features for all four classifiers tested.

An attentive filtering network combined with a ResNet-based classifier was proposed in Lai et al. (2019), yielding more discriminative features in both the time and frequency domains. In Monteiro et al. (2020), the authors address the problem of generalizing countermeasure solutions across different spoofing attacks. They proposed a method to detect different types of attacks (e.g., physical and logical attacks) without prior knowledge of the strategy employed to generate them, enabling detectors to be effective over a larger range of spoofing attack types. To that end, an ensemble of three components, known to perform well individually in each of the two attack strategies considered, was proposed.

Despite the recent advancements in this field, new countermeasure solutions applicable to emerging and more challenging scenarios still need to be investigated. In this work, we propose the use of blind channel spectrum estimation in combination with deep neural network-based classification to detect replay attacks. Considering that in a replay attack the utterance will be acoustically affected by factors such as the room environment, the recording device, and the playback device, it is expected that such effects will generate a unique “signature” in the signal’s log-magnitude spectrum. Hence, we propose to detect such spectral signatures by estimating the magnitude response of the channel.
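The reasoning behind this spectral signature can be made explicit with a short derivation. This is a sketch under a standard assumption that is ours, not stated in the text: the observed signal is the clean speech convolved with a linear time-invariant channel whose impulse response is short relative to the analysis frame. The symbols (s for clean speech, h for the channel, y for the observed signal, k for the frequency bin, l for the frame index, L for the number of frames) are introduced here for illustration only.

```latex
\begin{align}
y(n) &= s(n) * h(n) \\
|Y(k,l)| &\approx |S(k,l)|\,|H(k)| \\
\log|Y(k,l)| &\approx \log|S(k,l)| + \log|H(k)| \\
\log|\hat{H}(k)| &= \frac{1}{L}\sum_{l=1}^{L}\Big(\log|Y(k,l)| - \log|\hat{S}(k,l)|\Big)
\end{align}
```

Under this model the channel appears as an additive, frame-independent term in the log-magnitude domain, so averaging the difference between the observed log-spectrum and a prediction of the clean log-spectrum isolates it. In a replay attack, h would combine the responses of the room, the recording device, and the playback device.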

Here, this is achieved by first training a clean speech Gaussian mixture model (GMM). The model is trained using RASTA-filtered mel-frequency cepstral coefficients (RASTA-MFCCs) extracted from several clean speech files, thus allowing us to obtain a model of clean spectrum characteristics. The channel response spectrum is then estimated by computing the average log-magnitude spectrum of clean signals and then subtracting it from the log-magnitude spectrum of the observed signal. As a classifier, we adopted the benchmark GMM to distinguish between bona fide and spoofed utterances. Next, motivated by the recent results obtained with ResNets (Glorot and Bengio, 2010; He et al., 2016), we also explore the use of such networks.
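The subtraction step described above can be sketched in a few lines of NumPy. This is a simplified illustration, not the authors' implementation: it takes the average clean log-magnitude spectrum as a given input instead of predicting it with a GMM over RASTA-MFCCs, and the function names, frame length, and hop size are hypothetical choices made for the example.

```python
import numpy as np


def log_magnitude_frames(signal, frame_len=512, hop=256, eps=1e-10):
    """Return the log-magnitude spectrum of each Hann-windowed frame."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop:i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # eps avoids log(0) for silent bins
    return np.log(np.abs(np.fft.rfft(frames, axis=1)) + eps)


def estimate_channel_log_magnitude(observed, clean_avg_log_spectrum,
                                   frame_len=512, hop=256):
    """Estimate the channel log-magnitude response.

    Assumption (ours): the clean average log-spectrum is supplied directly.
    In the described method it would be predicted from a clean-speech GMM.
    """
    observed_avg = log_magnitude_frames(observed, frame_len, hop).mean(axis=0)
    # Observed average minus clean average leaves the channel term
    return observed_avg - clean_avg_log_spectrum
```

With a 512-sample frame the estimate has 257 frequency bins. The resolution of the estimate is controlled by the frame length, which in this sketch is a free parameter.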

Experimental results show that the proposed method outperforms the benchmarks on both the development and evaluation sets of the ASVspoof 2017 and 2019 datasets. To the best of our knowledge, only a few studies have addressed the use of channel estimation as a countermeasure solution. In Nagarsheth et al. (2017), for example, the authors proposed the use of two low-level descriptors, the constant-Q cepstral coefficients (CQCC) and the high-frequency cepstral coefficients (HFCC), as input to a convolutional neural network (CNN). The authors claim that the CNN model estimates the channel conditions, although no clear explanation is given of how the channel is estimated. The present work provides a more complete investigation of the use of channel estimation, representing an important contribution towards mitigating the problem of replay spoofing attacks. Moreover, compared to our previous work (Avila et al., 2019), this study (1) presents additional and improved results on an extended dataset; (2) evaluates the impact of the resolution of the channel estimation approach on spoofing detection performance; (3) performs a quality analysis of the two datasets being tested and discusses the impact of signal quality on spoofing detection accuracy; and (4) presents performance improvements with ResNet, while comparing results with a state-of-the-art algorithm.

The remainder of this paper is organized as follows. Section 2 describes the proposed method and Section 3 the proposed deep neural network classifier. Section 4 presents our experimental setup and Section 5 discusses our experimental results. Section 6 concludes the paper.

Section snippets

Blind channel response estimation

In this section, the general ideas behind blind channel response estimation are described, along with the steps to obtain the average spectra of clean speech, followed by estimation of the channel response magnitude. Lastly, we give a short description of MFCC extraction and some insights into the RASTA filtering procedure.

Residual neural network

Deep residual networks (ResNets) were proposed initially in He et al. (2016a) as a strategy to mitigate the problem of vanishing gradients, encountered while optimizing a deep neural network. Since their introduction, the training of considerably deep models (e.g., over 50 layers) has become more feasible. It has been shown, for example, that training and testing errors increase with deeper networks (He et al., 2016a), not necessarily due to overfitting (He et al., 2016a), but also

Experimental setup

In this section, we present the datasets used throughout our experiments, the adopted benchmark features and classifiers, a no-reference perceptual model used to evaluate the quality of the datasets, as well as the figure-of-merit.
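The snippet above does not name the figure-of-merit, but the abstract reports results as equal error rate (EER) and t-DCF. As an illustration of the former only, the following is a minimal sketch of an EER computation from detection scores. It assumes, as a convention chosen here, that higher scores indicate bona fide speech; the function name is hypothetical, and the challenge toolkits may compute the EER differently (for example, with interpolation between operating points).

```python
import numpy as np


def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate the EER by scanning every observed score as a threshold."""
    bonafide = np.asarray(bonafide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)
    thresholds = np.sort(np.concatenate([bonafide, spoof]))
    # False rejection: bona fide trials scored below the threshold
    frr = np.array([(bonafide < t).mean() for t in thresholds])
    # False acceptance: spoofed trials scored at or above the threshold
    far = np.array([(spoof >= t).mean() for t in thresholds])
    # The EER is the operating point where both error rates are closest
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0
```

For perfectly separated score sets this returns 0, and for identical score distributions it returns 0.5, which corresponds to chance-level detection.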

Experimental results and discussion

In this section, we describe two experiments along with the respective discussions on the achieved results. The first experiment is performed on the ASVspoof Challenge 2017 dataset, where we compare the performance of the proposed method to the baseline systems. Then, similar experiments are performed on the ASVspoof Challenge 2019 dataset. We also discuss the role of perceptual quality on the performance of our model.

Conclusions

In this paper, we proposed the use of blind channel response estimation as a new approach for replay attack detection. Our assumption is that the nuances of the acoustic ambience, microphones and playback devices present in the spectrum contain enough information to distinguish between a bona fide utterance and a spoofed one. We explored a baseline back-end based on Gaussian mixture models, as well as a deep residual neural network classifier. Experiments on the ASVspoof 2017 and the ASVspoof 2019

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

The authors would like to thank the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), the Fonds de recherche du Québec - Nature et Technologies (FRQNT) and the Natural Sciences and Engineering Research Council of Canada (NSERC, grant RGPIN-2019-05381) for their support. Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the NSERC.

References (46)

  • Chung, J., Nagrani, A., Zisserman, A., 2018. Voxceleb2: deep speaker recognition....
  • Delgado, H., 2018. ASVspoof 2017 version 2.0: meta-data analysis and baseline enhancements.
  • Dinkel, H., 2017. End-to-end spoofing detection with raw waveform CLDNNs. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
  • Garofolo, J., 1993. TIMIT Acoustic Phonetic Continuous Speech Corpus.
  • Gaubitch, N., 2011. Single-microphone blind channel identification in speech using spectrum classification. 2011 19th European Signal Processing Conference.
  • Gaubitch, N., et al., 2013. Blind channel magnitude response estimation in speech using spectrum classification. IEEE Trans. Audio Speech Lang. Process.
  • Glorot, X., et al., 2010. Understanding the difficulty of training deep feedforward neural networks. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics.
  • Hanilçi, C., 2017. Features and classifiers for replay spoofing attack detection. 10th International Conference on Electrical and Electronics Engineering (ELECO), 2017.
  • He, K., 2016. Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • He, K., 2016. Identity mappings in deep residual networks. European Conference on Computer Vision.
  • Hermansky, H., et al., 1994. RASTA processing of speech. IEEE Trans. Speech Audio Process.
  • ASVspoof 2019: Automatic speaker verification spoofing and countermeasures challenge evaluation plan, 2019....
  • ITU-T, 2001. Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality...