Elsevier

Digital Signal Processing

Volume 72, January 2018, Pages 171-180

Data selection for i-vector based automatic speaker verification anti-spoofing

https://doi.org/10.1016/j.dsp.2017.10.010

Abstract

State-of-the-art i-vector based automatic speaker verification (ASV) systems achieve high performance, and voice has thus become one of the most important biometric modalities for person authentication. However, like other biometrics, ASV systems are highly vulnerable to spoofing attacks. Developing countermeasures that detect spoofing attacks therefore plays an important role in addressing concerns about the reliability of ASV systems. Recent studies have shown that a simple Gaussian mixture model (GMM) classifier outperforms i-vector countermeasures. In this study, we focus on improving the spoofing detection performance of the i-vector system using cosine and probabilistic linear discriminant analysis (PLDA) scoring. Experimental results on the ASVspoof 2015 database reveal that the data used to train the two key elements of the i-vector system, the universal background model (UBM) and the i-vector extractor (T-matrix), plays an important role in spoofing detection performance. We study how the type of data (genuine/human or spoofed) used to train these two elements affects spoofing detection performance. In particular, extracting i-vectors using a UBM trained on genuine (human) speech utterances and a T-matrix trained on both genuine and spoofed utterances leads to a 50% performance improvement in spoofing detection. With the proposed scheme, unlike in previous results, the i-vector countermeasure outperforms the GMM classifier. Finally, experimental results show that the recently proposed constant Q cepstral coefficients (CQCC) perform better than standard Mel-frequency cepstral coefficients (MFCC).

Introduction

Automatic speaker verification (ASV) is the task of verifying an identity claim given a speech signal [1]. With recent developments in ASV technology, in particular the introduction of the Gaussian mixture model with universal background model (GMM-UBM) [2] followed by joint factor analysis (JFA) [3], voice has become one of the most important biometric tools for person authentication [4].

While the performance of ASV systems has improved considerably in recent years, the reliability of ASV systems against spoofing attacks [5] (also known as presentation attacks) has become an important security concern, as it has for any other biometric person authentication system [6]. In a spoofing attack, an attacker masquerades as an enrolled target speaker in order to gain illegitimate access to the system.

ASV systems can be spoofed directly by four main types of attack [6]: impersonation [7], where an attacker mimics a target speaker's voice; replay [8], the presentation of a target speaker's pre-recorded speech signal; speech synthesis (SS) [9], synthesizing a speaker's voice from text input; and voice conversion (VC) [10], modifying a source speaker's voice so that it sounds like the target speaker's. Among these four types, impersonation is less likely to occur since it requires professional, skilled impersonators, whereas replay attacks are easy to mount since they require only a simple voice recorder (e.g. a smartphone). SS and VC attacks, in turn, have gained more attention for two reasons: first, both SS and VC techniques have improved significantly, so that high-quality speech signals can be generated from a limited amount of training data; second, freely available open-source SS and VC toolkits can easily be used by non-expert attackers. Although there are four main types of direct attack against ASV systems, we limit the scope of this research to SS and VC attacks, since each attack type generates its signals through a different process.

The vulnerability of ASV systems to SS attacks was first studied more than a decade ago [11], [12]. For example, the authors of [11] reported the performance degradation of a text-prompted speaker verification system against synthetic speech signals generated by SS. In [12], a GMM-based speaker verification system and its sensitivity to speech synthesis were analyzed. More recent studies have independently confirmed the vulnerability of state-of-the-art speaker recognition systems [13], [14], [15], [16], [17].

To cope with the threat of spoofing, research has mostly focused on developing new techniques for anti-spoofing, that is, determining whether a speech signal is genuine or was generated by an SS or VC technique [6]. In 2015, the ASVspoof 2015: Automatic Speaker Verification Spoofing and Countermeasures Challenge was organized to study spoofing detection for both known and unknown attacks. One aim of the challenge was to provide a common dataset and evaluation metric for stand-alone anti-spoofing, so that results reported by different sites are comparable.

A spoofing countermeasure is in fact a pattern recognition system consisting of a front-end and a back-end. The front-end extracts features from the speech signal that capture the artefacts introduced by SS and VC systems. The back-end first uses the extracted features to model the competing hypotheses in the training phase; these models are then used to score features extracted from a test speech signal in order to decide whether the signal is genuine or spoofed. Most previous studies focused on the front-end of a spoofing detection system, paired with a simple Gaussian mixture model (GMM) back-end [18], [19], [20], [21], [22], [23], [24]. For example, in [18], standard Mel-frequency cepstral coefficients (MFCC), modified group delay (MGD) and cosine phase features were compared for the detection of spoofed speech signals generated by GMM-based and unit-selection techniques, and cosine phase features were reported to yield better performance. In [19], seventeen different magnitude- and phase-based feature extraction methods were compared for anti-spoofing, and simple linear filterbank features were reported to yield promising results for unknown types of attacks. Seven different features were compared in [21], and MFCC was found to perform better than MGD-based features. In a more recent study [24], constant Q transform based cepstral coefficients (CQCC) were proposed for anti-spoofing, with encouraging performance on both known and unknown attacks.

The classifier (back-end) of anti-spoofing systems has been studied less than the front-end. In [25], different classifiers were compared for spoofing detection, and a simple GMM trained with the maximum likelihood (ML) criterion was found to outperform more sophisticated support vector machine (SVM) and state-of-the-art i-vector cosine distance scoring (CDS) classifiers. In [26], GMM and i-vector based probabilistic linear discriminant analysis (PLDA) classifiers were compared, and the GMM was found to outperform the i-vector countermeasure. Since anti-spoofing systems are developed to make ASV systems reliable, the ultimate purpose is to integrate countermeasures with ASV systems. It would therefore be beneficial if the ASV and anti-spoofing systems used the same back-end. Although i-vector based systems yield encouraging speaker recognition performance, their spoofing detection performance is relatively poor in comparison to other classifiers such as GMM [25], [26], [27]. This implies that integrated ASV and anti-spoofing systems require two different detection schemes (one for anti-spoofing and another for ASV), which is not straightforward [28]. This study therefore focuses on improving the performance of the i-vector based spoofing detection system. To this end, we aim to find the optimal type of data (genuine or spoofed) for training the two important components of the i-vector recognizer, namely the universal background model (UBM) and the i-vector extractor (T-matrix), for anti-spoofing. Although the effect of these two components has been studied extensively for speaker recognition, their effect on spoofing detection performance remains unknown. Thus, in this research, we systematically study the effect of the data type used in the i-vector system on spoofing detection.
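The cosine distance scoring (CDS) back-end mentioned above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each class (genuine, spoofed) is represented by the mean of its training i-vectors and that the detection score is the difference between the two cosine similarities. The function names and the toy two-dimensional vectors are hypothetical and are not taken from the systems in [25], [26].

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cds_score(test_ivector, genuine_ivectors, spoofed_ivectors):
    """Assumed scoring rule: similarity to the genuine class mean minus
    similarity to the spoofed class mean (positive favours 'genuine')."""
    genuine_mean = mean_vector(genuine_ivectors)
    spoofed_mean = mean_vector(spoofed_ivectors)
    return (cosine_similarity(test_ivector, genuine_mean)
            - cosine_similarity(test_ivector, spoofed_mean))


# Toy 2-dimensional "i-vectors" (hypothetical values, real ones are
# typically a few hundred dimensions).
genuine_train = [[1.0, 0.0], [1.0, 0.2]]
spoofed_train = [[0.0, 1.0], [0.2, 1.0]]
score = cds_score([0.9, 0.1], genuine_train, spoofed_train)
```

Because cosine similarity ignores vector length, a score of this form depends only on the direction of the test i-vector relative to the two class means.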

Section snippets

Related work

Although most studies have focused on the front-end, that is, on selecting appropriate feature representations that capture the imperfections of SS and VC techniques while employing a simple GMM classifier, some studies have employed i-vectors for spoofing detection. For example, in [29], i-vectors extracted from three different features (MFCC, Mel-frequency principal components and cosine phase features) were concatenated and used as the input features of a support vector machine (SVM) classifier. Similarly, in [30],

Spoofing detection

Given a speech signal s, spoofing detection, i.e. determining whether s is genuine or spoofed speech, can be defined as a test between two hypotheses:

  • H0: s is genuine speech,

  • H1: s is spoofed speech.

The decision between the two hypotheses can be made based on the log-likelihood ratio (LLR) score:

$$\Lambda(s) = \log p(s \mid H_0) - \log p(s \mid H_1). \tag{1}$$
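As a rough illustration of the LLR score, the sketch below scores a sequence of feature vectors against two class models. It makes several simplifying assumptions that are not from the paper: each hypothesis is modelled by a single diagonal-covariance Gaussian (a stand-in for a full GMM), frames are treated as independent, the score is averaged over frames, and the decision threshold is zero. All names and values are hypothetical.

```python
import math


def gaussian_loglik(x, mean, var):
    """Log-density of one feature vector under a diagonal-covariance Gaussian."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def llr_score(frames, genuine_model, spoofed_model):
    """Average per-frame log-likelihood ratio between H0 (genuine) and
    H1 (spoofed); each model is a (mean, variance) pair."""
    total = 0.0
    for x in frames:
        total += gaussian_loglik(x, *genuine_model) - gaussian_loglik(x, *spoofed_model)
    return total / len(frames)


# Toy models and features (hypothetical values).
genuine_model = ([0.0, 0.0], [1.0, 1.0])
spoofed_model = ([2.0, 2.0], [1.0, 1.0])
frames = [[0.1, -0.2], [0.3, 0.0]]

score = llr_score(frames, genuine_model, spoofed_model)
# Assumed threshold of 0: a positive score favours H0 (genuine).
decision = "genuine" if score > 0.0 else "spoofed"
```

In practice the threshold would be tuned on held-out data rather than fixed at zero; zero is used here only to keep the example short.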

In order to compute (1), the speech signal s is usually characterized by the feature vectors $X = \{x_1, x_2, \ldots, x_T\}$ extracted from the signal, and the hypotheses H0

Database

In the experiments, we used the ASVspoof 2015 database [41], which consists of genuine (human) and synthetic speech signals generated by various speech synthesis (SS) and voice conversion (VC) techniques. The database is partitioned into three disjoint subsets: training, development and evaluation. The statistics of the database (number of speakers and number of speech signals) are summarized in Table 2. The training set includes 3750 genuine and 12625 spoofed speech signals from 25 speakers.

Baseline results

The average equal error rates (EERs, in %) obtained with the baseline systems on the development set are reported in Table 4. The GMM classifier outperforms the cosine distance scoring (CDS) and PLDA systems for MFCC features. Similar results have recently been reported in [25], [26], [27]. For MFCC features, using CDS instead of the standard GMM classifier raises the EER from 0.660% to 1.328%, a relative increase of approximately 101%. The performance further reduces when

Conclusion

In this paper, we investigated the effect of the data used to train the hyperparameters (UBM and T-matrix) of an i-vector based spoofing detection system. In the experiments, consistent with results reported in previous studies [25], [26], [27], we first observed that a simple GMM classifier yields an EER almost two times lower than that of the i-vector system (2.391% vs. 4.624%). Hence, improving the spoofing detection performance of the i-vector countermeasure was the goal of this study. We experimented with different

Acknowledgement

This study was supported by the Scientific and Technological Research Council of Turkey (TÜBİTAK) under the project no. 115E916. The author would like to thank the anonymous reviewers and the handling editor for their valuable comments.

References (45)

  • N.K. Ratha et al. Enhancing security and privacy in biometrics-based authentication systems. IBM Syst. J. (2001)

  • R.G. Hautamäki et al. Automatic versus human speaker verification: the case of voice mimicry. Speech Commun. (2015)

  • K. Tokuda et al. Speech synthesis based on hidden Markov models. Proc. IEEE (2013)

  • Y. Stylianou et al. Continuous probabilistic transform for voice conversion. IEEE Trans. Speech Audio Process. (1998)

  • T. Masuko et al. On the security of HMM-based speaker verification systems against imposture using synthetic speech

  • B.L. Pellom et al. An experimental study of speaker verification sensitivity to computer voice-altered imposters

  • D. Matrouf et al. Effect of speech transformation on impostor acceptance

  • J. Bonastre et al. Artificial impostor voice transformation effects on false acceptance rates

  • Q. Jin et al. Is voice transformation a threat to speaker identification?

  • P.L.D. Leon et al. Revisiting the security of speaker verification systems against imposture using synthetic speech

  • F. Alegre et al. On the vulnerability of automatic speaker recognition to spoofing attacks with artificial signals

  • Z. Wu et al. Detecting converted speech and natural speech for anti-spoofing attack in speaker recognition

Cemal Hanilçi received B.Sc., M.Sc. and Ph.D. degrees from Uludağ University, in 2005, 2007 and 2013, respectively, all in Electronics Engineering. From March to December 2011, he was a visiting researcher at the School of Computing, University of Eastern Finland. From 2014 to 2015, he was a post-doctoral researcher at the same school. Currently he is an Assistant Professor at the Bursa Technical University, Department of Electrical & Electronics Engineering, in Turkey. His research interests include speech processing, speaker recognition, anti-spoofing and audio forensics.
