1 Introduction

Over the past few years, biometric technology has evolved tremendously. Biometric authentication provides secure access to a system using human biological data such as DNA, facial features, fingerprints, iris patterns, and voice. Because biometric data is unique to every individual, it is considered both secure and easy to use. Face and fingerprint-based biometric systems have become so advanced that they are widely used in IoT devices and smartphones. Voice biometrics has great potential because it is convenient, low cost, and readily available. Applications of voice biometrics include smartphone user authentication, Interactive Voice Response (IVR), and voice authentication in banking.

Speech-based biometric systems are vulnerable to a variety of attacks such as impersonation, voice synthesis, voice conversion, and replay attacks [4]. In an impersonation attack, the imposter tries to mimic the voice of the authentic user to gain access to the system. Voice conversion systems are capable of converting utterances spoken by person A into the voice of person B and pose a grave threat to speaker verification systems. A voice synthesis system takes text as input and generates audio corresponding to the utterances of that text. The replay attack is the most popular attack because it does not require any signal processing skills: the attacker only has to capture the voice of the authentic user with a portable recording device and play it back to the speaker verification system.

The main purpose of this research is to investigate different ways to detect audio replay attacks. The ASVspoof 2017 challenge is focused solely on replay audio detection, and a dataset containing genuine and replay audio was provided as part of the challenge [6]. The audio replay attack detection problem thus reduces to a binary classification problem of labeling an audio recording as genuine or replayed. In the ASVspoof 2015 challenge, which was mainly focused on detecting voice conversion and speech synthesis, constant Q cepstral coefficients (CQCC) were used by many models as they provide high resolution in the lower frequency spectrum [14, 17]. Hence, the baseline system of the ASVspoof 2017 challenge also uses CQCC features, which give good results in replay attack detection.

The proposed methodology consists of three major steps: removal of voiced segments, feature extraction, and classification. The objective is to find discriminative features that help the model distinguish genuine from replay audio. Replay audio generally contains more noise, as it accumulates environment noise from multiple channels, and reverberation effects are also present due to reflection of sound. The removal of voiced segments results in a much smaller feature vector, which makes it easier to train deep neural networks. In the next step, the log-magnitude FFT power spectrum is extracted and used as a feature to train a CNN classifier that labels the audio as genuine or replay. The ASVspoof 2017 dataset is used for training and evaluating the model [6].

The paper is organized as follows: Sect. 2 reviews related work on the task of replay attack detection. Section 3 describes the dataset. Section 4 provides detailed information about the proposed methodology and the CNN model used to solve the problem. Section 5 gives the results and analysis of the proposed system. Section 6 concludes the paper and discusses possible directions for future research.

2 Literature Survey

In past years, much effort has been put into detecting audio replay attacks, and several features for this task have been identified. Features such as spectral ratio, modulation index, and low-frequency content have been used to detect replay attacks [15]. After feature extraction, a feature vector is formed and a Support Vector Machine (SVM) classifier is used to classify genuine and recorded voice. At that time, no dataset for replay attack detection was publicly available, so the approach could not be validated properly.

The ASVspoof 2017 challenge was conducted in order to develop countermeasures against audio replay attacks [6]. The baseline system uses constant Q cepstral coefficients (CQCC) for feature representation, which rely on the constant Q transform (CQT) of the audio signal [14]. CQCC features proved to be a great success in the ASVspoof 2015 challenge and were therefore adopted as the baseline feature for ASVspoof 2017. A Gaussian Mixture Model (GMM) trained with the expectation maximization (EM) algorithm is used as the classifier to label the audio as genuine or replay. The idea is to find features that discriminate between genuine and replay audio. Replay audio goes through the analog-to-digital conversion process twice; as a result, channel noise and other recording device artifacts get introduced into the audio. This distortion and noise can be used as a feature for the classification problem. CQCC features increase the resolution of the lower frequency spectrum, where most of the voice data and noise are present, which helps in identifying the artifacts and distortions in the audio. The GMM by itself was unable to learn all the discriminative information in the CQCC features and gave a high equal error rate (EER) [13] on the evaluation dataset. The EER is the operating point at which the false acceptance rate equals the false rejection rate [13]; a system with a lower EER is considered better.

Lavrentyeva et al. [7] proposed a deep learning approach for audio replay attack detection. The authors used truncated normalized Fast Fourier Transform (FFT) spectrograms as features to train a deep learning architecture known as the Light CNN (LCNN) [16]. The LCNN uses max-feature-map (MFM) activations instead of ReLU, which take the element-wise maximum over two convolutional feature maps and thereby preserve the relevant information. This work was considered the state of the art as it gives the lowest EER of 4.53 on the development dataset and 7.37 on the evaluation dataset. Since the feature vectors in this approach are large, training the deep learning model takes more time and requires high-end machines for training and evaluation. Zhuxin Chen et al. [2] investigated the use of recurrent neural networks with gating mechanisms such as the long short-term memory (LSTM) unit and the gated recurrent unit (GRU). Features such as MFCC, CQCC, and filter bank energy coefficients were studied, and the results show that filter bank energy coefficients give better results for the task of audio replay attack detection. The problem with this approach is that the model tends to overfit as the amount of training data is increased.
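For illustration, a minimal sketch of the MFM operation mentioned above is given below; it is written with TensorFlow purely as an assumption, since [7] and [16] do not prescribe a framework, and it is a generic rendering of the idea rather than the authors' implementation.

```python
import tensorflow as tf

def max_feature_map(x):
    """Max-Feature-Map (MFM) activation in the spirit of the LCNN [16].

    The channel dimension is split into two halves and the element-wise
    maximum of the two halves is returned, halving the number of feature
    maps. Generic sketch only, not the exact implementation of [7].
    """
    a, b = tf.split(x, num_or_size_splits=2, axis=-1)  # split channels in two
    return tf.maximum(a, b)
```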

To avoid overfitting caused by the presence of similar voiced segments in both genuine and replay audio, M. S. Saranya et al. [12] proposed using the non-voiced segments of the audio to detect replay attacks. The authors suggest that it is easier to detect reverberation and channel conditions using features extracted from non-voiced segments. Three different GMMs were trained using CQCC, MFCC, and Mel-Filterbank-Slope (MFS) features, and a vote among them decides whether the audio was replayed or not. This technique gives an EER of 2.99 on the development dataset and 16.39 on the evaluation dataset, but training the three GMM models takes a long time.

3 Data Set

The ASVspoof 2017 database version 2.0 is used in all experiments [6]. It is a subset of the RedDots dataset [9] and contains three subsets: training, development, and evaluation. The training subset is used to train the model while the development subset is used simultaneously to check the model performance and adjust the weights. The evaluation subset contains new combinations of environment, recording device, and playback device that are not present in the training or development subsets; it therefore gives a better picture of the model's performance. The replayed utterances are recorded in 26 different environments labeled E01 to E26, played back through 26 different playback devices labeled P01 to P26, and recorded with 25 recording devices labeled R01 to R25. The sampling frequency of every recording is 16 kHz. The training subset contains 1508 genuine and 1508 replay audio files, the development subset contains 760 genuine and 950 replay audio files, and the evaluation subset contains 1298 genuine and 12922 replay audio files.

The training subset uses environments E03 and E21, playback devices P01 and P02, and recording device R01. The development subset is collected from environments E03, E04, E05, E06, E16, and E18, with playback devices P07, P05, P09, P06, P01, and P08, and recording devices R04, R01, R02, R03, R07, R05, and R06. Replay attack detection systems are affected by the environmental conditions and by the equipment used for playback and recording: high-quality recording and playback devices make it harder to detect a replay attack, whereas a noisy environment makes detection easier. In the ASVspoof 2017 dataset the environments, playback devices, and recording devices are ranked using three colors, green, yellow, and red; those marked green pose less threat to a replay attack detection system, whereas those marked red pose a greater threat.

Fig. 1. Proposed methodology

4 Proposed Methodology

The proposed method for replay audio detection is shown in Fig. 1 and consists of three main steps. The first step is pre-processing of the audio signal, where the silence, unvoiced, and voiced regions of the speech signal are identified using a voice activity detection algorithm [11] and the silence and unvoiced regions are concatenated together. The second step is feature extraction from the concatenated signal, where the log-magnitude FFT power spectrum is extracted as shown in Fig. 3. In the last step, a CNN model is used as a classifier to label the audio as genuine or replay [8].

4.1 Pre-Processing

The silence and unvoiced regions are likely to carry more information about reverberation and channel noise, which helps in distinguishing genuine from replay audio. Therefore the voiced regions are removed and the silence and unvoiced regions are combined. Usually, the silence and unvoiced regions have low energy compared to the voiced regions, which have higher energy. The energy threshold that separates the voiced regions from the other regions is given by Eq. 1.

$$\begin{aligned} \text{Threshold} = a \times \text{average energy} \end{aligned}$$
(1)

where a is a constant that varies from 0 to 1. Energy values above the threshold are considered voiced regions and are therefore removed. The constant a is set to 0.15 after analysis.

The spectrograms of genuine and replay audio for the utterance “What sparked never boils” are shown in Fig. 2(a) and (b) respectively. In the replay spectrogram, the effect of reverberation on the speech signal can be seen: there is a gradual degradation of energy around the voiced segments of the speech, and there is more noise than in the genuine audio. Therefore the voiced segments are removed using a voice activity detection (VAD) algorithm [11], and the remaining audio segments are concatenated together and used for the feature extraction process.
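A minimal sketch of this energy-based removal of voiced frames is shown below; it assumes the soundfile package for reading audio and uses non-overlapping 256-sample frames, so the frame size, the function name, and the simple thresholding are illustrative choices rather than the exact VAD algorithm of [11].

```python
import numpy as np
import soundfile as sf  # assumed here only for reading the wav file

def remove_voiced_segments(wav_path, a=0.15, frame_len=256):
    """Keep silence/unvoiced frames, drop voiced (high-energy) frames.

    Follows Eq. 1: frames whose short-time energy exceeds
    a * (average frame energy) are treated as voiced and removed.
    Frame length and the non-overlapping framing are assumptions.
    """
    signal, sr = sf.read(wav_path)          # ASVspoof 2017 audio is 16 kHz

    # Slice the signal into non-overlapping frames.
    n_frames = len(signal) // frame_len
    if n_frames == 0:
        return signal
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)

    # Short-time energy per frame and the Eq. 1 threshold.
    energy = np.sum(frames ** 2, axis=1)
    threshold = a * energy.mean()

    # Concatenate the low-energy (silence + unvoiced) frames.
    kept = frames[energy <= threshold]
    return kept.reshape(-1) if kept.size else signal
```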

Fig. 2. (a) and (b) are the spectrograms of genuine and replay audio for the utterance “What sparked never boils”

Fig. 3. Feature extraction

4.2 FFT Spectrogram Generation

In this phase, the audio signal is split into frames and a Hamming window of size 256 is applied to each frame. An n-point Discrete Fourier Transform (DFT) of the speech signal is then computed using a Fast Fourier Transform (FFT) algorithm. Audio signals with fewer frames are padded with zeros before computing the FFT, and in the end FFT spectrograms of size 512*256*1 are obtained.
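A possible implementation of this step is sketched below; the 256-sample Hamming window and the 512-point FFT follow the text, while the non-overlapping hop, the log floor, and the padding/truncation to exactly 256 frames are assumptions made for illustration.

```python
import numpy as np

def log_fft_spectrogram(signal, frame_len=256, n_fft=512, n_frames=256):
    """Fixed-size log-magnitude FFT power spectrogram (Sect. 4.2 sketch).

    Each frame is Hamming-windowed and transformed with an n_fft-point
    FFT; short signals are zero-padded so the output is always 512*256*1.
    """
    if len(signal) < frame_len:
        signal = np.pad(signal, (0, frame_len - len(signal)))

    window = np.hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len] * window
        spectrum = np.fft.fft(frame, n=n_fft)                 # n-point DFT via FFT
        frames.append(np.log(np.abs(spectrum) ** 2 + 1e-10))  # log power

    spec = np.stack(frames, axis=1)                           # (512, num_frames)

    # Zero-pad (or truncate) to 256 frames and add a channel axis.
    out = np.zeros((n_fft, n_frames))
    width = min(n_frames, spec.shape[1])
    out[:, :width] = spec[:, :width]
    return out[..., np.newaxis]                               # (512, 256, 1)
```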

4.3 Convolutional Neural Network

A convolutional neural network (CNN) [8] is used as the classifier to label the audio as genuine or replay. CNNs are good at learning features for classification problems and give better performance than traditional approaches in which features have to be extracted manually. The model has five convolutional layers. The first convolutional layer has 32 neurons with filters of size 7*7, and the second has 64 neurons with filters of size 5*5. For the remaining three convolutional layers, the number of neurons and the filter size are fixed at 96 and 3*3 respectively. Max-pooling layers of size 2*2 are used, and their output is flattened and passed through two fully connected layers with 4096 neurons each. The last layer of the network is a softmax layer which classifies the audio as genuine or replay.
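A minimal Keras sketch of this architecture is given below; the layer counts, filter sizes, and neuron counts follow the description above, while the 'same' padding, the placement of the pooling layers, and the ReLU activations in the convolutional layers are assumptions.

```python
from tensorflow.keras import layers, models

def build_cnn(input_shape=(512, 256, 1)):
    """CNN classifier sketch for Sect. 4.3.

    Five convolutional layers (32@7x7, 64@5x5, three of 96@3x3), 2x2
    max-pooling, two 4096-unit fully connected layers and a two-way
    softmax output (genuine vs. replay). Keras' default Glorot (Xavier)
    initializer matches the weight initialization described in the text.
    """
    model = models.Sequential([
        layers.Conv2D(32, (7, 7), activation='relu', padding='same',
                      input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (5, 5), activation='relu', padding='same'),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(96, (3, 3), activation='relu', padding='same'),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(96, (3, 3), activation='relu', padding='same'),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(96, (3, 3), activation='relu', padding='same'),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(4096, activation='relu'),
        layers.Dropout(0.2),          # dropout probability from Sect. 4.3
        layers.Dense(4096, activation='relu'),
        layers.Dropout(0.2),
        layers.Dense(2, activation='softmax'),
    ])
    return model
```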

To avoid overfitting, dropout with a probability of 0.2 is used during training. Dropout ignores random neurons with a certain probability p, so that each neuron becomes less dependent on the neurons it is connected to; in general, dropout helps the network generalize better and improves its performance. The Adam optimizer [5] is used with a learning rate of 0.0001 and a momentum of 0.9. The learning rate is halved whenever the training loss increases, which makes the network converge faster and prevents overfitting. The network weights are initialized with the Xavier weight initialization method. The model is trained for 20 epochs, and early stopping is used to halt training if the training loss increases. The best model is saved every time a better accuracy is obtained on the development dataset.
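The training setup described above could be wired up in Keras roughly as follows, reusing the build_cnn sketch from the previous block; the dummy arrays stand in for the spectrograms and labels of Sects. 4.1-4.2, and the batch size and callback patience values are assumptions rather than values reported here.

```python
import numpy as np
from tensorflow.keras import optimizers, callbacks

# Dummy data standing in for the real spectrograms/labels (assumption).
x_train = np.random.rand(8, 512, 256, 1).astype('float32')
y_train = np.random.randint(0, 2, size=8)
x_dev = np.random.rand(4, 512, 256, 1).astype('float32')
y_dev = np.random.randint(0, 2, size=4)

model = build_cnn()  # from the architecture sketch above
model.compile(
    optimizer=optimizers.Adam(learning_rate=1e-4, beta_1=0.9),  # lr 0.0001, momentum 0.9
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy'],
)

cbs = [
    # Halve the learning rate when the training loss stops improving.
    callbacks.ReduceLROnPlateau(monitor='loss', factor=0.5, patience=1),
    # Stop early if the training loss keeps increasing.
    callbacks.EarlyStopping(monitor='loss', patience=3),
    # Keep the model with the best development-set accuracy.
    callbacks.ModelCheckpoint('best_model.h5', monitor='val_accuracy',
                              save_best_only=True),
]

model.fit(x_train, y_train, validation_data=(x_dev, y_dev),
          epochs=20, batch_size=32, callbacks=cbs)
```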

Table 1. Results

5 Results and Analysis

The equal error rate (EER) [13] is used to evaluate the model performance. The EER is a standard evaluation measure for biometric systems: it is the operating point at which the false acceptance and false rejection rates are equal, and the lower the EER the more accurate the system. To find the EER on the development subset, the system is first trained on the training subset and the posterior probabilities of the audio being genuine or replay are obtained. The scores are generated by taking the difference between the two posterior probabilities, and the BOSARIS Toolkit is used to estimate the EER from these scores [1]. For the EER on the evaluation subset, the model is trained on both the training and development subsets, because the training subset alone covers fewer environment, recording device, and playback device configurations.
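For illustration, the sketch below generates scores as the difference of the two posteriors and computes a simple EER estimate directly from the score distribution; it is only a rough stand-in for the BOSARIS Toolkit [1], and the label convention (1 for genuine, 0 for replay) and posterior ordering are assumptions.

```python
import numpy as np

def scores_and_eer(probs, labels):
    """Score generation and a simple EER estimate.

    `probs` is the (N, 2) softmax output ordered (genuine, replay); the
    score is the genuine posterior minus the replay posterior. The EER is
    taken at the threshold where false acceptance and false rejection
    rates are closest to equal.
    """
    scores = probs[:, 0] - probs[:, 1]
    genuine = scores[labels == 1]
    replay = scores[labels == 0]

    # Search candidate thresholds for the point where FAR equals FRR.
    best_t = min(np.sort(scores),
                 key=lambda t: abs(np.mean(replay >= t) - np.mean(genuine < t)))
    far = np.mean(replay >= best_t)   # replay wrongly accepted as genuine
    frr = np.mean(genuine < best_t)   # genuine wrongly rejected
    return scores, (far + frr) / 2.0
```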

The evaluation dataset contains many unknown environments and different playback and recording devices, which makes it difficult to predict whether an audio recording is a replay or not. Hence the model is trained on both the training and development subsets so that it can learn about more environments, recording devices, and playback devices. Since the feature vector is small due to the removal of voiced speech segments, the network is easier to train. In the experiments, a simple three-layer DNN [3] trained on MFCC [10] features was also considered; it had 512 neurons in each layer and a softmax layer at the end. Its performance on the evaluation dataset was poor, mainly because of overfitting. The results are shown in Table 1: the proposed CNN model achieves an EER of 5.62% on the development dataset and 12.47% on the evaluation dataset.

6 Conclusion and Future Work

In this paper, we studied the applicability of a convolutional neural network (CNN) using features extracted from the silence and unvoiced segments of the speech signal for audio replay attack detection. The silence and unvoiced regions contain information about the channel and about the reverberation of the audio signal caused by the environment. The spectrograms of genuine and replay audio are nearly identical because of their similar voiced regions, which makes it difficult for a CNN to learn the features needed to discriminate genuine from replay audio. Removing the voiced speech segments also reduces the feature size, which makes it easier to train the CNN. The proposed approach is evaluated on the ASVspoof 2017 dataset and, as the results show, it outperforms the baseline system by 5.21% on the development dataset and by 16.18% on the evaluation dataset. In the future, different features will be explored together with more sophisticated and newer deep learning architectures.