Abstract
Voice conversion and speech synthesis techniques pose a threat to current automatic speaker verification systems. To prevent such spoofing attacks, choosing an appropriate classifier for learning relevant information from speech features is therefore an important issue. In this paper, a GRU-SVM model for synthetic speech detection is proposed. A Gated Recurrent Unit (GRU) neural network learns the features; the GRU overcomes the vanishing- and exploding-gradient problems that traditional Recurrent Neural Networks (RNNs) suffer from when learning temporal dependencies. A Support Vector Machine (SVM) then performs regression ahead of the softmax layer used for classification, and this SVM stage shows excellent performance in terms of classification ability and gradient descent. Through extensive verification and analysis, we also identify the optimal speech feature extraction method and apply it when training the classifier. Experimental results show that the proposed GRU-SVM model attains higher prediction accuracy on the data sets, achieving an average detection rate of 99.63% on our development database. In addition, the proposed method effectively improves the learning ability of the model.
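For concreteness, the pipeline the abstract describes can be sketched as follows. This is a minimal sketch, not the authors' implementation: PyTorch, the MFCC-style input dimension, the layer sizes, and the hinge-loss (soft-margin SVM) objective on the final GRU state are all assumptions made here for illustration; the paper itself pairs the SVM stage with a softmax layer, whereas the hinge-loss formulation below is one common way to realize an SVM head inside a neural network.

# Hypothetical GRU-SVM sketch: a GRU encodes frame-level speech features,
# and a linear layer trained with a soft-margin hinge loss acts as the SVM.
import torch
import torch.nn as nn

class GRUSVM(nn.Module):
    def __init__(self, n_feat=40, hidden=128, n_layers=2):
        super().__init__()
        # GRU learns temporal dependencies across feature frames.
        self.gru = nn.GRU(n_feat, hidden, n_layers, batch_first=True)
        # Single linear score; trained with hinge loss it behaves as a linear SVM.
        self.svm = nn.Linear(hidden, 1)

    def forward(self, x):                    # x: (batch, frames, n_feat)
        _, h = self.gru(x)                   # h: (n_layers, batch, hidden)
        return self.svm(h[-1]).squeeze(-1)   # one score per utterance

def hinge_loss(scores, labels, model, c=1e-3):
    # labels in {-1, +1}: genuine vs. synthetic speech (assumed encoding).
    margin = torch.clamp(1.0 - labels * scores, min=0.0).mean()
    # L2 penalty on the SVM weights gives the soft-margin objective.
    return margin + c * model.svm.weight.pow(2).sum()

# Toy training step on random "feature" batches, just to show the shapes.
model = GRUSVM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(8, 200, 40)                  # 8 utterances, 200 frames each
y = torch.randint(0, 2, (8,)).float() * 2 - 1
loss = hinge_loss(model(x), y, model)
opt.zero_grad(); loss.backward(); opt.step()

Taking only the final GRU hidden state gives a fixed-length utterance embedding regardless of duration, which is what lets a single SVM-style score decide genuine versus synthetic; the exact choice of feature extractor and hyperparameters would follow the paper's own verification experiments.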
Acknowledgement
This work is supported by the National Natural Science Foundation of China (NSFC) under Grants 61972269 and 61902263, the Fundamental Research Funds for the Central Universities under Grant No. YJ201881, and the Doctoral Innovation Fund Program of Southwest Jiaotong University under Grant No. DCX201824.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Huang, T., Wang, H., Chen, Y., He, P. (2020). GRU-SVM Model for Synthetic Speech Detection. In: Wang, H., Zhao, X., Shi, Y., Kim, H., Piva, A. (eds) Digital Forensics and Watermarking. IWDW 2019. Lecture Notes in Computer Science, vol 12022. Springer, Cham. https://doi.org/10.1007/978-3-030-43575-2_9
DOI: https://doi.org/10.1007/978-3-030-43575-2_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-43574-5
Online ISBN: 978-3-030-43575-2
eBook Packages: Computer Science, Computer Science (R0)