Abstract
Voice conversion and speech synthesis techniques pose a threat to current automatic speaker verification systems. To prevent such spoofing attacks, choosing an appropriate classifier for learning relevant information from speech features is therefore an important issue. In this paper, a GRU-SVM model for synthetic speech detection is proposed. A Gated Recurrent Unit (GRU) neural network learns the features; the GRU overcomes the vanishing- and exploding-gradient problems that traditional Recurrent Neural Networks (RNNs) suffer from when learning temporal dependencies. A Support Vector Machine (SVM) then performs regression ahead of the softmax layer used for classification, and this SVM stage shows excellent performance in terms of classification ability and gradient descent. Through extensive verification and analysis, we also identify the optimal speech feature extraction method and apply it when training the classifier. Experimental results show that the proposed GRU-SVM model attains higher prediction accuracy on the data sets, achieving an average detection rate of 99.63% on our development database. In addition, the proposed method effectively improves the learning ability of the model.
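For concreteness, the pipeline the abstract describes can be sketched as follows. This is a minimal sketch, not the authors' implementation: PyTorch, the MFCC-style input dimension, the layer sizes, and the hinge-loss (soft-margin SVM) objective on the final GRU state are all assumptions made here for illustration; the paper itself pairs the SVM stage with a softmax layer, whereas the hinge-loss formulation below is one common way to realize an SVM head inside a neural network.

# Hypothetical GRU-SVM sketch: a GRU encodes frame-level speech features,
# and a linear layer trained with a soft-margin hinge loss acts as the SVM.
import torch
import torch.nn as nn

class GRUSVM(nn.Module):
    def __init__(self, n_feat=40, hidden=128, n_layers=2):
        super().__init__()
        # GRU learns temporal dependencies across feature frames.
        self.gru = nn.GRU(n_feat, hidden, n_layers, batch_first=True)
        # Single linear score; trained with hinge loss it behaves as a linear SVM.
        self.svm = nn.Linear(hidden, 1)

    def forward(self, x):                    # x: (batch, frames, n_feat)
        _, h = self.gru(x)                   # h: (n_layers, batch, hidden)
        return self.svm(h[-1]).squeeze(-1)   # one score per utterance

def hinge_loss(scores, labels, model, c=1e-3):
    # labels in {-1, +1}: genuine vs. synthetic speech (assumed encoding).
    margin = torch.clamp(1.0 - labels * scores, min=0.0).mean()
    # L2 penalty on the SVM weights gives the soft-margin objective.
    return margin + c * model.svm.weight.pow(2).sum()

# Toy training step on random "feature" batches, just to show the shapes.
model = GRUSVM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(8, 200, 40)                  # 8 utterances, 200 frames each
y = torch.randint(0, 2, (8,)).float() * 2 - 1
loss = hinge_loss(model(x), y, model)
opt.zero_grad(); loss.backward(); opt.step()

Taking only the final GRU hidden state gives a fixed-length utterance embedding regardless of duration, which is what lets a single SVM-style score decide genuine versus synthetic; the exact choice of feature extractor and hyperparameters would follow the paper's own verification experiments.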
Acknowledgement
This work is supported by the National Natural Science Foundation of China (NSFC) under Grants 61972269 and 61902263, the Fundamental Research Funds for the Central Universities under Grant No. YJ201881, and the Doctoral Innovation Fund Program of Southwest Jiaotong University under Grant No. DCX201824.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Huang, T., Wang, H., Chen, Y., He, P. (2020). GRU-SVM Model for Synthetic Speech Detection. In: Wang, H., Zhao, X., Shi, Y., Kim, H., Piva, A. (eds) Digital Forensics and Watermarking. IWDW 2019. Lecture Notes in Computer Science, vol 12022. Springer, Cham. https://doi.org/10.1007/978-3-030-43575-2_9
DOI: https://doi.org/10.1007/978-3-030-43575-2_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-43574-5
Online ISBN: 978-3-030-43575-2
eBook Packages: Computer Science, Computer Science (R0)