Abstract
State-of-the-art models for speech synthesis and voice conversion pose a serious threat to automatic speaker verification (ASV) systems. Indeed, it is difficult even for human listeners to perceive the subtle differences between bonafide speech and the spoofed speech produced by these models. The ASVspoof 2019 challenge, jointly launched by several world-leading research institutions, is the largest and most comprehensive challenge to date for spoofed speech identification. In this work, a countermeasure system for ASVspoof 2019 is proposed based on cepstral features and a deep capsule network. MFCC and CQCC features are extracted as the input to the proposed network, whose convolutional layers and routing strategy are specifically designed to distinguish bonafide speech from spoofed speech. Experimental results on the ASVspoof 2019 LA evaluation set show that the proposed deep capsule network improves on the baseline systems' t-DCF and EER scores by 31% and 37%, respectively.
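The routing strategy mentioned above builds on the dynamic routing-by-agreement algorithm of capsule networks. As a rough illustration only (the paper's exact layer sizes and routing design are not given in this abstract), the following NumPy sketch shows the generic squash non-linearity and routing loop; all shapes and iteration counts here are illustrative assumptions, not the authors' configuration:

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    # Capsule "squashing": short vectors shrink toward 0, long vectors
    # approach unit length, and the direction is preserved.
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, n_iters=3):
    # u_hat: prediction vectors from lower capsules, shape (n_in, n_out, dim)
    n_in, n_out, dim = u_hat.shape
    b = np.zeros((n_in, n_out))  # routing logits, initialized to zero
    for _ in range(n_iters):
        # coupling coefficients: softmax over the output-capsule axis
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)
        # weighted sum of predictions per output capsule, then squash
        s = (c[..., None] * u_hat).sum(axis=0)   # (n_out, dim)
        v = squash(s)
        # increase logits where prediction and output agree
        b = b + (u_hat * v[None]).sum(axis=-1)
    return v

# Toy usage: 8 input capsules routing to 2 output capsules of dimension 4
rng = np.random.default_rng(0)
v = dynamic_routing(rng.normal(size=(8, 2, 4)))
print(v.shape)                            # (2, 4)
print(bool(np.all(np.linalg.norm(v, axis=-1) < 1.0)))  # True: squashed lengths < 1
```

In a spoofing countermeasure, the final output capsules would typically correspond to the bonafide and spoofed classes, with the capsule length read as a class score.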
© 2022 Springer Nature Singapore Pte Ltd.
Cite this paper
Mao, T., Yan, D., Gong, Y., Wang, R. (2022). Identification of Synthetic Spoofed Speech with Deep Capsule Network. In: Cao, C., Zhang, Y., Hong, Y., Wang, D. (eds) Frontiers in Cyber Security. FCS 2021. Communications in Computer and Information Science, vol 1558. Springer, Singapore. https://doi.org/10.1007/978-981-19-0523-0_17
DOI: https://doi.org/10.1007/978-981-19-0523-0_17
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-0522-3
Online ISBN: 978-981-19-0523-0