Abstract
State-of-the-art models for speech synthesis and voice conversion pose a serious threat to automatic speaker verification (ASV) systems. Indeed, it is difficult even for human listeners to perceive the subtle differences between bonafide speech and the spoofed speech produced by these models. The ASVspoof 2019 challenge, jointly launched by several world-leading research institutions, is the largest and most comprehensive challenge to date for spoofed speech identification. In this work, a countermeasure system for ASVspoof 2019 is proposed based on cepstral features and a deep capsule network. MFCC and CQCC features are extracted as the input to the proposed network, whose convolutional layers and routing strategy are specifically designed to distinguish bonafide speech from spoofed speech. Experimental results on the ASVspoof 2019 LA evaluation set show that the proposed deep capsule network improves on the baseline systems' t-DCF and EER scores by 31% and 37%, respectively.
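The routing strategy mentioned above builds on the dynamic routing-by-agreement algorithm of capsule networks. As a rough illustration only (the paper's exact layer sizes and routing design are not given in this abstract), the following NumPy sketch shows the generic squash non-linearity and routing loop; all shapes and iteration counts here are illustrative assumptions, not the authors' configuration:

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    # Capsule "squashing": short vectors shrink toward 0, long vectors
    # approach unit length, and the direction is preserved.
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, n_iters=3):
    # u_hat: prediction vectors from lower capsules, shape (n_in, n_out, dim)
    n_in, n_out, dim = u_hat.shape
    b = np.zeros((n_in, n_out))  # routing logits, initialized to zero
    for _ in range(n_iters):
        # coupling coefficients: softmax over the output-capsule axis
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)
        # weighted sum of predictions per output capsule, then squash
        s = (c[..., None] * u_hat).sum(axis=0)   # (n_out, dim)
        v = squash(s)
        # increase logits where prediction and output agree
        b = b + (u_hat * v[None]).sum(axis=-1)
    return v

# Toy usage: 8 input capsules routing to 2 output capsules of dimension 4
rng = np.random.default_rng(0)
v = dynamic_routing(rng.normal(size=(8, 2, 4)))
print(v.shape)                            # (2, 4)
print(bool(np.all(np.linalg.norm(v, axis=-1) < 1.0)))  # True: squashed lengths < 1
```

In a spoofing countermeasure, the final output capsules would typically correspond to the bonafide and spoofed classes, with the capsule length read as a class score.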
© 2022 Springer Nature Singapore Pte Ltd.
Cite this paper
Mao, T., Yan, D., Gong, Y., Wang, R. (2022). Identification of Synthetic Spoofed Speech with Deep Capsule Network. In: Cao, C., Zhang, Y., Hong, Y., Wang, D. (eds) Frontiers in Cyber Security. FCS 2021. Communications in Computer and Information Science, vol 1558. Springer, Singapore. https://doi.org/10.1007/978-981-19-0523-0_17
DOI: https://doi.org/10.1007/978-981-19-0523-0_17
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-0522-3
Online ISBN: 978-981-19-0523-0