
GhostVec: Directly Extracting Speaker Embedding from End-to-End Speech Recognition Model Using Adversarial Examples

  • Conference paper
Neural Information Processing (ICONIP 2022)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1793)


Abstract

Obtaining high-quality speaker embedding representations can improve the performance of a range of tasks, such as speaker/speech recognition, multi-speaker dialogue, and translation systems. An automatic speech recognition (ASR) system is trained on massive amounts of speech data and therefore encodes rich speaker information, yet there are no existing attempts to protect the speaker embedding space of ASR systems from adversarial attacks. This paper proposes GhostVec, a novel method to export the speaker space from an ASR system without any external speaker verification system or real human voice as a reference. More specifically, we extract speaker embeddings from a transformer-based ASR system and propose two kinds of targeted adversarial embeddings (GhostVecs), generated at the feature level and the embedding level, respectively. We evaluate the similarity between each GhostVec and the embedding of the corresponding target speaker, randomly selected from LibriSpeech. Experimental results show that the proposed methods perform well in generating embeddings similar to those of the target speakers. We hope the preliminary findings of this study will catalyze future research on speaker recognition-related topics.
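The core idea the abstract describes, generating a targeted adversarial input whose internal ASR representation approaches a chosen speaker's embedding, can be illustrated with a short gradient-based sketch. The snippet below is a minimal PGD-style illustration under stated assumptions, not the authors' implementation: the `asr_encoder` module, the mean pooling used as embedding extraction, and the hyperparameters (`eps`, `alpha`, `steps`) are all illustrative stand-ins.

```python
# Minimal sketch of a feature-level GhostVec-style attack. Assumptions:
# asr_encoder maps (T, F) filterbank features to (T, D) hidden states,
# and mean pooling stands in for the paper's embedding extraction.
import torch
import torch.nn.functional as F

def extract_embedding(asr_encoder, feats):
    """Pool the ASR encoder's hidden states into one utterance-level vector."""
    hidden = asr_encoder(feats)   # (T, D) encoder hidden states
    return hidden.mean(dim=0)     # (D,) pooled speaker-like embedding

def ghostvec_feature_level(asr_encoder, carrier_feats, target_emb,
                           eps=0.3, alpha=0.01, steps=100):
    """PGD-style search for a bounded feature perturbation whose encoder
    embedding is close, in cosine distance, to target_emb."""
    # Freeze the encoder; only the perturbation delta is optimized.
    for p in asr_encoder.parameters():
        p.requires_grad_(False)
    delta = torch.zeros_like(carrier_feats, requires_grad=True)
    for _ in range(steps):
        emb = extract_embedding(asr_encoder, carrier_feats + delta)
        # Minimize cosine distance to the target speaker embedding.
        loss = 1.0 - F.cosine_similarity(emb, target_emb, dim=0)
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()  # signed gradient step
            delta.clamp_(-eps, eps)             # keep the perturbation bounded
        delta.grad.zero_()
    return (carrier_feats + delta).detach()
```

Success would then be scored along the lines of the paper's evaluation, e.g., by the cosine similarity between the embedding of the resulting adversarial input and the target speaker's reference embedding.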



Author information


Corresponding author

Correspondence to Hao Huang.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Chen, X., Li, S., Huang, H. (2023). GhostVec: Directly Extracting Speaker Embedding from End-to-End Speech Recognition Model Using Adversarial Examples. In: Tanveer, M., Agarwal, S., Ozawa, S., Ekbal, A., Jatowt, A. (eds) Neural Information Processing. ICONIP 2022. Communications in Computer and Information Science, vol 1793. Springer, Singapore. https://doi.org/10.1007/978-981-99-1645-0_40


  • DOI: https://doi.org/10.1007/978-981-99-1645-0_40

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-1644-3

  • Online ISBN: 978-981-99-1645-0

  • eBook Packages: Computer Science, Computer Science (R0)
