Abstract
Obtaining high-quality speaker embedding representations can improve the performance of a range of tasks, such as speaker/speech recognition, multi-speaker dialogue, and translation systems. An automatic speech recognition (ASR) system is trained on massive amounts of speech data and therefore encodes rich speaker information. However, there are no existing attempts to protect the speaker embedding space of ASR systems from adversarial attacks. This paper proposes GhostVec, a novel method to extract the speaker space from an ASR system without any external speaker verification system or real human voice as a reference. More specifically, we extract speaker embeddings from a transformer-based ASR system, and propose two kinds of targeted adversarial embedding (GhostVec), generated at the feature level and the embedding level, respectively. We evaluate the similarity between GhostVecs and the embeddings of corresponding speakers randomly selected from LibriSpeech. Experimental results show that the proposed methods achieve superior performance in generating embeddings similar to those of the target speakers. We hope the preliminary findings in this study catalyze future research on speaker recognition-related topics.
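The embedding-level attack described above can be illustrated with a minimal, hypothetical sketch: a fixed linear map stands in for the ASR encoder's speaker-embedding path (the paper uses a transformer encoder), and the input "features" are iteratively perturbed by gradient ascent so that the resulting embedding rotates toward a target speaker embedding. All names and the toy encoder are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the ASR encoder's speaker-embedding path: a fixed
# random linear map from input "features" (64-dim) to an embedding (16-dim).
W = rng.normal(size=(16, 64))

def embed(x):
    return W @ x

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical target speaker embedding we want the input to imitate.
target = rng.normal(size=16)

# Embedding-level targeted optimisation: perturb the input so the
# encoder's embedding moves toward the target embedding.
x = rng.normal(size=64)
init_sim = cosine(embed(x), target)
lr = 0.2
for _ in range(500):
    e = embed(x)
    ne, nt = np.linalg.norm(e), np.linalg.norm(target)
    # Gradient of cos(e, target) with respect to the embedding e
    d_cos = target / (ne * nt) - (e @ target) * e / (ne**3 * nt)
    # Chain rule through the linear encoder, then take a normalised
    # ascent step on cosine similarity.
    g = W.T @ d_cos
    x += lr * g / (np.linalg.norm(g) + 1e-12)

final_sim = cosine(embed(x), target)
print(f"similarity: {init_sim:.3f} -> {final_sim:.3f}")
```

A real attack would replace the linear map with the trained ASR encoder and backpropagate through it (e.g., with automatic differentiation), typically also bounding the perturbation so the adversarial input remains close to benign speech.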
Ā© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Chen, X., Li, S., Huang, H. (2023). GhostVec: Directly Extracting Speaker Embedding fromĀ End-to-End Speech Recognition Model Using Adversarial Examples. In: Tanveer, M., Agarwal, S., Ozawa, S., Ekbal, A., Jatowt, A. (eds) Neural Information Processing. ICONIP 2022. Communications in Computer and Information Science, vol 1793. Springer, Singapore. https://doi.org/10.1007/978-981-99-1645-0_40
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-1644-3
Online ISBN: 978-981-99-1645-0