Abstract
Obtaining high-quality speaker embedding representations can improve the performance of a range of tasks, such as speaker/speech recognition, multi-speaker dialogue, and translation systems. An automatic speech recognition (ASR) system is trained on massive amounts of speech data and therefore encodes rich speaker information. However, there are no existing attempts to protect the speaker embedding space of ASR systems from adversarial attacks. This paper proposes GhostVec, a novel method to extract the speaker space from an ASR system without any external speaker verification system or real human voice as a reference. More specifically, we extract speaker embeddings from a transformer-based ASR system, and propose two kinds of targeted adversarial embedding (GhostVec), generated at the feature level and the embedding level, respectively. We evaluate the similarity between GhostVecs and the embeddings of corresponding speakers randomly selected from LibriSpeech. Experimental results show that the proposed methods achieve superior performance in generating embeddings similar to those of the target speakers. We hope the preliminary findings in this study catalyze future research on speaker recognition-related topics.
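The embedding-level attack described above can be illustrated with a minimal, hypothetical sketch: a fixed linear map stands in for the ASR encoder's speaker-embedding path (the paper uses a transformer encoder), and the input "features" are iteratively perturbed by gradient ascent so that the resulting embedding rotates toward a target speaker embedding. All names and the toy encoder are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the ASR encoder's speaker-embedding path: a fixed
# random linear map from input "features" (64-dim) to an embedding (16-dim).
W = rng.normal(size=(16, 64))

def embed(x):
    return W @ x

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical target speaker embedding we want the input to imitate.
target = rng.normal(size=16)

# Embedding-level targeted optimisation: perturb the input so the
# encoder's embedding moves toward the target embedding.
x = rng.normal(size=64)
init_sim = cosine(embed(x), target)
lr = 0.2
for _ in range(500):
    e = embed(x)
    ne, nt = np.linalg.norm(e), np.linalg.norm(target)
    # Gradient of cos(e, target) with respect to the embedding e
    d_cos = target / (ne * nt) - (e @ target) * e / (ne**3 * nt)
    # Chain rule through the linear encoder, then take a normalised
    # ascent step on cosine similarity.
    g = W.T @ d_cos
    x += lr * g / (np.linalg.norm(g) + 1e-12)

final_sim = cosine(embed(x), target)
print(f"similarity: {init_sim:.3f} -> {final_sim:.3f}")
```

A real attack would replace the linear map with the trained ASR encoder and backpropagate through it (e.g., with automatic differentiation), typically also bounding the perturbation so the adversarial input remains close to benign speech.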
Ā© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Chen, X., Li, S., Huang, H. (2023). GhostVec: Directly Extracting Speaker Embedding fromĀ End-to-End Speech Recognition Model Using Adversarial Examples. In: Tanveer, M., Agarwal, S., Ozawa, S., Ekbal, A., Jatowt, A. (eds) Neural Information Processing. ICONIP 2022. Communications in Computer and Information Science, vol 1793. Springer, Singapore. https://doi.org/10.1007/978-981-99-1645-0_40
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-1644-3
Online ISBN: 978-981-99-1645-0