DOI: 10.1145/3595916.3626366 (short paper)

Reprogramming Self-supervised Learning-based Speech Representations for Speaker Anonymization

Published: 01 January 2024

Abstract

Current speaker anonymization methods, especially those based on self-supervised learning (SSL) models, require massive computational resources to hide speaker identity. This paper proposes an effective and parameter-efficient speaker anonymization method based on recent end-to-end model reprogramming technology. To improve anonymization performance, we first extract speaker representations from large SSL models as the speaker identity. To hide the speaker’s identity, we then reprogram the speaker representation by adapting the speaker to a pseudo domain. Extensive experiments on the VoicePrivacy Challenge (VPC) 2022 datasets demonstrate the effectiveness of the proposed parameter-efficient anonymization method. While achieving performance comparable to the VPC 2022 strong baseline 1.b, our approach also consumes fewer computational resources during anonymization.
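The reprogramming idea the abstract describes can be sketched as follows: the large SSL model stays frozen, and only a small trainable transform on top of the extracted speaker embedding is adapted toward a pseudo-speaker (pseudo-domain) target. This is a minimal illustrative sketch, not the paper's implementation; the embedding dimension, the `ReprogrammingLayer` name, and the affine form of the transform are all assumptions made for illustration.

```python
import random

random.seed(0)

# Assumed embedding size; stand-in for a speaker embedding extracted
# from a frozen SSL model (e.g. a wav2vec 2.0- or WavLM-style encoder).
EMB_DIM = 256
frozen_speaker_embedding = [random.gauss(0.0, 1.0) for _ in range(EMB_DIM)]

class ReprogrammingLayer:
    """Hypothetical tiny trainable affine map that nudges the frozen
    speaker embedding toward a pseudo-speaker target. Only W and b
    would be updated during training; the SSL backbone is untouched,
    which is what makes the approach parameter-efficient."""
    def __init__(self, dim):
        # Initialize near the identity so the transform starts close to
        # the original speaker and is adapted away from it by training.
        self.W = [[(1.0 if i == j else 0.0) + 0.01 * random.gauss(0.0, 1.0)
                   for j in range(dim)] for i in range(dim)]
        self.b = [0.0] * dim

    def __call__(self, emb):
        # Affine map: W @ emb + b.
        return [sum(w * x for w, x in zip(row, emb)) + bi
                for row, bi in zip(self.W, self.b)]

    def num_params(self):
        return len(self.W) * len(self.W[0]) + len(self.b)

layer = ReprogrammingLayer(EMB_DIM)
anonymized = layer(frozen_speaker_embedding)

# The trainable transform is tiny compared with the SSL model itself
# (wav2vec 2.0 Large is on the order of 300M parameters).
print(layer.num_params())  # 256*256 + 256 = 65792
```

The point of the sketch is the parameter count: training only the reprogramming transform touches tens of thousands of parameters rather than the hundreds of millions in the SSL backbone.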


Cited By

  • (2024) Speaker Anonymization: Disentangling Speaker Features from Pre-Trained Speech Embeddings for Voice Conversion. Applied Sciences 14(9), 3876. DOI: 10.3390/app14093876. Online publication date: 30-Apr-2024


      Published In

      MMAsia '23: Proceedings of the 5th ACM International Conference on Multimedia in Asia
      December 2023
      745 pages
ISBN: 9798400702051
DOI: 10.1145/3595916

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

1. End-to-end model reprogramming
2. Speaker anonymization
3. Privacy

      Qualifiers

      • Short-paper
      • Research
      • Refereed limited

      Funding Sources

      • JSPS KAKENHI Grant No.

      Conference

MMAsia '23: ACM Multimedia Asia
December 6-8, 2023
Tainan, Taiwan

      Acceptance Rates

      Overall Acceptance Rate 59 of 204 submissions, 29%

Article Metrics

• Downloads (Last 12 months): 24
• Downloads (Last 6 weeks): 3
Reflects downloads up to 28 Feb 2025
