DOI: 10.1145/3595916.3626366 (short paper)

Reprogramming Self-supervised Learning-based Speech Representations for Speaker Anonymization

Published: 01 January 2024

Abstract

Current speaker anonymization methods, especially those based on self-supervised learning (SSL) models, require massive computational resources to hide speaker identity. This paper proposes an effective and parameter-efficient speaker anonymization method based on recent end-to-end model reprogramming technology. To improve anonymization performance, we first extract speaker representations from large SSL models as the speaker identity. To hide the speaker’s identity, we then reprogram the speaker representation by adapting the speaker to a pseudo domain. Extensive experiments on the VoicePrivacy Challenge (VPC) 2022 datasets demonstrate the effectiveness of the proposed parameter-efficient anonymization method. While achieving performance comparable to the VPC 2022 strong baseline 1.b, our approach also consumes fewer computational resources during anonymization.
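The reprogramming idea the abstract describes can be sketched as follows: the large SSL model stays frozen, and only a small trainable transform on top of the extracted speaker embedding is adapted toward a pseudo-speaker (pseudo-domain) target. This is a minimal illustrative sketch, not the paper's implementation; the embedding dimension, the `ReprogrammingLayer` name, and the affine form of the transform are all assumptions made for illustration.

```python
import random

random.seed(0)

# Assumed embedding size; stand-in for a speaker embedding extracted
# from a frozen SSL model (e.g. a wav2vec 2.0- or WavLM-style encoder).
EMB_DIM = 256
frozen_speaker_embedding = [random.gauss(0.0, 1.0) for _ in range(EMB_DIM)]

class ReprogrammingLayer:
    """Hypothetical tiny trainable affine map that nudges the frozen
    speaker embedding toward a pseudo-speaker target. Only W and b
    would be updated during training; the SSL backbone is untouched,
    which is what makes the approach parameter-efficient."""
    def __init__(self, dim):
        # Initialize near the identity so the transform starts close to
        # the original speaker and is adapted away from it by training.
        self.W = [[(1.0 if i == j else 0.0) + 0.01 * random.gauss(0.0, 1.0)
                   for j in range(dim)] for i in range(dim)]
        self.b = [0.0] * dim

    def __call__(self, emb):
        # Affine map: W @ emb + b.
        return [sum(w * x for w, x in zip(row, emb)) + bi
                for row, bi in zip(self.W, self.b)]

    def num_params(self):
        return len(self.W) * len(self.W[0]) + len(self.b)

layer = ReprogrammingLayer(EMB_DIM)
anonymized = layer(frozen_speaker_embedding)

# The trainable transform is tiny compared with the SSL model itself
# (wav2vec 2.0 Large is on the order of 300M parameters).
print(layer.num_params())  # 256*256 + 256 = 65792
```

The point of the sketch is the parameter count: training only the reprogramming transform touches tens of thousands of parameters rather than the hundreds of millions in the SSL backbone.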


Cited By

  • (2024) Speaker Anonymization: Disentangling Speaker Features from Pre-Trained Speech Embeddings for Voice Conversion. Applied Sciences 14(9), 3876. DOI: 10.3390/app14093876. Online publication date: 30-Apr-2024


      Published In

      MMAsia '23: Proceedings of the 5th ACM International Conference on Multimedia in Asia
      December 2023
      745 pages
ISBN: 9798400702051
DOI: 10.1145/3595916

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

1. End-to-end model reprogramming
2. Speaker anonymization
3. Privacy

      Qualifiers

      • Short-paper
      • Research
      • Refereed limited

      Funding Sources

      • JSPS KAKENHI Grant No.

      Conference

MMAsia '23: ACM Multimedia Asia
December 6-8, 2023
Tainan, Taiwan

      Acceptance Rates

      Overall Acceptance Rate 59 of 204 submissions, 29%

Article Metrics

• Downloads (Last 12 months): 24
• Downloads (Last 6 weeks): 3
Reflects downloads up to 28 Feb 2025
