
Voice Conversion Using Learnable Similarity-Guided Masked Autoencoder

  • Conference paper
  • In: Digital Forensics and Watermarking (IWDW 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13825)


Abstract

Voice conversion (VC) is an important voice forgery method that poses a serious threat to personal privacy protection, especially given its remarkable achievements in timbre modification. To support forensic research on converted speech and to further enrich the sources of fake speech, it is imperative to investigate new, robust VC methods. VC can also be viewed as a typical style transfer task in which style refers to speaker identity, which suggests that sufficient feature decoupling is the key to robust performance. However, mainstream decoupling methods based on information-constrained bottlenecks still fail to achieve a robust content-style trade-off. In this paper, we propose a learnable similarity-guided mask (LSGM) algorithm to address this robustness problem. First, to make feature decoupling independent of specific language constructs and more applicable to diverse content, LSGM performs inter-frame feature compression relying only on the similarity of adjacent frames rather than on complex inter-frame content correlations. Second, we implement feature compression by masking instead of dimensionality reduction, so no additional modules are needed to convey frame-length information. Building on LSGM, we propose MAE-VC, an end-to-end masked autoencoder (MAE) with self-supervised representation learning. Experimental results indicate that MAE-VC performs comparably to state-of-the-art methods on speaker similarity and significantly improves content consistency.

This work was supported by the National Key Technology Research and Development Program under Grant 2020AAA0140000.
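To make the masking idea from the abstract concrete, here is a minimal Python sketch of adjacent-frame, similarity-guided masking. It is an illustration under stated assumptions, not the authors' implementation (see note 1 for the official repository): a fixed cosine-similarity threshold `tau` stands in for the learnable criterion, and the (T, D) feature shape and toy input are hypothetical.

```python
import numpy as np

def similarity_guided_mask(frames: np.ndarray, tau: float = 0.95) -> np.ndarray:
    """Mask content frames that are nearly identical to their predecessor.

    `frames` is a (T, D) array of per-frame content features; `tau` is a
    fixed stand-in for the paper's learnable threshold. Returns a boolean
    keep-mask of shape (T,).
    """
    # Cosine similarity between each frame and the one before it.
    unit = frames / (np.linalg.norm(frames, axis=1, keepdims=True) + 1e-8)
    sim = (unit[1:] * unit[:-1]).sum(axis=1)  # shape (T-1,)

    keep = np.ones(frames.shape[0], dtype=bool)  # frame 0 is always kept
    keep[1:] = sim < tau  # redundant (too-similar) neighbors get masked
    return keep

# Toy usage: 100 frames of 80-dim features (e.g. mel bins).
frames = np.random.randn(100, 80).astype(np.float32)
keep = similarity_guided_mask(frames)
# Masking (zeroing) rather than dropping frames preserves sequence length.
masked = frames * keep[:, None]
print(f"kept {keep.sum()} of {keep.size} frames")
```

Because masked frames are zeroed in place rather than removed, the output retains the original number of frames, illustrating the abstract's point that masking, unlike dimensionality reduction, needs no extra module to convey frame-length information.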


Notes

  1. https://github.com/BrightGu/MAE-VC.

  2. https://github.com/yistLin/dvector.

  3. https://huggingface.co/docs/transformers/model_doc/wav2vec2.

  4. https://github.com/nl8590687/ASRT_SpeechRecognition.

  5. https://brightgu.github.io/MAE-VC/.


Author information


Corresponding author

Correspondence to Xianfeng Zhao.



Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Gu, Y., Zhao, X., Yi, X., Xiao, J. (2023). Voice Conversion Using Learnable Similarity-Guided Masked Autoencoder. In: Zhao, X., Tang, Z., Comesaña-Alfaro, P., Piva, A. (eds) Digital Forensics and Watermarking. IWDW 2022. Lecture Notes in Computer Science, vol 13825. Springer, Cham. https://doi.org/10.1007/978-3-031-25115-3_4


  • DOI: https://doi.org/10.1007/978-3-031-25115-3_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-25114-6

  • Online ISBN: 978-3-031-25115-3

  • eBook Packages: Computer Science, Computer Science (R0)
