
Voice Conversion Using Learnable Similarity-Guided Masked Autoencoder

  • Conference paper
  • In: Digital Forensics and Watermarking (IWDW 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13825)


Abstract

Voice conversion (VC) is an important voice forgery method that poses a serious threat to personal privacy protection, especially given its remarkable achievements in timbre modification. To support forensic research on converted speech and to further enrich the sources of fake speech, it is imperative to investigate new, robust VC methods. VC can also be viewed as a typical style transfer task in which style refers to speaker identity, which suggests that sufficient feature decoupling is the key to robust performance. However, mainstream decoupling methods based on information-constrained bottlenecks still fail to achieve a robust content-style trade-off. In this paper, we propose a learnable similarity-guided mask (LSGM) algorithm to address this robustness problem. First, to make feature decoupling independent of specific language constructs and more applicable to diverse content, LSGM performs inter-frame feature compression relying only on the similarity of adjacent frames rather than on complex inter-frame content correlations. Second, we implement feature compression by masking instead of dimensionality reduction, so no additional modules are needed to convey frame-length information. Building on LSGM, we propose MAE-VC, an end-to-end masked autoencoder (MAE) with self-supervised representation learning. Experimental results indicate that MAE-VC performs comparably to state-of-the-art methods on speaker similarity and significantly improves content consistency.

This work was supported by the National Key Technology Research and Development Program under Grant 2020AAA0140000.
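To make the masking idea from the abstract concrete, here is a minimal Python sketch of adjacent-frame, similarity-guided masking. It is an illustration under stated assumptions, not the authors' implementation (see note 1 for the official repository): a fixed cosine-similarity threshold `tau` stands in for the learnable criterion, and the (T, D) feature shape and toy input are hypothetical.

```python
import numpy as np

def similarity_guided_mask(frames: np.ndarray, tau: float = 0.95) -> np.ndarray:
    """Mask content frames that are nearly identical to their predecessor.

    `frames` is a (T, D) array of per-frame content features; `tau` is a
    fixed stand-in for the paper's learnable threshold. Returns a boolean
    keep-mask of shape (T,).
    """
    # Cosine similarity between each frame and the one before it.
    unit = frames / (np.linalg.norm(frames, axis=1, keepdims=True) + 1e-8)
    sim = (unit[1:] * unit[:-1]).sum(axis=1)  # shape (T-1,)

    keep = np.ones(frames.shape[0], dtype=bool)  # frame 0 is always kept
    keep[1:] = sim < tau  # redundant (too-similar) neighbors get masked
    return keep

# Toy usage: 100 frames of 80-dim features (e.g. mel bins).
frames = np.random.randn(100, 80).astype(np.float32)
keep = similarity_guided_mask(frames)
# Masking (zeroing) rather than dropping frames preserves sequence length.
masked = frames * keep[:, None]
print(f"kept {keep.sum()} of {keep.size} frames")
```

Because masked frames are zeroed in place rather than removed, the output retains the original number of frames, illustrating the abstract's point that masking, unlike dimensionality reduction, needs no extra module to convey frame-length information.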


Notes

  1. https://github.com/BrightGu/MAE-VC.

  2. https://github.com/yistLin/dvector.

  3. https://huggingface.co/docs/transformers/model_doc/wav2vec2.

  4. https://github.com/nl8590687/ASRT_SpeechRecognition.

  5. https://brightgu.github.io/MAE-VC/.


Author information


Corresponding author

Correspondence to Xianfeng Zhao.



Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Gu, Y., Zhao, X., Yi, X., Xiao, J. (2023). Voice Conversion Using Learnable Similarity-Guided Masked Autoencoder. In: Zhao, X., Tang, Z., Comesaña-Alfaro, P., Piva, A. (eds) Digital Forensics and Watermarking. IWDW 2022. Lecture Notes in Computer Science, vol 13825. Springer, Cham. https://doi.org/10.1007/978-3-031-25115-3_4


  • DOI: https://doi.org/10.1007/978-3-031-25115-3_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-25114-6

  • Online ISBN: 978-3-031-25115-3

  • eBook Packages: Computer Science, Computer Science (R0)
