Abstract
There is a growing demand for the digitization of day-to-day tasks, and hence a surge in the use of Intelligent Personal Assistants. The extensive use of these smart digital assistants calls for security and privacy preservation techniques, because they process personally identifiable characteristics of the user. To that end, various privacy preservation techniques for different types of voice assistants have been explored, and voice-based digital assistants in particular require such protection. Thus, in this study, we explore prosody modification methods to alter the speaker-specific characteristics of the user's voice, so that the modified utterances can be made publicly available for training different speech-based systems. This study presents three data augmentation techniques as voice anonymization methods to modify speaker-dependent speech parameters (in particular, the fundamental frequency \(F_{0}\)). Voice anonymization and speech intelligibility are measured objectively using automatic speaker verification (ASV) and automatic speech recognition (ASR) experiments, respectively, on the development and test sets of the LibriSpeech dataset. For speed perturbation-based anonymization, a relative increase in % EER of up to 53.7% is observed for a perturbation factor \(\alpha = 0.8\) for both male and female speakers. For the same case, the % WER remained adequate (lower than that of the baseline system), supporting the use of the speed perturbation method as the anonymization algorithm in a voice privacy system. Similar performance is observed for pitch perturbation with a perturbation factor \(\lambda = -300\). However, tempo perturbation was not found to be useful for speaker anonymization in our experiments, with % EER on the order of 5–10%.
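Speed perturbation with a factor \(\alpha\) resamples the waveform so that both duration and \(F_{0}\) scale together (unlike tempo perturbation, which changes duration only, or pitch perturbation, which shifts \(F_{0}\) only). A minimal sketch of this resampling idea is shown below; the function name and parameters are illustrative only, not the paper's actual toolchain:

```python
import numpy as np

def speed_perturb(signal: np.ndarray, alpha: float) -> np.ndarray:
    """Resample the waveform by a speed factor alpha.

    alpha < 1 slows the signal down (longer duration, lower pitch);
    alpha > 1 speeds it up (shorter duration, higher pitch). Duration
    and F0 change together, which is what distinguishes speed
    perturbation from duration-only tempo perturbation.
    """
    n_out = int(round(len(signal) / alpha))
    # Fractional positions in the original signal to interpolate at.
    src_idx = np.linspace(0.0, len(signal) - 1, num=n_out)
    return np.interp(src_idx, np.arange(len(signal)), signal)

# A 1 s, 1 kHz tone perturbed with alpha = 0.8 becomes 1.25x longer,
# and its perceived pitch drops correspondingly (to 800 Hz).
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000 * t)
slow = speed_perturb(tone, 0.8)
print(len(tone), len(slow))  # 16000 20000
```

With \(\alpha = 0.8\), the case reported above, every utterance is stretched by a factor of 1.25 and its \(F_{0}\) contour is lowered by the same ratio, which is what degrades the ASV system's speaker match while leaving the speech intelligible to ASR.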
Acknowledgements
We would like to thank the organizers of the VoicePrivacy Challenge 2020 for publicly releasing a standard and statistically meaningful corpus, without which this work would not have been possible. The authors also thank the authorities of the Dhirubhai Ambani Institute of Information and Communication Technology (DA-IICT), Gandhinagar, India, for their support during our research work. We also gratefully acknowledge the subjects who participated in the subjective evaluation.
Ethics declarations
Conflict of Interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This article is part of the topical collection “Pattern Recognition and Machine Learning” guest edited by Ashish Ghosh, Monidipa Das and Anwesha Law.
Singh, D.K., Prajapati, G.P. & Patil, H.A. Voice Privacy Using Time-Scale and Pitch Modification. SN COMPUT. SCI. 5, 243 (2024). https://doi.org/10.1007/s42979-023-02549-8