ABSTRACT
Hundreds of hours of audio are recorded and transmitted over the Internet for voice interactions such as virtual calls or speech recognition. As these recordings are uploaded, the biometric information embedded in them, i.e., voiceprints, is unnecessarily exposed. This paper proposes the first privacy-enhanced microphone module (i.e., MicPro) that can produce anonymous audio recordings with biometric information suppressed while preserving speech quality for human perception or linguistic content for speech recognition. Because microphone modules have limited hardware capabilities, previous works that modify recordings at the software level are inapplicable. To achieve anonymity in this scenario, MicPro transforms formants, which are distinct for each person due to the unique physiological structure of the vocal organs. The formant transformations are done by modifying the line spectral frequencies (LSFs) provided by a codec popular in low-latency communications (i.e., CELP).
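As a rough illustration of the LSF modification described above, the sketch below applies a per-coefficient affine map to one frame's LSFs and then restores the strictly increasing ordering inside (0, π) that a stable synthesis filter requires. The function name `transform_lsfs`, the affine form of the map, the `min_gap` value, and the example numbers are all assumptions made for illustration; they are not MicPro's actual transformation.

```python
import math


def transform_lsfs(lsfs, scale, shift, min_gap=0.02):
    """Map each LSF (in radians) to scale*lsf + shift, then repair the result.

    Hypothetical helper: the affine map and min_gap are illustrative choices.
    The repair step keeps the output strictly increasing inside (0, pi),
    which is the condition for the reconstructed LPC filter to stay stable.
    """
    if not (len(lsfs) == len(scale) == len(shift)):
        raise ValueError("lsfs, scale and shift must have the same length")

    moved = sorted(a * w + b for w, a, b in zip(lsfs, scale, shift))

    repaired = []
    lower = min_gap  # smallest value the next LSF may take
    n = len(moved)
    for i, w in enumerate(moved):
        # Leave room below pi for the LSFs that still follow this one.
        upper = math.pi - min_gap * (n - i)
        w = min(max(w, lower), upper)
        repaired.append(w)
        lower = w + min_gap
    return repaired


# Made-up 10th-order LSF frame (radians), not taken from real speech.
frame = [0.25, 0.45, 0.70, 1.00, 1.30, 1.60, 1.90, 2.20, 2.50, 2.80]
warped = transform_lsfs(frame, scale=[1.15] * 10, shift=[0.0] * 10)
```

Scaling every LSF by a factor above 1 pushes the spectral peaks, and hence the formants, toward higher frequencies. The clamping step only matters when the map would push values out of range or too close together.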
To strike a balance between anonymity and usability, we use a multi-objective genetic algorithm (NSGA-II) to optimize the transformation coefficients. We implement MicPro on an off-the-shelf microphone module and evaluate its performance on several automatic speaker verification (ASV) systems, automatic speech recognition (ASR) systems, and corpora, as well as in a real-world setup. Our experiments show that, against state-of-the-art ASV systems, MicPro outperforms existing software-based strategies that utilize signal processing (SP) techniques, achieving an equal error rate (EER) that is 5~10% higher and an MMR that is 20% higher than existing works while maintaining a comparable level of usability.
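The anonymity/usability trade-off can be pictured with the dominance test that NSGA-II builds on: a candidate set of coefficients is kept only if no other candidate is at least as good on every objective and strictly better on one. The sketch below is a minimal, assumed formulation with two objectives to minimize, `1 - EER` (less anonymity is worse) and a word error rate (less usability is worse). The candidate names and numbers are toy values, not results from the paper, and this shows only the non-dominated filter, not the full NSGA-II loop.

```python
def dominates(a, b):
    """True if objective vector a is no worse than b on every objective
    and strictly better on at least one (all objectives are minimized)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(candidates):
    """Return the names of candidates that no other candidate dominates."""
    return [
        name
        for name, objectives in candidates.items()
        if not any(
            dominates(other, objectives)
            for other_name, other in candidates.items()
            if other_name != name
        )
    ]


# Toy objective values (1 - EER, WER) for hypothetical coefficient sets.
candidates = {
    "A": (0.60, 0.08),
    "B": (0.55, 0.12),
    "C": (0.65, 0.10),  # worse than A on both objectives
    "D": (0.80, 0.05),
}
front = pareto_front(candidates)
```

With these toy values, candidate C is dropped because A beats it on both objectives. The remaining candidates represent different trade-offs between which a final operating point would still have to be chosen.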
Index Terms
- MicPro: Microphone-based Voice Privacy Protection