DOI: 10.1145/3576915.3616616
research-article

MicPro: Microphone-based Voice Privacy Protection

Published: 21 November 2023

ABSTRACT

Hundreds of hours of audio are recorded and transmitted over the Internet for voice interactions such as virtual calls or speech recognition. As these recordings are uploaded, the biometric information embedded in them, i.e., voiceprints, is unnecessarily exposed. This paper proposes the first privacy-enhanced microphone module, MicPro, which produces anonymous audio recordings with biometric information suppressed while preserving speech quality for human perception or linguistic content for speech recognition. Because microphone modules have limited hardware capabilities, previous approaches that modify recordings at the software level are inapplicable. To achieve anonymity in this setting, MicPro transforms formants, which are distinct for each person due to the unique physiological structure of the vocal organs. The formant transformations are performed by modifying the line spectral frequencies (LSFs) provided by CELP, a codec popular in low-latency communications.
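The link between LSFs and formants can be illustrated with a small NumPy sketch. It is not MicPro's implementation: the helper names (`lpc_to_lsf`, `lsf_to_lpc`, `transform_lsf`), the toy pole positions, and the linear scale-and-shift warp are all assumptions made for illustration, and the conversion assumes an even LPC order and a minimum-phase predictor.

```python
import numpy as np

def lpc_to_lsf(a):
    """Convert LPC coefficients [1, a1, ..., ap] (even p) to sorted LSFs in radians."""
    a_ext = np.append(np.asarray(a, dtype=float), 0.0)
    p_poly = a_ext + a_ext[::-1]  # symmetric polynomial P(z)
    q_poly = a_ext - a_ext[::-1]  # antisymmetric polynomial Q(z)
    angles = np.angle(np.concatenate([np.roots(p_poly), np.roots(q_poly)]))
    eps = 1e-6  # drops the trivial roots at z = 1 and z = -1 and the mirrored negative angles
    return np.sort(angles[(angles > eps) & (angles < np.pi - eps)])

def lsf_to_lpc(lsf):
    """Rebuild LPC coefficients from sorted LSFs (even order only)."""
    p_poly = np.array([1.0, 1.0])   # trivial root of P(z) at z = -1
    q_poly = np.array([1.0, -1.0])  # trivial root of Q(z) at z = +1
    for w in lsf[0::2]:             # 1st, 3rd, ... LSFs belong to P(z)
        p_poly = np.convolve(p_poly, [1.0, -2.0 * np.cos(w), 1.0])
    for w in lsf[1::2]:             # 2nd, 4th, ... LSFs belong to Q(z)
        q_poly = np.convolve(q_poly, [1.0, -2.0 * np.cos(w), 1.0])
    return (0.5 * (p_poly + q_poly))[:-1]  # last coefficient cancels to zero

def transform_lsf(lsf, scale=1.1, shift=0.0):
    """Hypothetical warp: scale and shift the LSFs, keeping them inside (0, pi)."""
    warped = np.clip(scale * np.asarray(lsf) + shift, 0.01, np.pi - 0.01)
    return np.sort(warped)  # a real system would also enforce a minimum gap

# Toy vocal-tract filter with two resonances (assumed values, not from the paper).
poles = [0.9 * np.exp(1j * 0.3 * np.pi), 0.85 * np.exp(1j * 0.6 * np.pi)]
a = np.poly(poles + [np.conj(p) for p in poles]).real

lsf = lpc_to_lsf(a)
a_back = lsf_to_lpc(lsf)             # round trip recovers the original filter
lsf_new = transform_lsf(lsf)         # moved LSFs imply moved resonances
a_new = lsf_to_lpc(lsf_new)
formants_old = np.sort(np.angle(np.roots(a)))[-2:]
formants_new = np.sort(np.angle(np.roots(a_new)))[-2:]
print("resonance angles before:", formants_old)
print("resonance angles after: ", formants_new)
```

Working in the LSF domain is convenient here because a filter rebuilt from strictly increasing LSFs inside (0, pi) stays stable, so the warp cannot produce an unstable synthesis filter as long as the ordering is preserved.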

To strike a balance between anonymity and usability, we use a multi-objective genetic algorithm (NSGA-II) to optimize the transformation coefficients. We implement MicPro on an off-the-shelf microphone module and evaluate it on several automatic speaker verification (ASV) systems, automatic speech recognition (ASR) systems, and corpora, as well as in a real-world setup. Our experiments show that, against state-of-the-art ASV systems, MicPro outperforms existing software-based strategies that use signal processing (SP) techniques, achieving an EER that is 5-10% higher and an MMR that is 20% higher than existing works while maintaining a comparable level of usability.
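The anonymity versus usability trade-off can be made concrete with non-dominated (Pareto) filtering, the ranking idea that NSGA-II builds on. The sketch below is not the paper's optimizer: the function name `pareto_front`, the choice of objectives (1 - EER for anonymity, a word error rate for usability), and the candidate values are hypothetical.

```python
import numpy as np

def pareto_front(objectives):
    """Return indices of non-dominated rows; every objective is minimized."""
    objectives = np.asarray(objectives, dtype=float)
    keep = []
    for i, cand in enumerate(objectives):
        # A row dominates cand if it is no worse everywhere and better somewhere.
        dominated = np.any(
            np.all(objectives <= cand, axis=1) & np.any(objectives < cand, axis=1)
        )
        if not dominated:
            keep.append(i)
    return keep

# Hypothetical coefficient sets scored as (1 - EER, WER); values are made up.
candidates = np.array([
    [0.90, 0.05],
    [0.70, 0.08],
    [0.60, 0.20],
    [0.75, 0.10],  # worse than [0.70, 0.08] on both objectives
    [0.95, 0.06],  # worse than [0.90, 0.05] on both objectives
])
front = pareto_front(candidates)
print(front)
```

Each surviving candidate represents a different compromise; a full NSGA-II run would additionally rank the dominated candidates into successive fronts and use crowding distance to keep the population spread along the front.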


Published in

CCS '23: Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security
November 2023
3722 pages
ISBN: 9798400700507
DOI: 10.1145/3576915

Copyright © 2023 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States
