research-article

MicPro: Microphone-based Voice Privacy Protection

Authors:
Shilin Xiao, Zhejiang University, Hangzhou, China (ORCID 0009-0008-1319-0428)
Xiaoyu Ji, Zhejiang University, Hangzhou, China (ORCID 0000-0002-1101-0007)
Chen Yan, Zhejiang University, Hangzhou, China (ORCID 0000-0003-4430-5263)
Zhicong Zheng, Zhejiang University, Hangzhou, China (ORCID 0000-0002-7298-0381)
Wenyuan Xu, Zhejiang University, Hangzhou, China (ORCID 0000-0002-5043-9148)

CCS '23: Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, November 2023, Pages 1302–1316
https://doi.org/10.1145/3576915.3616616

Published: 21 November 2023

ABSTRACT

Hundreds of hours of audio are recorded and transmitted over the Internet for voice interactions such as virtual calls and speech recognition. As these recordings are uploaded, embedded biometric information, i.e., voiceprints, is unnecessarily exposed. This paper proposes the first privacy-enhanced microphone module, MicPro, which produces anonymous audio recordings with biometric information suppressed while preserving speech quality for human perception and linguistic content for speech recognition. Because microphone modules have limited hardware capabilities, previous approaches that modify recordings at the software level are inapplicable. To achieve anonymity in this setting, MicPro transforms formants, which are distinct for each person because of the unique physiological structure of the vocal organs. It performs these formant transformations by modifying the line spectral frequencies (LSFs) provided by CELP, a codec popular in low-latency communications.
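To make the idea of LSF-level formant modification concrete, the following is a minimal sketch and not MicPro's actual transformation, which the abstract does not specify. It assumes LSFs are given in radians in (0, π), as a CELP encoder would provide them, and applies a hypothetical power-law warp with a single coefficient `alpha`. The names `warp_lsfs` and `lsfs_to_lpc` are illustrative. The second function rebuilds the linear-prediction polynomial A(z) from the LSFs using the standard sum and difference polynomial construction, which is how a decoder would turn shifted LSFs into a shifted spectral envelope.

```python
import math


def warp_lsfs(lsfs, alpha):
    """Hypothetical power-law warp of LSFs given in radians, each in (0, pi).

    The map w -> pi * (w / pi) ** alpha is strictly increasing on (0, pi),
    so the ordering of the LSFs is preserved. alpha = 1 leaves them unchanged.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return [math.pi * (w / math.pi) ** alpha for w in lsfs]


def _poly_mul(x, y):
    """Multiply two polynomials given as coefficient lists in powers of z^-1."""
    out = [0.0] * (len(x) + len(y) - 1)
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            out[i + j] += xi * yj
    return out


def lsfs_to_lpc(lsfs):
    """Rebuild A(z) = [1, a1, ..., ap] from p LSFs, assuming p is even."""
    if len(lsfs) % 2 != 0:
        raise ValueError("this sketch assumes an even prediction order")
    p_poly = [1.0, 1.0]   # symmetric polynomial P(z), trivial root at z = -1
    q_poly = [1.0, -1.0]  # antisymmetric polynomial Q(z), trivial root at z = +1
    for k, w in enumerate(sorted(lsfs)):
        # Each LSF contributes a conjugate root pair on the unit circle.
        factor = [1.0, -2.0 * math.cos(w), 1.0]
        if k % 2 == 0:
            p_poly = _poly_mul(p_poly, factor)  # 1st, 3rd, ... LSF belong to P
        else:
            q_poly = _poly_mul(q_poly, factor)  # 2nd, 4th, ... LSF belong to Q
    # A(z) = (P(z) + Q(z)) / 2; the last coefficient cancels to zero.
    return [(pc + qc) / 2.0 for pc, qc in zip(p_poly, q_poly)][:-1]


# Made-up 4th-order LSF vector, used only to exercise the functions.
example_lsfs = [0.3, 0.9, 1.6, 2.4]
warped = warp_lsfs(example_lsfs, alpha=0.8)
warped_lpc = lsfs_to_lpc(warped)
```

Because the warp is monotonic, the warped LSFs stay sorted and inside (0, π), which is the condition the synthesis filter needs to remain stable. With `alpha` below 1 every LSF moves upward, which shifts the spectral peaks (formants) toward higher frequencies.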

To strike a balance between anonymity and usability, we use a multi-objective genetic algorithm (NSGA-II) to optimize the transformation coefficients. We implement MicPro on an off-the-shelf microphone module and evaluate it on several automatic speaker verification (ASV) systems, automatic speech recognition (ASR) systems, and corpora, as well as in a real-world setup. Our experiments show that, against state-of-the-art ASV systems, MicPro outperforms existing software-based strategies that rely on signal processing (SP) techniques, achieving an equal error rate (EER) that is 5–10% higher and an MMR that is 20% higher than existing work while maintaining a comparable level of usability.
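The trade-off between anonymity and usability can be illustrated with the ranking step at the heart of NSGA-II: sorting candidate solutions into Pareto fronts. The sketch below is a simplified, quadratic-time version of that step only; it omits the fast bookkeeping, crowding distance, crossover, and mutation of the full algorithm. The objective pairs are made-up numbers standing in for (anonymity loss, usability loss) of hypothetical coefficient settings, both to be minimized, and are not results from the paper.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def non_dominated_sort(points):
    """Group point indices into Pareto fronts; front 0 holds the best trade-offs."""
    remaining = set(range(len(points)))
    fronts = []
    while remaining:
        # A point joins the current front if no remaining point dominates it.
        front = sorted(
            i for i in remaining
            if not any(dominates(points[j], points[i]) for j in remaining if j != i)
        )
        fronts.append(front)
        remaining -= set(front)
    return fronts


# Made-up (anonymity_loss, usability_loss) pairs for five hypothetical candidates.
candidates = [(0.10, 0.40), (0.20, 0.20), (0.40, 0.10), (0.30, 0.30), (0.50, 0.50)]
fronts = non_dominated_sort(candidates)
```

Here `fronts` is `[[0, 1, 2], [3], [4]]`: the first three candidates each represent a different balance that none of the others beats on both objectives at once, while candidates 3 and 4 are dominated. A multi-objective optimizer returns such a front and leaves the final choice of operating point to the designer.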

Index Terms

MicPro: Microphone-based Voice Privacy Protection
1. Security and privacy
  1. Human and societal aspects of security and privacy
    1. Privacy protections
    2. Usability in security and privacy


Published in
CCS '23: Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security
November 2023
3722 pages
ISBN:9798400700507
DOI:10.1145/3576915
General Chairs: Weizhi Meng (Technical University of Denmark), Christian D. Jensen (Technical University of Denmark)
Program Chairs: Cas Cremers (CISPA Helmholtz Center for Information Security), Engin Kirda (Khoury College of Computer Sciences)
Copyright © 2023 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
anonymization
celp codec
microphone
voiceprint protection