DOI: 10.1145/3581791.3596832

Research Article

Towards Bone-Conducted Vibration Speech Enhancement on Head-Mounted Wearables

Published: 18 June 2023

ABSTRACT

Head-mounted wearables (HMWs) are rapidly growing in popularity. However, a gap remains in providing robust voice-related applications, such as conversation or command control, in complex environments with competing speakers and strong noise. The compact design of HMWs poses non-trivial challenges to existing speech enhancement systems that rely on microphone recordings alone. In this paper, we address this problem by exploiting bone vibration conducted through the skull. The key insight is that accelerometers are widely installed on head-mounted wearables and can capture the user's clean voice. We therefore develop VibVoice, a lightweight multi-modal speech enhancement system for head-mounted wearables. We design a two-branch encoder-decoder deep neural network that fuses the high-level features of the two modalities to reconstruct clean speech. To address the shortage of paired training data, we extensively measure the bone-conduction effect on a limited dataset to extract a physical impulse function for cross-modal data augmentation. We evaluate VibVoice on a dataset collected in the real world and compare it with two state-of-the-art baselines. Results show that VibVoice achieves up to 21% better PESQ and up to 26% better SNR than the baselines while requiring 72 times less paired data. In a user study with 35 participants, 87% of participants preferred VibVoice over the baseline. In addition, VibVoice requires 4 to 31 times less execution time than the baselines on mobile devices. Demo audio of VibVoice is available at https://www.youtube.com/watch?v=8_-s_C_NGRI.
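The cross-modal data augmentation idea above can be sketched as follows: given an impulse response measured for the bone-conduction channel, a synthetic accelerometer signal is obtained by convolving clean speech with that response. This is a minimal NumPy illustration, not the paper's implementation; the function name `synthesize_vibration`, the toy impulse response, and the additive Gaussian noise model are assumptions for illustration.

```python
import numpy as np

def synthesize_vibration(clean_speech, impulse_response, noise_std=0.005, rng=None):
    """Simulate the accelerometer channel: convolve clean speech with a
    bone-conduction impulse response, then add sensor noise (hypothetical model)."""
    rng = np.random.default_rng(0) if rng is None else rng
    # Convolution with the impulse response models propagation through the skull.
    vib = np.convolve(clean_speech, impulse_response, mode="full")[: len(clean_speech)]
    # Accelerometers are noisy; add a small Gaussian perturbation.
    vib = vib + rng.normal(0.0, noise_std, size=vib.shape)
    return vib.astype(np.float32)

# Toy stand-ins: a 1 s, 200 Hz tone at 16 kHz and a short decaying impulse response.
fs = 16000
speech = np.sin(2 * np.pi * 200 * np.arange(fs) / fs).astype(np.float32)
ir = np.exp(-np.arange(64) / 8.0).astype(np.float32)
ir /= ir.sum()  # normalize to unit DC gain
vib = synthesize_vibration(speech, ir)
```

Paired (microphone, accelerometer) training examples can then be generated from speech-only corpora, which is how a limited set of measured impulse responses can substitute for large amounts of recorded paired data.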


Published in
          MobiSys '23: Proceedings of the 21st Annual International Conference on Mobile Systems, Applications and Services
          June 2023
          651 pages
          ISBN:9798400701108
          DOI:10.1145/3581791

          Copyright © 2023 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

          Publisher

          Association for Computing Machinery

          New York, NY, United States



Acceptance Rates

MobiSys '23 paper acceptance rate: 41 of 198 submissions, 21%. Overall acceptance rate: 274 of 1,679 submissions, 16%.
