ABSTRACT
Head-mounted wearables (HMWs) are rapidly growing in popularity. However, a gap remains in providing robust voice-related applications, such as conversation or voice command control, in complex acoustic environments with competing speakers and strong noise. The compact design of HMWs poses non-trivial challenges to existing speech enhancement systems that rely on microphone recordings alone. In this paper, we address this problem by exploiting bone vibration conducted through the skull: accelerometers, already widely installed on HMWs, can capture the user's voice largely free of airborne noise. We develop VibVoice, a lightweight multi-modal speech enhancement system for head-mounted wearables. We design a two-branch encoder-decoder deep neural network that fuses high-level features of the two modalities to reconstruct clean speech. To address the scarcity of paired training data, we extensively measure the bone conduction effect on a limited dataset and extract its physical impulse response for cross-modal data augmentation. We evaluate VibVoice on a dataset collected in the real world and compare it with two state-of-the-art baselines. Results show that VibVoice achieves up to 21% higher PESQ and up to 26% higher SNR than the baselines while requiring 72 times less paired data. In a user study with 35 participants, 87% of participants preferred VibVoice over the baseline. In addition, VibVoice requires 4 to 31 times less execution time than the baselines on mobile devices. Demo audio of VibVoice is available at https://www.youtube.com/watch?v=8_-s_C_NGRI.
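The cross-modal augmentation idea described above — convolving clean air-conducted speech with a bone-conduction impulse response to synthesize paired accelerometer data — can be sketched as follows. This is a minimal illustration only: the exponentially decaying impulse response, the sample rate, and the white-noise "speech" stand-in are assumptions for demonstration, not the measured impulse functions or data from the paper.

```python
import numpy as np

def bone_conduction_transfer(speech, impulse_response):
    """Synthesize a paired vibration signal by convolving air-conducted
    speech with a bone-conduction impulse response (same length as input)."""
    return np.convolve(speech, impulse_response, mode="full")[: len(speech)]

fs = 16000                          # sample rate in Hz (assumed)
t = np.arange(64) / fs
ir = np.exp(-t * 2000)              # hypothetical decaying kernel: acts as a
ir /= ir.sum()                      # low-pass filter with unit DC gain,
                                    # mimicking the skull's high-frequency loss

rng = np.random.default_rng(0)
speech = rng.standard_normal(fs)    # 1 s of white noise as a stand-in signal
vib = bone_conduction_transfer(speech, ir)

# The synthesized vibration keeps low-frequency energy but attenuates the
# upper half of the spectrum relative to the input.
spec_in = np.abs(np.fft.rfft(speech))
spec_out = np.abs(np.fft.rfft(vib))
hi = slice(len(spec_in) // 2, None)
print(spec_out[hi].sum() < spec_in[hi].sum())
```

In this scheme, abundant unpaired speech corpora can be turned into synthetic microphone/accelerometer pairs, which is how a small measured dataset can be stretched to train the fusion network.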
Towards Bone-Conducted Vibration Speech Enhancement on Head-Mounted Wearables