ABSTRACT
Head-mounted wearables (HMWs) are rapidly growing in popularity. However, a gap remains in providing robust voice-related applications, such as conversation or voice command control, in complex acoustic environments with competing speakers and strong noise. The compact design of HMWs poses non-trivial challenges to existing speech enhancement systems that rely on microphone recordings alone. In this paper, we address this problem by exploiting bone vibration conducted through the skull: accelerometers, already widely installed on HMWs, can capture the user's voice largely free of airborne noise. We develop VibVoice, a lightweight multi-modal speech enhancement system for head-mounted wearables. We design a two-branch encoder-decoder deep neural network that fuses high-level features of the two modalities to reconstruct clean speech. To address the scarcity of paired training data, we extensively measure the bone conduction effect on a limited dataset and extract its physical impulse response for cross-modal data augmentation. We evaluate VibVoice on a dataset collected in the real world and compare it with two state-of-the-art baselines. Results show that VibVoice achieves up to 21% higher PESQ and up to 26% higher SNR than the baselines while requiring 72 times less paired data. In a user study with 35 participants, 87% of participants preferred VibVoice over the baseline. In addition, VibVoice requires 4 to 31 times less execution time than the baselines on mobile devices. Demo audio of VibVoice is available at https://www.youtube.com/watch?v=8_-s_C_NGRI.
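The cross-modal augmentation idea described above — convolving clean air-conducted speech with a bone-conduction impulse response to synthesize paired accelerometer data — can be sketched as follows. This is a minimal illustration only: the exponentially decaying impulse response, the sample rate, and the white-noise "speech" stand-in are assumptions for demonstration, not the measured impulse functions or data from the paper.

```python
import numpy as np

def bone_conduction_transfer(speech, impulse_response):
    """Synthesize a paired vibration signal by convolving air-conducted
    speech with a bone-conduction impulse response (same length as input)."""
    return np.convolve(speech, impulse_response, mode="full")[: len(speech)]

fs = 16000                          # sample rate in Hz (assumed)
t = np.arange(64) / fs
ir = np.exp(-t * 2000)              # hypothetical decaying kernel: acts as a
ir /= ir.sum()                      # low-pass filter with unit DC gain,
                                    # mimicking the skull's high-frequency loss

rng = np.random.default_rng(0)
speech = rng.standard_normal(fs)    # 1 s of white noise as a stand-in signal
vib = bone_conduction_transfer(speech, ir)

# The synthesized vibration keeps low-frequency energy but attenuates the
# upper half of the spectrum relative to the input.
spec_in = np.abs(np.fft.rfft(speech))
spec_out = np.abs(np.fft.rfft(vib))
hi = slice(len(spec_in) // 2, None)
print(spec_out[hi].sum() < spec_in[hi].sum())
```

In this scheme, abundant unpaired speech corpora can be turned into synthetic microphone/accelerometer pairs, which is how a small measured dataset can be stretched to train the fusion network.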
Towards Bone-Conducted Vibration Speech Enhancement on Head-Mounted Wearables