
CELIP: Ultrasonic-based Lip Reading with Channel Estimation Approach for Virtual Reality Systems

Published: 24 September 2021
DOI: 10.1145/3460418.3480163

Abstract

We developed an ultrasonic-based silent speech interface for Virtual Reality (VR). As more and more customized devices are proposed to enhance the immersion of VR, our system improves the interaction between users and these systems while remaining compatible with such devices and avoiding several limitations of traditional speech recognition. By applying channel estimation techniques to ultrasonic waves, we derive movement characteristics of users' lips, which are used to fine-tune existing speech recognition models and can be augmented with large open-source speech datasets. Moreover, we use the speech interface to guide the initialization of customized models for new users, so that they can easily gain access to our system. A two-stage experiment shows that our system achieves 90.8% command-level accuracy and a 1.3% word error rate on sentence-level recognition.
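The channel-estimation step the abstract alludes to is commonly realized by emitting a known ultrasonic training sequence from a speaker and cross-correlating each received microphone frame with it, yielding a channel impulse response (CIR) whose frame-to-frame variation encodes nearby motion such as lip movement. The sketch below illustrates that general idea; it is a minimal sketch under stated assumptions, not the authors' implementation: the Zadoff-Chu training sequence, its length and root, and the synthetic two-path channel are all hypothetical choices.

```python
import numpy as np

N_ZC, ROOT = 127, 25   # assumed training-sequence length and Zadoff-Chu root

def zadoff_chu(root: int, n: int) -> np.ndarray:
    """Constant-amplitude sequence with ideal periodic autocorrelation."""
    k = np.arange(n)
    return np.exp(-1j * np.pi * root * k * (k + 1) / n)

def estimate_cir(frame: np.ndarray, train: np.ndarray) -> np.ndarray:
    """Estimate the channel impulse response of one received frame by
    circular cross-correlation with the known training sequence,
    computed in the frequency domain (O(N log N) per frame)."""
    n = len(train)
    return np.fft.ifft(np.fft.fft(frame, n) * np.conj(np.fft.fft(train, n)))

# Toy usage with a synthetic two-path channel: a direct path plus a weaker
# "lip echo". As the lips move, the echo tap's delay and amplitude change
# from frame to frame; those changes are the movement features a recognizer
# could be trained on.
train = zadoff_chu(ROOT, N_ZC)
rx = 1.0 * np.roll(train, 3) + 0.4 * np.roll(train, 9)  # direct + echo
cir = np.abs(estimate_cir(rx, train))
print("strongest CIR taps:", np.sort(np.argsort(cir)[-2:]))  # -> [3 9]
```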



Information

Published In

cover image ACM Conferences
UbiComp/ISWC '21 Adjunct: Adjunct Proceedings of the 2021 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2021 ACM International Symposium on Wearable Computers
September 2021
711 pages
ISBN: 9781450384612
DOI: 10.1145/3460418
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Acoustic Sensing
  2. Silent Speech Interface
  3. Virtual Reality

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • Startup Fund for Youngman Research at SJTU
  • Program of Shanghai Academic Research Leader
  • NSFC

Conference

UbiComp '21

Acceptance Rates

Overall Acceptance Rate 764 of 2,912 submissions, 26%


Cited By

  • A Comprehensive Survey of Side-Channel Sound-Sensing Methods. IEEE Internet of Things Journal 12, 2 (2025), 1554–1578. https://doi.org/10.1109/JIOT.2024.3501334
  • Robust Dual-Modal Speech Keyword Spotting for XR Headsets. IEEE Transactions on Visualization and Computer Graphics 30, 5 (2024), 2507–2516. https://doi.org/10.1109/TVCG.2024.3372092
  • Microwave Speech Recognizer Empowered by a Programmable Metasurface. Advanced Science 11, 17 (2024). https://doi.org/10.1002/advs.202309826
  • HPSpeech: Silent Speech Interface for Commodity Headphones. In Proceedings of the 2023 ACM International Symposium on Wearable Computers, 60–65. https://doi.org/10.1145/3594738.3611365
  • EchoSpeech: Continuous Silent Speech Recognition on Minimally-obtrusive Eyewear Powered by Acoustic Sensing. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, 1–18. https://doi.org/10.1145/3544548.3580801
