
CELIP: Ultrasonic-based Lip Reading with Channel Estimation Approach for Virtual Reality Systems

Published: 24 September 2021
DOI: 10.1145/3460418.3480163

Abstract

We developed an ultrasonic-based silent speech interface for Virtual Reality (VR). As more and more customized devices are proposed to enhance the immersion of VR, our system improves the interaction between users and these systems while remaining compatible with such devices and avoiding several limitations of traditional speech recognition. By applying channel estimation techniques to ultrasonic waves, we derive movement characteristics of users' lips, which are used to fine-tune existing speech recognition models and can be augmented with large open-source speech datasets. Moreover, we use the speech interface to guide the initialization of customized models for new users, so that they can easily gain access to our system. A two-stage experiment shows that our system achieves 90.8% command-level accuracy and a 1.3% word error rate on sentence-level recognition.
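The channel-estimation step the abstract alludes to is commonly realized by emitting a known ultrasonic training sequence from a speaker and cross-correlating each received microphone frame with it, yielding a channel impulse response (CIR) whose frame-to-frame variation encodes nearby motion such as lip movement. The sketch below illustrates that general idea; it is a minimal sketch under stated assumptions, not the authors' implementation: the Zadoff-Chu training sequence, its length and root, and the synthetic two-path channel are all hypothetical choices.

```python
import numpy as np

N_ZC, ROOT = 127, 25   # assumed training-sequence length and Zadoff-Chu root

def zadoff_chu(root: int, n: int) -> np.ndarray:
    """Constant-amplitude sequence with ideal periodic autocorrelation."""
    k = np.arange(n)
    return np.exp(-1j * np.pi * root * k * (k + 1) / n)

def estimate_cir(frame: np.ndarray, train: np.ndarray) -> np.ndarray:
    """Estimate the channel impulse response of one received frame by
    circular cross-correlation with the known training sequence,
    computed in the frequency domain (O(N log N) per frame)."""
    n = len(train)
    return np.fft.ifft(np.fft.fft(frame, n) * np.conj(np.fft.fft(train, n)))

# Toy usage with a synthetic two-path channel: a direct path plus a weaker
# "lip echo". As the lips move, the echo tap's delay and amplitude change
# from frame to frame; those changes are the movement features a recognizer
# could be trained on.
train = zadoff_chu(ROOT, N_ZC)
rx = 1.0 * np.roll(train, 3) + 0.4 * np.roll(train, 9)  # direct + echo
cir = np.abs(estimate_cir(rx, train))
print("strongest CIR taps:", np.sort(np.argsort(cir)[-2:]))  # -> [3 9]
```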



Information

Published In

cover image ACM Conferences
UbiComp/ISWC '21 Adjunct: Adjunct Proceedings of the 2021 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2021 ACM International Symposium on Wearable Computers
September 2021
711 pages
ISBN: 9781450384612
DOI: 10.1145/3460418
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Acoustic Sensing
  2. Silent Speech Interface
  3. Virtual Reality

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • Startup Fund for Youngman Research at SJTU
  • Program of Shanghai Academic Research Leader
  • NSFC

Conference

UbiComp '21

Acceptance Rates

Overall Acceptance Rate 764 of 2,912 submissions, 26%


Cited By

  • A Comprehensive Survey of Side-Channel Sound-Sensing Methods. IEEE Internet of Things Journal 12, 2 (2025), 1554–1578. https://doi.org/10.1109/JIOT.2024.3501334
  • Robust Dual-Modal Speech Keyword Spotting for XR Headsets. IEEE Transactions on Visualization and Computer Graphics 30, 5 (2024), 2507–2516. https://doi.org/10.1109/TVCG.2024.3372092
  • Microwave Speech Recognizer Empowered by a Programmable Metasurface. Advanced Science 11, 17 (2024). https://doi.org/10.1002/advs.202309826
  • HPSpeech: Silent Speech Interface for Commodity Headphones. In Proceedings of the 2023 ACM International Symposium on Wearable Computers, 60–65. https://doi.org/10.1145/3594738.3611365
  • EchoSpeech: Continuous Silent Speech Recognition on Minimally-obtrusive Eyewear Powered by Acoustic Sensing. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, 1–18. https://doi.org/10.1145/3544548.3580801
