ABSTRACT
As eye-tracking technologies mature, gaze is becoming an increasingly popular input modality. However, its limited accuracy makes gaze difficult to use in situations that demand fast and precise object selection. We present Gaze+Lip, a hands-free interface that combines gaze with lip reading to enable rapid and precise remote control of large displays. Gaze+Lip uses gaze for target selection and silent speech for command execution, keeping commands accurate and reliable even in noisy scenarios such as watching TV or playing videos on a computer. For evaluation, we implemented the system on a TV and conducted an experiment comparing our method with dwell-based gaze-only input. Results showed that Gaze+Lip outperformed the gaze-only approach in both accuracy and input speed. Furthermore, subjective evaluations indicated that Gaze+Lip is easy to understand, easy to use, and perceived as faster than the gaze-only approach.
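The abstract does not include an implementation, but the division of labor it describes (gaze resolves *which* control is addressed, a silently mouthed word resolves *what to do* with it) is easy to sketch. Below is a minimal Python sketch of that dispatch logic, under stated assumptions: `Target`, `gaze_target`, `dispatch`, and the stub `recognize_command` classifier are hypothetical names introduced here for illustration only, and a real system would replace the stub with a visual speech recognizer (e.g., a LipNet-style network) run over mouth-region video frames.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Target:
    """A selectable on-screen control (e.g., a TV button) with its bounding box."""
    name: str
    x: int
    y: int
    w: int
    h: int

    def contains(self, px: float, py: float) -> bool:
        return self.x <= px <= self.x + self.w and self.y <= py <= self.y + self.h


def gaze_target(gaze_point: Tuple[float, float],
                targets: List[Target]) -> Optional[Target]:
    """Map the current gaze point to the target it falls on, if any."""
    for t in targets:
        if t.contains(*gaze_point):
            return t
    return None


def dispatch(gaze_point: Tuple[float, float],
             lip_frames: List,
             targets: List[Target],
             recognize_command: Callable[[List], str],
             actions: Dict[str, Callable[[Target], str]]) -> str:
    """Gaze picks the object; the silently mouthed command picks the action.

    recognize_command stands in for a lip-reading classifier over a short
    clip of mouth-region frames; everything here is an illustrative stub.
    """
    target = gaze_target(gaze_point, targets)
    if target is None:
        return "no target under gaze; utterance ignored"
    command = recognize_command(lip_frames)
    handler = actions.get(command)
    if handler is None:
        return f"unrecognized command {command!r}"
    return handler(target)


if __name__ == "__main__":
    targets = [Target("volume_slider", 0, 0, 200, 50),
               Target("play_button", 300, 0, 100, 50)]
    actions = {
        "up":   lambda t: f"increase {t.name}",
        "play": lambda t: f"activate {t.name}",
    }
    # Stub classifier: a real system would run a lip-reading network here.
    fake_recognizer = lambda frames: "up"
    print(dispatch((50, 25), ["frame0", "frame1"], targets,
                   fake_recognizer, actions))  # -> "increase volume_slider"
```

One design note this sketch makes visible: the recognized command itself serves as the confirmation trigger, so no dwell timeout is needed, which is plausibly why the paper reports higher input speed than the dwell-based gaze-only baseline.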