ABSTRACT
As eye-tracking technologies mature, gaze is becoming an increasingly popular input modality. However, its limited accuracy makes gaze difficult to use in situations that demand fast and precise object selection. We present Gaze+Lip, a hands-free interface that combines gaze with lip reading to enable rapid and precise remote control of large displays. Gaze+Lip uses gaze for target selection and silent speech for command execution, keeping commands accurate and reliable even in noisy scenarios such as watching TV or playing videos on a computer. For evaluation, we implemented the system on a TV and conducted an experiment comparing our method with dwell-based gaze-only input. Results showed that Gaze+Lip outperformed the gaze-only approach in both accuracy and input speed. Furthermore, subjective evaluations indicated that Gaze+Lip is easy to understand, easy to use, and perceived as faster than the gaze-only approach.
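The abstract does not include an implementation, but the division of labor it describes (gaze resolves *which* control is addressed, a silently mouthed word resolves *what to do* with it) is easy to sketch. Below is a minimal Python sketch of that dispatch logic, under stated assumptions: `Target`, `gaze_target`, `dispatch`, and the stub `recognize_command` classifier are hypothetical names introduced here for illustration only, and a real system would replace the stub with a visual speech recognizer (e.g., a LipNet-style network) run over mouth-region video frames.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Target:
    """A selectable on-screen control (e.g., a TV button) with its bounding box."""
    name: str
    x: int
    y: int
    w: int
    h: int

    def contains(self, px: float, py: float) -> bool:
        return self.x <= px <= self.x + self.w and self.y <= py <= self.y + self.h


def gaze_target(gaze_point: Tuple[float, float],
                targets: List[Target]) -> Optional[Target]:
    """Map the current gaze point to the target it falls on, if any."""
    for t in targets:
        if t.contains(*gaze_point):
            return t
    return None


def dispatch(gaze_point: Tuple[float, float],
             lip_frames: List,
             targets: List[Target],
             recognize_command: Callable[[List], str],
             actions: Dict[str, Callable[[Target], str]]) -> str:
    """Gaze picks the object; the silently mouthed command picks the action.

    recognize_command stands in for a lip-reading classifier over a short
    clip of mouth-region frames; everything here is an illustrative stub.
    """
    target = gaze_target(gaze_point, targets)
    if target is None:
        return "no target under gaze; utterance ignored"
    command = recognize_command(lip_frames)
    handler = actions.get(command)
    if handler is None:
        return f"unrecognized command {command!r}"
    return handler(target)


if __name__ == "__main__":
    targets = [Target("volume_slider", 0, 0, 200, 50),
               Target("play_button", 300, 0, 100, 50)]
    actions = {
        "up":   lambda t: f"increase {t.name}",
        "play": lambda t: f"activate {t.name}",
    }
    # Stub classifier: a real system would run a lip-reading network here.
    fake_recognizer = lambda frames: "up"
    print(dispatch((50, 25), ["frame0", "frame1"], targets,
                   fake_recognizer, actions))  # -> "increase volume_slider"
```

One design note this sketch makes visible: the recognized command itself serves as the confirmation trigger, so no dwell timeout is needed, which is plausibly why the paper reports higher input speed than the dwell-based gaze-only baseline.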