DOI: 10.1145/3664647.3688987

Less is More: Adaptive Feature Selection and Fusion for Eye Contact Detection

Published: 28 October 2024

Abstract

Detecting eye contact is essential for embodied robots to engage in natural interactions with humans, enhancing the intuitiveness and comfort of these exchanges. However, eye contact detection remains challenging due to factors such as low contrast and various forms of occlusion. Existing methods employ convolutional neural networks (CNNs) or Transformers to learn discriminative representations, but they usually ignore the influence of noisy or less relevant regions in facial images. To address this gap, we propose the deep feature selection and fusion network (FSFNet) for eye contact detection in multi-party conversations. Our method adaptively selects fine-grained visual features and reduces the impact of irrelevant ones. Specifically, we present a local feature selection scheme that leverages attention scores to progressively concentrate on the most informative features. By integrating the selected features into the multi-head self-attention module, we retain the favorable properties of Transformers while reducing the overall computational cost. We evaluate the proposed method on the official eye contact detection datasets, where it achieves promising results of 0.8174 and 0.79 on the validation and test sets, respectively. The source code is publicly available at https://github.com/ma-hnu/FSFNet.
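
To make the mechanism concrete, the sketch below illustrates attention-score-based token selection in PyTorch. It is a minimal illustration of the general idea described in the abstract, not the authors' FSFNet implementation (see the linked repository for that); the module name AttentiveTokenSelector, the 50% keep ratio, and the use of the [CLS] token's attention as the saliency signal are all assumptions made for exposition.

# Minimal PyTorch sketch of attention-score-based token selection.
# NOT the authors' FSFNet code: the module name, the keep ratio, and the
# choice of the [CLS] token's attention as the saliency signal are
# illustrative assumptions.
import torch
import torch.nn as nn


class AttentiveTokenSelector(nn.Module):
    """Self-attention block that keeps only the most-attended patch tokens."""

    def __init__(self, dim: int = 256, num_heads: int = 8, keep_ratio: float = 0.5):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.keep_ratio = keep_ratio

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1 + N, dim), where token 0 is a [CLS]-style summary token.
        out, weights = self.attn(x, x, x, need_weights=True)  # weights: (B, 1+N, 1+N)

        # Saliency of each patch token = attention it receives from [CLS].
        saliency = weights[:, 0, 1:]                          # (B, N)
        k = max(1, int(self.keep_ratio * saliency.size(1)))
        top_idx = saliency.topk(k, dim=1).indices             # (B, k)

        # Gather the k most informative patch tokens; always keep [CLS].
        patches = out[:, 1:, :]
        gather_idx = top_idx.unsqueeze(-1).expand(-1, -1, patches.size(-1))
        selected = patches.gather(1, gather_idx)              # (B, k, dim)
        return torch.cat([out[:, :1, :], selected], dim=1)    # (B, 1+k, dim)


if __name__ == "__main__":
    tokens = torch.randn(2, 1 + 196, 256)   # ViT-style 14x14 patch grid + [CLS]
    print(AttentiveTokenSelector()(tokens).shape)  # torch.Size([2, 99, 256])

Because subsequent Transformer blocks then operate on 1 + k tokens instead of 1 + N, the quadratic cost of self-attention shrinks accordingly, which is the computational saving the abstract refers to.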


Cited By

  • MultiMediate'24: Multi-Domain Engagement Estimation. In Proceedings of the 32nd ACM International Conference on Multimedia (2024), 11377-11382. https://doi.org/10.1145/3664647.3689004

Published In

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
October 2024
11719 pages
ISBN: 9798400706868
DOI: 10.1145/3664647

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. eye contact detection
  2. feature selection and fusion
  3. multi-party conversation
  4. transformer

Qualifiers

  • Research-article

Funding Sources

  • Hunan Provincial Natural Science Foundation of China
  • National Natural Science Foundation of China

Conference

MM '24: The 32nd ACM International Conference on Multimedia
October 28 - November 1, 2024
Melbourne, VIC, Australia

Acceptance Rates

MM '24 paper acceptance rate: 1,150 of 4,385 submissions (26%)
Overall acceptance rate: 2,145 of 8,556 submissions (25%)

Article Metrics

  • Downloads (last 12 months): 73
  • Downloads (last 6 weeks): 21

Reflects downloads up to 25 Feb 2025

