DOI: 10.1145/3664647.3688987

Less is More: Adaptive Feature Selection and Fusion for Eye Contact Detection

Published: 28 October 2024

Abstract

Detecting eye contact is essential for embodied robots to engage in natural interactions with humans, enhancing the intuitiveness and comfort of these exchanges. However, eye contact detection remains challenging due to factors such as low contrast and various forms of occlusion. Existing methods employ convolutional neural networks (CNNs) or Transformers to learn discriminative representations, but they usually ignore the influence of noisy or less relevant regions in facial images. To address this gap, we propose the deep feature selection and fusion network (FSFNet) for eye contact detection in multi-party conversations. Our method adaptively selects fine-grained visual features and reduces the impact of irrelevant ones. Specifically, we present a local feature selection scheme that leverages attention scores to progressively concentrate on the most informative features. By integrating the selected features into the multi-head self-attention module, we retain the favorable properties of Transformers while reducing the overall computational cost. We evaluate the proposed method on the official eye contact detection datasets, where it achieves promising results of 0.8174 and 0.79 on the validation and test sets, respectively. The source code is publicly available at https://github.com/ma-hnu/FSFNet.
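
To make the mechanism concrete, the sketch below illustrates attention-score-based token selection in PyTorch. It is a minimal illustration of the general idea described in the abstract, not the authors' FSFNet implementation (see the linked repository for that); the module name AttentiveTokenSelector, the 50% keep ratio, and the use of the [CLS] token's attention as the saliency signal are all assumptions made for exposition.

# Minimal PyTorch sketch of attention-score-based token selection.
# NOT the authors' FSFNet code: the module name, the keep ratio, and the
# choice of the [CLS] token's attention as the saliency signal are
# illustrative assumptions.
import torch
import torch.nn as nn


class AttentiveTokenSelector(nn.Module):
    """Self-attention block that keeps only the most-attended patch tokens."""

    def __init__(self, dim: int = 256, num_heads: int = 8, keep_ratio: float = 0.5):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.keep_ratio = keep_ratio

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1 + N, dim), where token 0 is a [CLS]-style summary token.
        out, weights = self.attn(x, x, x, need_weights=True)  # weights: (B, 1+N, 1+N)

        # Saliency of each patch token = attention it receives from [CLS].
        saliency = weights[:, 0, 1:]                          # (B, N)
        k = max(1, int(self.keep_ratio * saliency.size(1)))
        top_idx = saliency.topk(k, dim=1).indices             # (B, k)

        # Gather the k most informative patch tokens; always keep [CLS].
        patches = out[:, 1:, :]
        gather_idx = top_idx.unsqueeze(-1).expand(-1, -1, patches.size(-1))
        selected = patches.gather(1, gather_idx)              # (B, k, dim)
        return torch.cat([out[:, :1, :], selected], dim=1)    # (B, 1+k, dim)


if __name__ == "__main__":
    tokens = torch.randn(2, 1 + 196, 256)   # ViT-style 14x14 patch grid + [CLS]
    print(AttentiveTokenSelector()(tokens).shape)  # torch.Size([2, 99, 256])

Because subsequent Transformer blocks then operate on 1 + k tokens instead of 1 + N, the quadratic cost of self-attention shrinks accordingly, which is the computational saving the abstract refers to.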


Cited By

  • MultiMediate'24: Multi-Domain Engagement Estimation. In Proceedings of the 32nd ACM International Conference on Multimedia (2024), 11377-11382. https://doi.org/10.1145/3664647.3689004

Published In

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
October 2024
11719 pages
ISBN: 9798400706868
DOI: 10.1145/3664647

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. eye contact detection
  2. feature selection and fusion
  3. multi-party conversation
  4. transformer

Qualifiers

  • Research-article

Funding Sources

  • Hunan Provincial Natural Science Foundation of China
  • National Natural Science Foundation of China

Conference

MM '24: The 32nd ACM International Conference on Multimedia
October 28 - November 1, 2024
Melbourne, VIC, Australia

Acceptance Rates

MM '24 paper acceptance rate: 1,150 of 4,385 submissions (26%)
Overall acceptance rate: 2,145 of 8,556 submissions (25%)

Article Metrics

  • Downloads (last 12 months): 73
  • Downloads (last 6 weeks): 21

Reflects downloads up to 25 Feb 2025

