Short paper
DOI: 10.1145/3434074.3447157

Improving Users Engagement Detection using End-to-End Spatio-Temporal Convolutional Neural Networks

Published: 08 March 2021

Abstract

The ability to infer latent behaviours, such as the degree of engagement of humans interacting with social robots, is still considered a challenging task in the human-robot interaction (HRI) field. Data-driven techniques based on machine learning have recently been shown to be a promising approach to the users' engagement detection problem; however, they typically solve it in multiple consecutive stages. This in turn makes these techniques either incapable of capturing users' engagement, especially in a dynamic environment, or undeployable because they cannot track engagement in real-time. In this study, building on a data-driven framework, we propose an end-to-end technique based on a unique 3D convolutional neural network architecture. Our proposed framework was trained and evaluated on a real-life dataset of users interacting spontaneously with a social robot in a dynamic environment. Compared against three baseline approaches from the literature, the framework showed promising results over three different evaluation metrics, with an F1-score of 76.72. Additionally, our framework achieved robust real-time performance at 25 Hz.
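
The paper itself provides no code; below is a minimal PyTorch sketch of what an end-to-end spatio-temporal (3D) CNN for clip-level engagement detection could look like. The class name Engagement3DCNN, the layer widths, the 16-frame clip length, and the 112x112 input resolution are illustrative assumptions, not the authors' architecture.

# Minimal sketch of an end-to-end spatio-temporal (3D) CNN for binary
# engagement detection from short video clips. NOT the authors' architecture;
# layer sizes, clip length, and input resolution are illustrative assumptions.
import torch
import torch.nn as nn

class Engagement3DCNN(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        # Each 3D convolution filters jointly over time (frames) and space (H, W).
        self.features = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1),
            nn.BatchNorm3d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),   # pool only spatially at first
            nn.Conv3d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=2),           # pool over time and space
            nn.Conv3d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm3d(128),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),               # global spatio-temporal pooling
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, channels=3, frames, height, width)
        x = self.features(clips)
        return self.classifier(torch.flatten(x, 1))

if __name__ == "__main__":
    model = Engagement3DCNN().eval()
    # Hypothetical input: one 16-frame RGB clip at 112x112 resolution.
    clip = torch.randn(1, 3, 16, 112, 112)
    with torch.no_grad():
        logits = model(clip)
    engaged_prob = torch.softmax(logits, dim=1)[0, 1].item()
    print(f"engaged probability: {engaged_prob:.3f}")

Because the whole pipeline is a single network mapping raw clips directly to an engagement label, per-clip inference reduces to one forward pass; a compact design along these lines is one plausible way to reach the 25 Hz real-time rate reported in the abstract, though the actual figure depends on hardware and the authors' exact architecture.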


Cited By

  • (2025) Multimodal Engagement Prediction in Human-Robot Interaction Using Transformer Neural Networks. MultiMedia Modeling, pp. 3-17. DOI: 10.1007/978-981-96-2074-6_1. Online publication date: 1-Jan-2025.
  • (2022) Real-time Architecture for Audio-Visual Active Speaker Detection. 2022 IEEE International Conference on Robotics and Biomimetics (ROBIO), pp. 1377-1382. DOI: 10.1109/ROBIO55434.2022.10011692. Online publication date: 5-Dec-2022.
  • (2022) Engagement estimation of the elderly from wild multiparty human-robot interaction. Computer Animation and Virtual Worlds, 33(6). DOI: 10.1002/cav.2120. Online publication date: 18-Aug-2022.

    Information

    Published In

    HRI '21 Companion: Companion of the 2021 ACM/IEEE International Conference on Human-Robot Interaction
    March 2021
    756 pages
    ISBN: 9781450382908
    DOI: 10.1145/3434074
    General Chairs: Cindy Bethel, Ana Paiva
    Program Chairs: Elizabeth Broadbent, David Feil-Seifer, Daniel Szafir


    Publisher

    Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. 3D convnet
    2. engagement prediction
    3. spatio-temporal modelling

    Qualifiers

    • Short-paper

    Conference

    HRI '21

    Acceptance Rates

    Overall Acceptance Rate 192 of 519 submissions, 37%
