Short paper
DOI: 10.1145/3434074.3447157

Improving Users Engagement Detection using End-to-End Spatio-Temporal Convolutional Neural Networks

Published: 08 March 2021

Abstract

The ability to infer latent behaviours, such as the degree of engagement of humans interacting with social robots, is still considered a challenging task in the human-robot interaction (HRI) field. Data-driven techniques based on machine learning have recently been shown to be a promising approach to the users' engagement detection problem; however, they typically solve it in multiple consecutive stages. This in turn makes these techniques either incapable of capturing users' engagement, especially in a dynamic environment, or undeployable because they cannot track engagement in real-time. In this study, building on a data-driven framework, we propose an end-to-end technique based on a unique 3D convolutional neural network architecture. Our proposed framework was trained and evaluated on a real-life dataset of users interacting spontaneously with a social robot in a dynamic environment. Compared against three baseline approaches from the literature, the framework showed promising results over three different evaluation metrics, with an F1-score of 76.72. Additionally, our framework achieved robust real-time performance at 25 Hz.
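
The paper itself provides no code; below is a minimal PyTorch sketch of what an end-to-end spatio-temporal (3D) CNN for clip-level engagement detection could look like. The class name Engagement3DCNN, the layer widths, the 16-frame clip length, and the 112x112 input resolution are illustrative assumptions, not the authors' architecture.

# Minimal sketch of an end-to-end spatio-temporal (3D) CNN for binary
# engagement detection from short video clips. NOT the authors' architecture;
# layer sizes, clip length, and input resolution are illustrative assumptions.
import torch
import torch.nn as nn

class Engagement3DCNN(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        # Each 3D convolution filters jointly over time (frames) and space (H, W).
        self.features = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1),
            nn.BatchNorm3d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),   # pool only spatially at first
            nn.Conv3d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=2),           # pool over time and space
            nn.Conv3d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm3d(128),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),               # global spatio-temporal pooling
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, channels=3, frames, height, width)
        x = self.features(clips)
        return self.classifier(torch.flatten(x, 1))

if __name__ == "__main__":
    model = Engagement3DCNN().eval()
    # Hypothetical input: one 16-frame RGB clip at 112x112 resolution.
    clip = torch.randn(1, 3, 16, 112, 112)
    with torch.no_grad():
        logits = model(clip)
    engaged_prob = torch.softmax(logits, dim=1)[0, 1].item()
    print(f"engaged probability: {engaged_prob:.3f}")

Because the whole pipeline is a single network mapping raw clips directly to an engagement label, per-clip inference reduces to one forward pass; a compact design along these lines is one plausible way to reach the 25 Hz real-time rate reported in the abstract, though the actual figure depends on hardware and the authors' exact architecture.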


Cited By

  • (2025) Multimodal Engagement Prediction in Human-Robot Interaction Using Transformer Neural Networks. MultiMedia Modeling, pp. 3-17. DOI: 10.1007/978-981-96-2074-6_1. Online publication date: 1-Jan-2025.
  • (2022) Real-time Architecture for Audio-Visual Active Speaker Detection. 2022 IEEE International Conference on Robotics and Biomimetics (ROBIO), pp. 1377-1382. DOI: 10.1109/ROBIO55434.2022.10011692. Online publication date: 5-Dec-2022.
  • (2022) Engagement estimation of the elderly from wild multiparty human-robot interaction. Computer Animation and Virtual Worlds, 33(6). DOI: 10.1002/cav.2120. Online publication date: 18-Aug-2022.

    Information

    Published In

    HRI '21 Companion: Companion of the 2021 ACM/IEEE International Conference on Human-Robot Interaction
    March 2021
    756 pages
    ISBN: 9781450382908
    DOI: 10.1145/3434074
    General Chairs: Cindy Bethel, Ana Paiva
    Program Chairs: Elizabeth Broadbent, David Feil-Seifer, Daniel Szafir


    Publisher

    Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. 3D convnet
    2. engagement prediction
    3. spatio-temporal modelling

    Qualifiers

    • Short-paper

    Conference

    HRI '21

    Acceptance Rates

    Overall Acceptance Rate 192 of 519 submissions, 37%
