DOI: 10.1145/3503161.3551587

TA-CNN: A Unified Network for Human Behavior Analysis in Multi-Person Conversations

Published: 10 October 2022

Abstract

Human behavior analysis in multi-person conversations is one of the most important research problems for natural human-robot interaction. However, previous datasets and studies mainly focus on single-person behavior analysis and therefore can hardly generalize to real-world application scenarios. Fortunately, the MultiMediate'22 Challenge provides video clips of multi-party conversations for two sub-challenges: eye contact detection and next speaker prediction. In this paper, we present a unified network named TA-CNN that addresses both sub-challenges. TA-CNN not only models the spatio-temporal dependencies required for eye contact detection but also captures group-level discriminative features for multi-label next speaker prediction. We empirically evaluate our method on the officially provided datasets. It achieves state-of-the-art results on the corresponding test sets: an accuracy of 0.7261 for eye contact detection and a UAR (unweighted average recall) of 0.5965 for next speaker prediction.
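
The full paper is not reproduced on this page, so the following is only a minimal PyTorch-style sketch of the kind of unified design the abstract describes: a shared per-frame CNN encoder, a temporal module for eye contact detection, and a group-level aggregation head for multi-label next speaker prediction. All class names, layer choices, label spaces, and tensor shapes are illustrative assumptions, not the authors' TA-CNN implementation.

```python
import torch
import torch.nn as nn


class UnifiedBehaviorNet(nn.Module):
    """Hypothetical sketch of a unified model for the two sub-challenges."""

    def __init__(self, num_persons: int = 4, feat_dim: int = 128, num_gaze_targets: int = 4):
        super().__init__()
        # Shared per-frame CNN encoder (a real system would likely use a
        # deeper, pretrained backbone; this tiny stack is just a placeholder).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Temporal module: aggregates spatio-temporal dependencies within a clip.
        self.temporal = nn.GRU(feat_dim, feat_dim, batch_first=True)
        # Eye contact head: per-person logits over possible gaze targets
        # (the exact label space is an assumption here).
        self.eye_contact_head = nn.Linear(feat_dim, num_gaze_targets)
        # Group-level head: multi-label next speaker logits, one per participant.
        self.speaker_head = nn.Linear(feat_dim * num_persons, num_persons)

    def forward(self, clips: torch.Tensor):
        # clips: (batch, persons, time, 3, H, W) -- one face-crop sequence per participant.
        b, p, t, c, h, w = clips.shape
        frame_feats = self.backbone(clips.reshape(b * p * t, c, h, w))
        frame_feats = frame_feats.reshape(b * p, t, -1)
        _, last_hidden = self.temporal(frame_feats)                 # (1, b*p, feat_dim)
        person_feats = last_hidden.squeeze(0).reshape(b, p, -1)
        eye_contact_logits = self.eye_contact_head(person_feats)    # (b, p, num_gaze_targets)
        group_feat = person_feats.reshape(b, -1)                    # concatenate all participants
        next_speaker_logits = self.speaker_head(group_feat)         # (b, num_persons), multi-label
        return eye_contact_logits, next_speaker_logits


# Example usage with dummy data: 2 clips, 4 participants, 8 frames of 64x64 face crops.
model = UnifiedBehaviorNet()
eye_logits, speaker_logits = model(torch.randn(2, 4, 8, 3, 64, 64))
```

The point this sketch illustrates is the "unified" aspect of the abstract: one shared backbone serves both sub-challenges, and only the task heads differ. The multi-label next speaker output would typically be trained with a per-participant sigmoid (e.g. BCEWithLogitsLoss) rather than a softmax.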

Supplementary Material

MP4 File (MM22-mmgc26.mp4)
Presentation video


Cited By

  • Less is More: Adaptive Feature Selection and Fusion for Eye Contact Detection. Proceedings of the 32nd ACM International Conference on Multimedia, 11390-11396. https://doi.org/10.1145/3664647.3688987 (28 October 2024)
  • MultiMediate '23: Engagement Estimation and Bodily Behaviour Recognition in Social Interactions. Proceedings of the 31st ACM International Conference on Multimedia, 9640-9645. https://doi.org/10.1145/3581783.3613851 (26 October 2023)
  • Data Augmentation for Human Behavior Analysis in Multi-Person Conversations. Proceedings of the 31st ACM International Conference on Multimedia, 9516-9520. https://doi.org/10.1145/3581783.3612856 (26 October 2023)

Published In

MM '22: Proceedings of the 30th ACM International Conference on Multimedia
October 2022
7537 pages
ISBN: 9781450392037
DOI: 10.1145/3503161
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. deep learning
  2. eye contact detection
  3. multi-person conversation
  4. next speaker prediction

Qualifiers

  • Research-article

Funding Sources

  • the National Natural Science Fund of China
  • the Hunan Provincial Natural Science Foundation of China

Conference

MM '22

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
