research-article

Multi-scale Conformer Fusion Network for Multi-participant Behavior Analysis

Authors:

Shutao Li

MM '23: Proceedings of the 31st ACM International Conference on Multimedia

Pages 9472 - 9476

https://doi.org/10.1145/3581783.3612847

Published: 27 October 2023

Abstract

Understanding and interpreting human behavior across diverse scenarios is a pivotal research challenge in the pursuit of seamless human-computer interaction. However, previous research on multi-participant dialogues has mostly relied on proprietary datasets, which are neither standardized nor openly accessible. To advance this domain, the MultiMediate'23 Challenge presents two sub-challenges, eye contact detection and next speaker prediction, aiming to foster a comprehensive understanding of multi-participant behavior. To tackle these challenges, we propose a multi-scale conformer fusion network (MSCFN) that enhances the perception of multi-participant group behaviors. The Conformer block combines the strengths of transformers and convolutional networks to establish both global and local contextual relationships between sequences. The output features from all Conformer blocks are then concatenated to fuse multi-scale representations. We evaluated the proposed method on the officially provided dataset; it achieves the best performance in the next speaker prediction task and the second-best performance in the eye contact detection task of MultiMediate'23.
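The fusion idea described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' MSCFN implementation: the block below is a heavily simplified, Conformer-like layer (single-head attention with identity projections, one depthwise convolution, residual connections, no feed-forward modules, no normalization, no learned weights), and the names `conformer_like_block`, `multi_scale_fusion`, and the smoothing kernels are hypothetical. It only shows how attention supplies global context, convolution supplies local context, and the outputs of all stacked blocks are concatenated along the feature axis.

```python
import math
import random
from typing import List

Sequence = List[List[float]]  # shape: (time steps, feature dim)


def self_attention(x: Sequence) -> Sequence:
    """Single-head dot-product self-attention with identity projections (global context)."""
    dim = len(x[0])
    out = []
    for query in x:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in x]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([
            sum(w / total * value[d] for w, value in zip(weights, x))
            for d in range(dim)
        ])
    return out


def depthwise_conv(x: Sequence, kernel: List[float]) -> Sequence:
    """1-D convolution over time, applied per feature with zero padding (local context)."""
    half = len(kernel) // 2
    steps, dim = len(x), len(x[0])
    out = []
    for t in range(steps):
        row = []
        for d in range(dim):
            acc = 0.0
            for j, w in enumerate(kernel):
                src = t + j - half
                if 0 <= src < steps:  # positions outside the sequence count as zero
                    acc += w * x[src][d]
            row.append(acc)
        out.append(row)
    return out


def add(a: Sequence, b: Sequence) -> Sequence:
    """Element-wise sum, used for the residual connections."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def conformer_like_block(x: Sequence, kernel: List[float]) -> Sequence:
    """Simplified block (assumption): attention, then convolution, each with a residual."""
    x = add(x, self_attention(x))
    x = add(x, depthwise_conv(x, kernel))
    return x


def multi_scale_fusion(x: Sequence, kernels: List[List[float]]) -> Sequence:
    """Stack one block per kernel and concatenate every block's output per time step."""
    outputs = []
    for kernel in kernels:
        x = conformer_like_block(x, kernel)
        outputs.append(x)
    return [[v for out in outputs for v in out[t]] for t in range(len(x))]


random.seed(0)
frames = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(6)]  # 6 steps, 4 features
kernels = [[0.25, 0.5, 0.25]] * 3  # hypothetical fixed smoothing kernel for each of 3 blocks
fused = multi_scale_fusion(frames, kernels)
print(len(fused), len(fused[0]))  # 6 12
```

With three stacked blocks and four input features, each time step ends up with a 12-dimensional fused vector. In this sketch the "scales" arise only from depth, since later blocks have seen more rounds of attention and convolution; how the paper defines its scales is not stated in the abstract.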



Index Terms

1. Computing methodologies
  1. Artificial intelligence


Published In


MM '23: Proceedings of the 31st ACM International Conference on Multimedia

October 2023

9913 pages

ISBN:9798400701085

DOI:10.1145/3581783

General Chairs:
Abdulmotaleb El Saddik (University of Ottawa, Canada & MBZUAI, UAE)
Tao Mei (HiDream.ai, China)
Rita Cucchiara (University of Modena and Reggio Emilia, Italy)

Program Chairs:
Marco Bertini (University of Florence, Italy)
Diana Patricia Tobon Vallejo (Universidad de Medellin, Colombia)
Pradeep K. Atrey (University at Albany, State University of New York, USA)
M. Shamim Hossain (King Saud University, KSA)

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States


Funding Sources

National Natural Science Fund of China
Hunan Provincial Natural Science Foundation of China

Conference

MM '23

Sponsor:

SIGMM

MM '23: The 31st ACM International Conference on Multimedia

October 29 - November 3, 2023

Ottawa ON, Canada

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
