DOI: 10.1145/3581783.3612847

Multi-scale Conformer Fusion Network for Multi-participant Behavior Analysis

Published: 27 October 2023


Understanding and explaining human behavior across diverse scenarios is a pivotal research challenge in the pursuit of seamless human-computer interaction. However, previous research on multi-participant dialogues has mostly relied on proprietary datasets that are neither standardized nor openly accessible. To advance this domain, the MultiMediate'23 Challenge presents two sub-challenges, eye contact detection and next speaker prediction, which aim to foster a comprehensive understanding of multi-participant behavior. To tackle these challenges, we propose a multi-scale conformer fusion network (MSCFN) that enhances the perception of multi-participant group behaviors. Each conformer block combines the strengths of transformers and convolutional networks to establish both global and local contextual relationships between sequences. The output features of all conformer blocks are then concatenated to fuse multi-scale representations. We evaluated the proposed method on the officially provided dataset, where it achieved the best and second-best performance in the next speaker prediction and eye contact detection tasks of MultiMediate'23, respectively.
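The fusion idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the layer sizes, the number of blocks, how the blocks are connected, or the input features, so everything below is an assumption chosen for illustration. The sketch stacks three simplified conformer blocks (half-step feed-forward, single-head self-attention, depthwise temporal convolution, half-step feed-forward, layer norm), uses hypothetical kernel sizes 3, 5 and 7, random untrained weights, and ReLU in place of the usual activation, then concatenates the output of every block along the feature axis.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    """Normalize each time step's feature vector to zero mean, unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def feed_forward(x, w1, w2):
    """Position-wise feed-forward module (ReLU is a simplification)."""
    return np.maximum(layer_norm(x) @ w1, 0.0) @ w2

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product attention: global context over time."""
    h = layer_norm(x)
    q, k, v = h @ wq, h @ wk, h @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def depthwise_conv(x, kernel):
    """Per-channel temporal convolution with 'same' padding: local context."""
    h = layer_norm(x)
    k = kernel.shape[0]  # assumed odd so the output keeps the input length
    pad = k // 2
    padded = np.pad(h, ((pad, pad), (0, 0)))
    return np.stack([(padded[t:t + k] * kernel).sum(axis=0)
                     for t in range(x.shape[0])])

def init_block(dim, kernel_size):
    """Random, untrained parameters for one block (illustrative only)."""
    def w(*shape):
        return rng.normal(scale=0.1, size=shape)
    return {
        "ff1": (w(dim, 4 * dim), w(4 * dim, dim)),
        "attn": (w(dim, dim), w(dim, dim), w(dim, dim)),
        "conv": w(kernel_size, dim),
        "ff2": (w(dim, 4 * dim), w(4 * dim, dim)),
    }

def conformer_block(x, p):
    """Simplified conformer block with residual connections."""
    x = x + 0.5 * feed_forward(x, *p["ff1"])
    x = x + self_attention(x, *p["attn"])
    x = x + depthwise_conv(x, p["conv"])
    x = x + 0.5 * feed_forward(x, *p["ff2"])
    return layer_norm(x)

def multi_scale_fusion(x, blocks):
    """Run the blocks in sequence and concatenate every block's output."""
    outputs = []
    for p in blocks:
        x = conformer_block(x, p)
        outputs.append(x)
    return np.concatenate(outputs, axis=-1)

# Hypothetical input: 8 time steps of 16-dimensional behavior features.
frames, dim = 8, 16
features = rng.normal(size=(frames, dim))
blocks = [init_block(dim, k) for k in (3, 5, 7)]  # kernel sizes are assumptions
fused = multi_scale_fusion(features, blocks)
print(fused.shape)
```

With three blocks of width 16, the fused representation has 48 features per time step. A classification head for next speaker prediction or eye contact detection would sit on top of this tensor; that head is omitted here because the abstract does not describe it.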



Cited By

  • (2024) Less is More: Adaptive Feature Selection and Fusion for Eye Contact Detection. Proceedings of the 32nd ACM International Conference on Multimedia, 11390-11396. DOI: 10.1145/3664647.3688987. Online publication date: 28-Oct-2024
  • (2024) Overview of the NLPCC 2024 Shared Task 7: Multi-lingual Medical Instructional Video Question Answering. Natural Language Processing and Chinese Computing, 429-439. DOI: 10.1007/978-981-97-9443-0_38. Online publication date: 1-Nov-2024

    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].



    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 27 October 2023


    Author Tags

    1. eye contact detection
    2. multi-participant behavior analysis
    3. multi-scale conformer
    4. next speaker prediction


    • Research-article

    Funding Sources

    • National Natural Science Fund of China
    • Hunan Provincial Natural Science Foundation of China


    MM '23
    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023
    Ottawa ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%



    Article Metrics

    • Downloads (Last 12 months): 92
    • Downloads (Last 6 weeks): 8
    Reflects downloads up to 17 Feb 2025
