Understanding the Dynamics of Social Interactions: A Multi-Modal Multi-View Approach

Published: 17 February 2019

Abstract

In this article, we deal with the problem of understanding human-to-human interactions as a fundamental component of social event analysis. Inspired by the recent success of multi-modal visual data in many recognition tasks, we propose a novel approach that models dyadic interactions by means of features extracted from synchronized 3D skeleton coordinates, depth, and Red Green Blue (RGB) sequences. From skeleton data, we extract new view-invariant proxemic features, named Unified Proxemic Descriptor (UProD), which incorporate both intrinsic and extrinsic distances between the two interacting subjects. A novel key frame selection method is introduced to identify the salient instants of an interaction sequence based on the joints’ energy. From Red Green Blue Depth (RGB-D) videos, more holistic CNN features are extracted by applying an adapted pre-trained Convolutional Neural Network (CNN) to optical flow frames. To better understand the dynamics of interactions, we expand the boundaries of dyadic interaction analysis by modeling a previously untreated problem: discerning the active interactor from the passive one. Extensive experiments have been carried out on four multi-modal, multi-view interaction datasets, and the results demonstrate the superiority of the proposed techniques over state-of-the-art approaches.
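The abstract names two skeleton-based ingredients, the UProD proxemic descriptor and energy-based key frame selection, without detailing their construction on this page. The sketch below is a rough, hedged reading of those ideas only, not the authors’ formulation: the energy definition, the descriptor layout, and the function names (joint_energy, select_key_frames, uprod_like) are all illustrative assumptions.

```python
import numpy as np

def joint_energy(skeleton_seq):
    # skeleton_seq: (T, J, 3) array of T frames, J joints, xyz coordinates.
    # The energy of frame t is taken as the summed squared joint displacement
    # between frames t-1 and t (an assumed, not the published, definition).
    diffs = np.diff(skeleton_seq, axis=0)        # (T-1, J, 3)
    return (diffs ** 2).sum(axis=(1, 2))         # (T-1,)

def select_key_frames(skeleton_seq, k=10):
    # Keep the k frames whose joints move the most, in temporal order.
    energy = joint_energy(skeleton_seq)
    return np.sort(np.argsort(energy)[-k:] + 1)  # +1 maps diffs back to frames

def uprod_like(skel_a, skel_b):
    # Toy per-frame descriptor: intrinsic (within-subject) inter-joint
    # distances concatenated with extrinsic (between-subject) distances.
    # skel_a, skel_b: (J, 3) skeletons of the two interactors.
    def pairwise(x, y):
        return np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1).ravel()
    intrinsic = np.concatenate([pairwise(skel_a, skel_a),
                                pairwise(skel_b, skel_b)])
    extrinsic = pairwise(skel_a, skel_b)
    return np.concatenate([intrinsic, extrinsic])

# Example: two subjects, 100 frames, 25 Kinect-style joints each.
seq_a = np.random.randn(100, 25, 3)
seq_b = np.random.randn(100, 25, 3)
keys = select_key_frames(np.concatenate([seq_a, seq_b], axis=1), k=8)
descriptors = np.stack([uprod_like(seq_a[t], seq_b[t]) for t in keys])
print(descriptors.shape)  # (8, 1875): 2 x 625 intrinsic + 625 extrinsic distances
```

Note that raw inter-joint Euclidean distances, as in this toy version, are invariant only to a rigid transformation of the scene as a whole; the full view invariance claimed for UProD relies on the paper’s own construction.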




    Published In

    ACM Transactions on Multimedia Computing, Communications, and Applications  Volume 15, Issue 1s
    Special Section on Deep Learning for Intelligent Multimedia Analytics and Special Section on Multi-Modal Understanding of Social, Affective and Subjective Attributes of Data
    January 2019
    265 pages
    ISSN:1551-6857
    EISSN:1551-6865
    DOI:10.1145/3309769

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 17 February 2019
    Accepted: 01 November 2018
    Revised: 01 October 2018
    Received: 01 October 2017
    Published in TOMM Volume 15, Issue 1s


    Author Tags

    1. CNN
    2. Interaction recognition
    3. RGB-D
    4. active/passive subjects
    5. multi-modal data
    6. skeleton

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • Agency for Science, Technology and Research (A*STAR), Singapore


    Cited By

    • (2025) Individual Contribution-Based Spatial-Temporal Attention on Skeleton Sequences for Human Interaction Recognition. IEEE Access 13, 6463-6474. DOI: 10.1109/ACCESS.2024.3525185. Online publication date: 2025.
    • (2024) Modeling social interaction dynamics using temporal graph networks. In 2024 33rd IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), 2272-2278. DOI: 10.1109/RO-MAN60168.2024.10731450. Online publication date: 26-Aug-2024.
    • (2024) Survey of Automated Methods for Nonverbal Behavior Analysis in Parent-Child Interactions. In 2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG), 1-11. DOI: 10.1109/FG59268.2024.10582009. Online publication date: 27-May-2024.
    • (2022) Recent Advances in Vision-Based On-Road Behaviors Understanding: A Critical Survey. Sensors 22, 7 (2654). DOI: 10.3390/s22072654. Online publication date: 30-Mar-2022.
    • (2022) Skeleton Sequence and RGB Frame Based Multi-Modality Feature Fusion Network for Action Recognition. ACM Transactions on Multimedia Computing, Communications, and Applications 18, 3, 1-24. DOI: 10.1145/3491228. Online publication date: 4-Mar-2022.
    • (2022) Multi-scale residual network model combined with Global Average Pooling for action recognition. Multimedia Tools and Applications 81, 1, 1375-1393. DOI: 10.1007/s11042-021-11435-5. Online publication date: 1-Jan-2022.
    • (2020) Relative View based Holistic-Separate Representations for Two-person Interaction Recognition Using Multiple Graph Convolutional Networks. Journal of Visual Communication and Image Representation, 102833. DOI: 10.1016/j.jvcir.2020.102833. Online publication date: May-2020.
