Understanding the Dynamics of Social Interactions: A Multi-Modal Multi-View Approach

Published: 17 February 2019

Abstract

In this article, we deal with the problem of understanding human-to-human interactions as a fundamental component of social event analysis. Inspired by the recent success of multi-modal visual data in many recognition tasks, we propose a novel approach that models dyadic interactions by means of features extracted from synchronized 3D skeleton coordinates, depth, and Red Green Blue (RGB) sequences. From skeleton data, we extract new view-invariant proxemic features, named Unified Proxemic Descriptor (UProD), which incorporate both intrinsic and extrinsic distances between the two interacting subjects. A novel key frame selection method is introduced to identify the salient instants of an interaction sequence based on the joints’ energy. From Red Green Blue Depth (RGB-D) videos, more holistic CNN features are extracted by applying an adapted pre-trained Convolutional Neural Network (CNN) to optical flow frames. To better understand the dynamics of interactions, we expand the boundaries of dyadic interaction analysis by modeling a previously untreated problem: discerning the active interactor from the passive one. Extensive experiments have been carried out on four multi-modal, multi-view interaction datasets, and the results demonstrate the superiority of the proposed techniques over state-of-the-art approaches.
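The abstract names two skeleton-based ingredients, the UProD proxemic descriptor and energy-based key frame selection, without detailing their construction on this page. The sketch below is a rough, hedged reading of those ideas only, not the authors’ formulation: the energy definition, the descriptor layout, and the function names (joint_energy, select_key_frames, uprod_like) are all illustrative assumptions.

```python
import numpy as np

def joint_energy(skeleton_seq):
    # skeleton_seq: (T, J, 3) array of T frames, J joints, xyz coordinates.
    # The energy of frame t is taken as the summed squared joint displacement
    # between frames t-1 and t (an assumed, not the published, definition).
    diffs = np.diff(skeleton_seq, axis=0)        # (T-1, J, 3)
    return (diffs ** 2).sum(axis=(1, 2))         # (T-1,)

def select_key_frames(skeleton_seq, k=10):
    # Keep the k frames whose joints move the most, in temporal order.
    energy = joint_energy(skeleton_seq)
    return np.sort(np.argsort(energy)[-k:] + 1)  # +1 maps diffs back to frames

def uprod_like(skel_a, skel_b):
    # Toy per-frame descriptor: intrinsic (within-subject) inter-joint
    # distances concatenated with extrinsic (between-subject) distances.
    # skel_a, skel_b: (J, 3) skeletons of the two interactors.
    def pairwise(x, y):
        return np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1).ravel()
    intrinsic = np.concatenate([pairwise(skel_a, skel_a),
                                pairwise(skel_b, skel_b)])
    extrinsic = pairwise(skel_a, skel_b)
    return np.concatenate([intrinsic, extrinsic])

# Example: two subjects, 100 frames, 25 Kinect-style joints each.
seq_a = np.random.randn(100, 25, 3)
seq_b = np.random.randn(100, 25, 3)
keys = select_key_frames(np.concatenate([seq_a, seq_b], axis=1), k=8)
descriptors = np.stack([uprod_like(seq_a[t], seq_b[t]) for t in keys])
print(descriptors.shape)  # (8, 1875): 2 x 625 intrinsic + 625 extrinsic distances
```

Note that raw inter-joint Euclidean distances, as in this toy version, are invariant only to a rigid transformation of the scene as a whole; the full view invariance claimed for UProD relies on the paper’s own construction.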




    Published In

    ACM Transactions on Multimedia Computing, Communications, and Applications  Volume 15, Issue 1s
    Special Section on Deep Learning for Intelligent Multimedia Analytics and Special Section on Multi-Modal Understanding of Social, Affective and Subjective Attributes of Data
    January 2019
    265 pages
    ISSN:1551-6857
    EISSN:1551-6865
    DOI:10.1145/3309769

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 17 February 2019
    Accepted: 01 November 2018
    Revised: 01 October 2018
    Received: 01 October 2017
    Published in TOMM Volume 15, Issue 1s


    Author Tags

    1. CNN
    2. Interaction recognition
    3. RGB-D
    4. active/passive subjects
    5. multi-modal data
    6. skeleton

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • Agency for Science, Technology and Research (A*STAR), Singapore


    Cited By

    • (2025) Individual Contribution-Based Spatial-Temporal Attention on Skeleton Sequences for Human Interaction Recognition. IEEE Access 13, 6463-6474. DOI: 10.1109/ACCESS.2024.3525185. Online publication date: 2025.
    • (2024) Modeling social interaction dynamics using temporal graph networks. In 2024 33rd IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), 2272-2278. DOI: 10.1109/RO-MAN60168.2024.10731450. Online publication date: 26-Aug-2024.
    • (2024) Survey of Automated Methods for Nonverbal Behavior Analysis in Parent-Child Interactions. In 2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG), 1-11. DOI: 10.1109/FG59268.2024.10582009. Online publication date: 27-May-2024.
    • (2022) Recent Advances in Vision-Based On-Road Behaviors Understanding: A Critical Survey. Sensors 22, 7 (2654). DOI: 10.3390/s22072654. Online publication date: 30-Mar-2022.
    • (2022) Skeleton Sequence and RGB Frame Based Multi-Modality Feature Fusion Network for Action Recognition. ACM Transactions on Multimedia Computing, Communications, and Applications 18, 3, 1-24. DOI: 10.1145/3491228. Online publication date: 4-Mar-2022.
    • (2022) Multi-scale residual network model combined with Global Average Pooling for action recognition. Multimedia Tools and Applications 81, 1, 1375-1393. DOI: 10.1007/s11042-021-11435-5. Online publication date: 1-Jan-2022.
    • (2020) Relative View based Holistic-Separate Representations for Two-person Interaction Recognition Using Multiple Graph Convolutional Networks. Journal of Visual Communication and Image Representation, 102833. DOI: 10.1016/j.jvcir.2020.102833. Online publication date: May-2020.
