ABSTRACT
Understanding small-group face-to-face interactions is a prominent research problem in social psychology, and its automatic realization has recently become popular in social computing. The problem is mainly investigated in terms of nonverbal behaviors, as they are one of the main facets of communication. Among the multimodal nonverbal cues, visual activity is an important one, and its performance can be crucial, for instance, when audio sensors are missing. The existing visual activity-based nonverbal features, all of which are hand-crafted, perform well enough for some applications but not for others. Given these observations, we claim that there is a need for more robust feature representations that can be learned from the data itself. To this end, we propose a novel method composed of optical flow computation, deep neural network-based feature learning, feature encoding, and classification. A comprehensive comparison of different feature encoding techniques is also presented. The proposed method is tested on three research topics that can be perceived during small-group interactions, i.e., meetings: i) emergent leader detection, ii) emergent leadership style prediction, and iii) high/low extraversion classification. The proposed method shows (significantly) better results, not only compared with the state-of-the-art visual activity-based nonverbal features but also compared with those features combined with other audio-based and video-based nonverbal features.
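The feature encoding stage named above turns a variable number of per-frame feature vectors into one fixed-length descriptor that a classifier can consume. The abstract does not say which encodings are compared, so the following is only a minimal sketch under an assumed choice (mean and standard-deviation pooling); the function name `encode_mean_std` and the toy feature values are hypothetical and are not taken from the paper.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def encode_mean_std(frame_features: Sequence[Sequence[float]]) -> List[float]:
    """Pool per-frame feature vectors into one fixed-length descriptor.

    Assumption: mean/std pooling stands in for the (unspecified) encodings
    compared in the paper. All frames are expected to share one dimensionality.
    """
    if not frame_features:
        raise ValueError("need at least one frame")
    # Transpose: one tuple of values per feature dimension.
    dims = list(zip(*frame_features))
    means = [fmean(d) for d in dims]
    stds = [pstdev(d) for d in dims]  # population standard deviation
    return means + stds


# Hypothetical 2-D learned features for clips of different lengths.
short_clip = [[0.0, 1.0], [2.0, 1.0]]
long_clip = [[0.0, 1.0], [2.0, 1.0], [4.0, 1.0], [2.0, 1.0]]

short_code = encode_mean_std(short_clip)
long_code = encode_mean_std(long_clip)
```

Both clips map to a descriptor of length 4 (two means followed by two standard deviations), regardless of how many frames each clip contains. That is the property a downstream classifier relies on.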