DOI: 10.1145/3340555.3353761

Improved Visual Focus of Attention Estimation and Prosodic Features for Analyzing Group Interactions

Published: 14 October 2019

Abstract

Collaborative group tasks require efficient and productive verbal and non-verbal interactions among participants. Studying such interaction patterns could help groups perform more efficiently, but detecting and measuring human behavior is challenging since it is inherently multimodal and changes on a millisecond time scale. In this paper, we present a method for studying groups performing a collaborative decision-making task using non-verbal behavioral cues. First, we present a novel algorithm that estimates the visual focus of attention (VFOA) of participants using frontal cameras. The algorithm can be applied in a variety of group settings and achieves a state-of-the-art accuracy of 90%. Second, we present prosodic features for non-verbal speech analysis. These features are commonly used in speech/music classification tasks, but are rarely applied to the analysis of human group interactions. We validate our algorithms on a multimodal dataset of 14 group meetings with 45 participants, and show that a combination of VFOA-based visual metrics and prosodic-feature-based metrics can predict emergent group leaders with 64% accuracy and dominant contributors with 86% accuracy. We also report our findings on the correlations between the non-verbal behavioral metrics and gender, emotional intelligence, and the Big Five personality traits.
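To make the VFOA idea concrete, the sketch below is an illustration of the general approach, not the authors' algorithm: it labels each video frame with the focus target whose bearing is closest to the participant's head yaw, as produced by a head-pose estimator such as OpenFace 2.0. The seat bearings and the aversion threshold are hypothetical values chosen for illustration.

```python
# Hypothetical bearings (degrees) of the other participants, measured
# from this participant's frontal camera axis. Values are illustrative.
TARGETS = {"person_B": -40.0, "person_C": 0.0, "person_D": 40.0}
AVERSION_THRESHOLD = 20.0  # deg; farther than this from every target -> "unfocused"

def estimate_vfoa(head_yaw_deg: float) -> str:
    """Label a frame with the target whose bearing is nearest the head yaw."""
    name, bearing = min(TARGETS.items(),
                        key=lambda kv: abs(kv[1] - head_yaw_deg))
    return name if abs(bearing - head_yaw_deg) <= AVERSION_THRESHOLD else "unfocused"

# A per-frame yaw track (e.g., from a face tracker) becomes a per-frame
# VFOA label sequence, from which attention metrics (who looks at whom,
# and for how long) can be accumulated.
yaw_track = [-38.2, -35.0, 1.5, 3.0, 42.0, 75.0]
print([estimate_vfoa(y) for y in yaw_track])
# ['person_B', 'person_B', 'person_C', 'person_C', 'person_D', 'unfocused']
```

On the audio side, the prosodic features the abstract refers to are of the kind used in speech/music discrimination: pitch statistics, spectral flatness, spectral contrast, and similar measures. Below is a minimal sketch using off-the-shelf librosa routines, not the paper's exact feature set:

```python
import numpy as np
import librosa

def prosodic_features(wav_path: str) -> dict:
    """Summary prosodic/spectral statistics for one speaker's audio channel."""
    y, sr = librosa.load(wav_path, sr=16000, mono=True)
    f0 = librosa.yin(y, fmin=65, fmax=400, sr=sr)              # pitch track (Hz)
    flatness = librosa.feature.spectral_flatness(y=y)          # tonal vs. noise-like
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr)   # peak/valley spread
    return {
        "pitch_mean": float(np.mean(f0)),
        "pitch_std": float(np.std(f0)),
        "flatness_mean": float(flatness.mean()),
        "contrast_mean": float(contrast.mean()),
    }
```

Per-participant aggregates of such visual and prosodic metrics could then be fed to a standard classifier (e.g., an SVM) to predict emergent leadership or dominant contribution, in the spirit of the results reported above.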




Published In

ICMI '19: 2019 International Conference on Multimodal Interaction
October 2019
601 pages
ISBN: 9781450368605
DOI: 10.1145/3340555
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. Multimodal sensing
  2. group meeting analysis
  3. prosodic acoustic features
  4. smart rooms
  5. visual focus of attention

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICMI '19

Acceptance Rates

Overall Acceptance Rate 453 of 1,080 submissions, 42%


Cited By

  • Developing a Generic Focus Modality for Multimodal Interactive Environments. Companion Publication of the 25th International Conference on Multimodal Interaction (2023), 31–35. DOI: 10.1145/3610661.3617165
  • Automated Detection of Joint Attention and Mutual Gaze in Free Play Parent-Child Interactions. Companion Publication of the 25th International Conference on Multimodal Interaction (2023), 374–382. DOI: 10.1145/3610661.3616234
  • Multi-scale Conformer Fusion Network for Multi-participant Behavior Analysis. Proceedings of the 31st ACM International Conference on Multimedia (2023), 9472–9476. DOI: 10.1145/3581783.3612847
  • MultiMediate. Proceedings of the 29th ACM International Conference on Multimedia (2021), 4878–4882. DOI: 10.1145/3474085.3479219
  • Predicting Gaze from Egocentric Social Interaction Videos and IMU Data. Proceedings of the 2021 International Conference on Multimodal Interaction (2021), 717–722. DOI: 10.1145/3462244.3479954
  • An Exploratory Computational Study on the Effect of Emergent Leadership on Social and Task Cohesion. Companion Publication of the 2021 International Conference on Multimodal Interaction (2021), 263–272. DOI: 10.1145/3461615.3485415
  • Classroom Digital Twins with Instrumentation-Free Gaze Tracking. Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (2021), 1–9. DOI: 10.1145/3411764.3445711
  • Temporal analysis of multimodal data to predict collaborative learning outcomes. British Journal of Educational Technology 51, 5 (2020), 1527–1547. DOI: 10.1111/bjet.12982
  • Multiparty Visual Co-Occurrences for Estimating Personality Traits in Group Meetings. 2020 IEEE Winter Conference on Applications of Computer Vision (WACV) (2020), 2074–2083. DOI: 10.1109/WACV45572.2020.9093642
  • A Multi-Stream Recurrent Neural Network for Social Role Detection in Multiparty Interactions. IEEE Journal of Selected Topics in Signal Processing 14, 3 (2020), 554–567. DOI: 10.1109/JSTSP.2020.2992394
