Learning multimodal behavioral models for face-to-face social interaction

  • Original Paper
Journal on Multimodal User Interfaces

Abstract

The aim of this paper is to model multimodal perception-action loops of human behavior in face-to-face interactions. To this end, we propose trainable behavioral models that predict the optimal actions for one specific person given the perceived actions of the others and the joint goals of the interlocutors. We first compare sequential models—in particular discrete hidden Markov models (DHMMs)—with standard classifiers (SVMs and decision trees). We propose a modified initialization of the DHMMs that better captures the recurrent structure of the sensory-motor states, and we show that explicit state duration modeling with discrete hidden semi-Markov models (DHSMMs) improves prediction performance. We apply these models to parallel speech and gaze data collected from interacting dyads; the challenge is to predict the gaze of one subject given the gaze of the interlocutor and the voice activity of both. For both DHMMs and DHSMMs, the short-time Viterbi algorithm is used for incremental decoding and prediction. Beyond pure classification performance, we objectively evaluate several properties of the proposed models. Results show that incremental DHMMs (IDHMMs) outperform the classic classifiers and are in turn surpassed by incremental DHSMMs (IDHSMMs). This latter result emphasizes the relevance of explicit state duration modeling.
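
To make the incremental decoding idea concrete, here is a minimal Python sketch of the short-time Viterbi principle named in the abstract: a standard Viterbi recursion that emits a state as soon as every survivor path agrees on it (a "fusion point"), so decisions are produced online rather than after the whole sequence. This is an illustration under assumed toy parameters, not the paper's implementation; the class name and all numbers are hypothetical.

```python
import numpy as np


class ShortTimeViterbi:
    """Incremental Viterbi decoder for a discrete HMM: states are emitted
    as soon as all survivor paths agree on them (a 'fusion point')."""

    def __init__(self, log_pi, log_A, log_B):
        self.log_pi = log_pi   # (N,)   initial state log-probabilities
        self.log_A = log_A     # (N, N) transition log-probabilities
        self.log_B = log_B     # (N, M) emission log-probabilities
        self.delta = None      # best log-score of a path ending in each state
        self.paths = None      # survivor path ending in each state
        self.fixed = 0         # number of states already emitted

    def step(self, obs):
        """Consume one discrete observation; return the newly fixed states."""
        n = len(self.log_pi)
        if self.delta is None:                    # first frame
            self.delta = self.log_pi + self.log_B[:, obs]
            self.paths = [[s] for s in range(n)]
        else:                                     # Viterbi recursion
            scores = self.delta[:, None] + self.log_A         # (N, N)
            best_prev = np.argmax(scores, axis=0)
            self.delta = scores[best_prev, np.arange(n)] + self.log_B[:, obs]
            self.paths = [self.paths[best_prev[s]] + [s] for s in range(n)]
        # Emit the prefix on which all survivor paths now agree.
        out = []
        t = self.fixed
        while t < len(self.paths[0]) and len({p[t] for p in self.paths}) == 1:
            out.append(self.paths[0][t])
            t += 1
        self.fixed = t
        return out


# Toy 2-state, 2-symbol model; all parameter values are illustrative only.
log_pi = np.log([0.6, 0.4])
log_A = np.log([[0.9, 0.1], [0.2, 0.8]])
log_B = np.log([[0.8, 0.2], [0.3, 0.7]])

decoder = ShortTimeViterbi(log_pi, log_A, log_B)
for o in [0, 0, 1, 1, 1, 0]:
    print(decoder.step(o))   # states decided as of this frame (may be [])
```

In the paper's setting the observation stream would be the discretized gaze and voice-activity signals of the dyad, and the decoded states would drive the predicted gaze of the modeled subject. A semi-Markov (DHSMM) variant, which this sketch omits, replaces the implicit geometric state-duration decay of an HMM (probability a_jj^(d-1)(1 - a_jj) of staying d frames in state j) with an explicitly modeled duration distribution p_j(d), which is the duration modeling the abstract credits for the performance gain.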

Acknowledgments

This research was financed by the Rhône-Alpes ARC6 research council and the ANR SOMBRERO project (ANR-14-CE27-0014).

Author information

Corresponding author: Alaeddine Mihoub.

About this article

Cite this article

Mihoub, A., Bailly, G., Wolf, C. et al. Learning multimodal behavioral models for face-to-face social interaction. J Multimodal User Interfaces 9, 195–210 (2015). https://doi.org/10.1007/s12193-015-0190-7
