Detecting head movements in video-recorded dyadic conversations

Published: 16 October 2018

Abstract

This paper is about the automatic recognition of head movements in videos of face-to-face dyadic conversations. We present an approach in which head movement recognition is cast as a multimodal frame classification problem based on visual and acoustic features. The visual features comprise velocity, acceleration, and jerk values associated with head movements, while the acoustic ones are pitch and intensity measurements from the co-occurring speech. We present the results obtained by training and testing a number of classifiers on manually annotated data from two conversations. The best-performing classifier, a Multilayer Perceptron trained on all the features, achieves an accuracy of 0.75 and outperforms the mono-modal baseline classifier.
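
The frame-classification setup described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the data here is synthetic, and the toolkit (scikit-learn), the hidden-layer size, and the iteration count are all assumptions. Each frame is represented by three visual features (velocity, acceleration, jerk) and two acoustic features (pitch, intensity), and a Multilayer Perceptron predicts whether the frame belongs to a head movement.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Synthetic per-frame feature vectors standing in for the paper's features:
# three visual (velocity, acceleration, jerk) and two acoustic (pitch,
# intensity) values per video frame. Frames within a head movement
# (label 1) are drawn from a shifted distribution relative to still
# frames (label 0).
n_frames = 1000
X_move = rng.normal(loc=1.0, scale=1.0, size=(n_frames // 2, 5))
X_still = rng.normal(loc=0.0, scale=1.0, size=(n_frames // 2, 5))
X = np.vstack([X_move, X_still])
y = np.array([1] * (n_frames // 2) + [0] * (n_frames // 2))

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# A small Multilayer Perceptron, echoing the paper's best-performing
# classifier type (architecture and hyperparameters are illustrative).
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
clf.fit(X_tr, y_tr)
acc = accuracy_score(y_te, clf.predict(X_te))
print(f"frame classification accuracy: {acc:.2f}")
```

In the paper itself the per-frame labels come from manual annotation of two conversations rather than synthetic draws, and the multimodal MLP is compared against a mono-modal baseline.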


Cited By

  • (2024) Phonetic differences between affirmative and feedback head nods in German Sign Language (DGS): A pose estimation study. PLOS ONE 19(5): e0304040. DOI: 10.1371/journal.pone.0304040. Online publication date: 30-May-2024.
  • (2024) An Outlook for AI Innovation in Multimodal Communication Research. Digital Human Modeling and Applications in Health, Safety, Ergonomics and Risk Management, 182-234. DOI: 10.1007/978-3-031-61066-0_13. Online publication date: 29-Jun-2024.

Published In

ICMI '18: Proceedings of the 20th International Conference on Multimodal Interaction: Adjunct
October 2018
62 pages
ISBN:9781450360029
DOI:10.1145/3281151

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. head movement classification
  2. multimodal features

Qualifiers

  • Research-article

Conference

ICMI '18
Acceptance Rates

Overall Acceptance Rate 453 of 1,080 submissions, 42%

Article Metrics

  • Downloads (last 12 months): 1
  • Downloads (last 6 weeks): 0
Reflects downloads up to 20 Jan 2025.
