Speech Communication

Volume 57, February 2014, Pages 233-243

Analysis of relationship between head motion events and speech in dialogue conversations

https://doi.org/10.1016/j.specom.2013.06.008

Highlights

  • We analyzed relations between head motion and speech on multi-speaker dialogue data.

  • We found relations between head motion types (nods, shakes and tilts) and dialogue acts.

  • The timing of the speaker’s nods was found to be synchronized with the last syllable.

  • The frequency of nods was found to be influenced by the inter-personal relationship.

Abstract

Head motion naturally occurs in synchrony with speech and may convey paralinguistic information (such as intentions, attitudes and emotions) in dialogue communication. With the aim of verifying the relationship between head motion events and speech utterances, analyses were conducted on motion-captured data of multiple speakers during spontaneous dialogue conversations. The relationship between head motion events and dialogue acts was analyzed first. Among the head motion types, nods occurred most frequently during speech utterances, not only expressing dialogue acts of agreement or affirmation, but also appearing at the ends of phrases with strong boundaries (including both turn-keeping and turn-giving dialogue act functions). Head shakes usually appeared to express negation, while head tilts appeared mostly in interjections expressing denial and in phrases with weak boundaries, where the speaker is thinking or has not finished the utterance. The synchronization of head motion events and speech was also analyzed, with focus on the timing of nods relative to the last syllable of a phrase. Results showed that nods were highly synchronized with the center portion of backchannels, while they were more synchronized with the end portion of the last syllable in phrases with strong boundaries. Speaker variability analyses indicated that the inter-personal relationship with the interlocutor is one factor influencing the frequency of head motion events. The frequency of nods was lower for dialogue partners with a close relationship (such as family members), where speakers do not have to express careful attitudes. On the other hand, the frequency of nods (especially of multiple nods) clearly increased when the inter-personal relationship between the dialogue partners was distant.

Introduction

Head motion naturally occurs in synchrony with speech utterances, and may carry paralinguistic information related to intentions, attitudes or emotions in dialogue communication. Therefore, a better understanding of the relationship between head motion events and speech utterances is important for applications in multi-modal human–agent or human–robot interaction.

Head motion analyses can be focused on two problems from the application viewpoint: one is how to generate the head motion of CG (Computer Graphics) agents or robots, synchronized with their speech utterances (e.g. Yehia et al., 2002, Munhall et al., 2004, Watanabe et al., 2004, Sargin et al., 2006, Beskow et al., 2006, Busso et al., 2007, Foster and Oberlander, 2007, Hofer and Shimodaira, 2007); the other is how to recognize the user’s head motion and interpret its role in communication (e.g., Iwano et al., 1996, Watanuki et al., 2000, Graf et al., 2002, Dohen et al., 2006, Sidner et al., 2006, Morency et al., 2007, Burnham et al., 2007). The generation of natural head motion is not only useful for improving human–agent or human–robot interaction, but can also improve intelligibility in noisy environments. For example, better perception of syllables was reported in a speech-in-noise task when normal, natural head motion was presented, compared with speech without head motion and with auditory-only stimuli, in an experiment with animations (Munhall et al., 2004). Intelligibility of tones may also be improved by the use of head motion in tonal languages (Burnham et al., 2007).

Many works in the literature have analyzed the correlation between head motion and prosodic features, such as fundamental frequency (F0) contours, which represent pitch movements (Yehia et al., 2002, Munhall et al., 2004, Busso et al., 2007).

For example, Yehia et al. tried to associate head motion with speech through the fundamental frequency (F0) (Yehia et al., 2002). Experiments using read speech utterances of one American English speaker (ES) and one Japanese speaker (JS) showed the following results for the estimation of head motion from F0 and vice versa. From head motion to F0, the average correlation was 0.73 for JS and 0.88 for ES. The opposite estimation, from F0 to head motion, showed weaker correlations (0.25 for JS and 0.50 for ES, on average). In addition, correlations between F0 and the 6 DOF (degrees of freedom) of head motion (3 DOF for rotation and 3 DOF for translation) were between 0.39 and 0.52 for ES, and between 0.22 and 0.30 for JS, i.e., on average below 0.50. Munhall et al. (2004) report that head motion, in all 6 DOF, is correlated with the pitch and amplitude of the talker’s voice in Japanese read speech utterances. For several sentences, correlations were almost always over 0.50, on average about 0.63.
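
As an illustration of the kind of per-DOF correlation analysis described above, the following Python sketch computes a Pearson correlation between an F0 contour and each of the 6 DOF of head motion. It is a minimal sketch with synthetic data; the variable names and preprocessing (time alignment, removal of unvoiced frames) are assumptions, not the exact procedure of the cited works.

    import numpy as np

    def per_dof_correlations(f0, head_motion):
        """f0: (T,) array of F0 values; head_motion: (T, 6) array
        (3 rotation + 3 translation DOF), time-aligned with f0.
        Returns one Pearson correlation coefficient per DOF."""
        f0c = f0 - f0.mean()
        hm = head_motion - head_motion.mean(axis=0)
        num = hm.T @ f0c
        den = np.linalg.norm(hm, axis=0) * np.linalg.norm(f0c)
        return num / den

    # Synthetic example: six DOFs weakly coupled to the F0 contour.
    rng = np.random.default_rng(0)
    T = 500
    f0 = rng.normal(size=T)
    head = 0.4 * f0[:, None] + rng.normal(size=(T, 6))
    print(per_dof_correlations(f0, head))  # six moderate correlation values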

The results above imply that the correspondence between head motion and prosodic features is language dependent, and that the use of only prosodic information might not be enough to generate natural head motion.

Other works show that head motion may differ according to emotional states (Beskow et al., 2006, Busso et al., 2007). An analysis of the relation between facial parameters (including head motion) and several expressive modes was reported by Beskow et al. for short read Swedish utterances in which focal accent was systematically varied (Beskow et al., 2006). Results indicated that in all expressive modes, words with focal accent were accompanied by greater variation of the facial parameters than words in non-focal positions. Regarding head motion, head pitch showed larger variations in the “certain”, “angry” and “confirming” modes, head yaw in the “angry” and “happy” modes, and head roll in the “certain” mode. Busso et al. compared head motion in neutral and emotional speech for synthesis purposes (Busso et al., 2007). They investigated head motion for four emotional states (neutral, sadness, happiness and anger). As prosodic features, they used the pitch (F0), the RMS (root mean square) energy, and their 1st and 2nd derivatives. For the head motion, they took into account the 3 DOF of head rotation. Canonical correlation analysis (which provides a measure of the correlation between two streams of data with equal or different dimensionality) was applied to the streams of prosodic features and head motion, resulting in correlations around 0.7 for all expressive modes. However, as the prosodic features are implicit in the synthesis models, it is not clear which of the features are related to a specific emotion.
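
To make the canonical correlation analysis step concrete, the sketch below applies scikit-learn’s CCA to two synthetic streams standing in for prosodic features (F0, energy and their derivatives) and the 3 DOF of head rotation. The feature layout and data are illustrative assumptions, not the actual setup of Busso et al.

    import numpy as np
    from sklearn.cross_decomposition import CCA

    rng = np.random.default_rng(1)
    T = 1000
    prosody = rng.normal(size=(T, 6))                           # e.g. F0, energy, and their 1st/2nd derivatives
    head_rot = 0.5 * prosody[:, :3] + rng.normal(size=(T, 3))   # 3 DOF of head rotation, partially coupled

    cca = CCA(n_components=1)
    U, V = cca.fit_transform(prosody, head_rot)   # projections onto the first canonical pair
    r = np.corrcoef(U[:, 0], V[:, 0])[0, 1]       # first canonical correlation
    print(f"canonical correlation: {r:.2f}")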

Most of the works focusing on head motion synthesis, as in the ones cited above, usually associate the acoustic features directly with the raw head motion measurements through models. However, to better understand the roles of head motion in speech communication, it could be more appropriate to analyze head motion events, like nods, head shakes and head tilts.

For example, head nods are reported to be related to emphasis (or focus) (Sargin et al., 2006, Foster and Oberlander, 2007, Dohen et al., 2006). Variations in speech emphasis depending on head movement were observed by Graf et al. for English sentences (Graf et al., 2002). They reported that emphasis of a word often goes along with head nodding, and that a rise of the head can correspond to a rise in the voice. They call these movements ‘visual prosody’. A talking head that included ‘visual prosody’ through head motion was reported as looking more natural, even if this motion was not really connected with the content of the spoken text. Sargin et al. also reported a correlation between head motion events (nods and head tilts) and speech prominences marked as pitch accents for English (Sargin et al., 2006). They carried out experiments with one native speaker of Canadian English to investigate the correlation between keyword speech (such as “left”, “right” and “straight”) and gestures (including hand and head gestures). With focus on head nods and tilts, a correspondence of about 64% between these two head motion types and pitch accents was reported. Dohen et al. also reported that eyebrow raising and/or head nods signal focus in French (Dohen et al., 2006).

As can be observed from the past works described above, most of them focus on the relationship between head motion and prosodic features. However, we consider that this relationship might be language-dependent, since the function of the prosodic features differs depending on whether the language is a tonal language (such as Chinese or Thai), a lexical pitch-accent language (such as Japanese), or a stress-accent language (such as English and other European languages). The present work focuses on Japanese, for which the correlation between head motion and prosodic features has been reported to be lower than in English (Yehia et al., 2002). Further, most of the past works described above analyzed read speech or acted emotional speech data from only a few speakers. Thus, the results reported in past works may not apply to all languages.

In the present work, the analysis of the relationship between head motion and speech focuses on Japanese spontaneous speech. For Japanese, there are works reporting that head motion might be related to turn-taking and speech act functions in spontaneous dialogue speech. For example, Iwano et al. analyzed relations between head motion and the semantics of utterances in Japanese spoken dialogue, with the purpose of improving spoken dialogue understanding by also using visual information (Iwano et al., 1996). They also considered speaking turns and speech act functions. Their main findings were that: affirmation, agreement and giving responses involve vertical movement of the head; when the speaker wants a response from the listener, the speaker often faces up to look at the partner; and when the listener moves his or her head vertically, he or she is giving an affirmative response to the speaker. Watanuki et al. analyzed relations between turn-taking and gestures (including head, hand and upper-body motions) in Japanese spontaneous dialogue data (Watanuki et al., 2000). They found that vertical head movements occur around utterance beginnings and before utterance ends, in both turn-change and turn-hold cases. However, they did not consider the direction of the vertical movements.

In our preliminary analysis, the relationships between head motion, dialogue acts (including turn-taking functions), and prosodic (including voice quality) features were investigated in spontaneous dialogue speech data of one Japanese female speaker (Ishi et al., 2007). As prosodic features, phrase-final tones were analyzed instead of the global F0 contours used in conventional works, since low correlations between F0 and head motion have been reported in past works for Japanese. Nonetheless, a better correspondence was found between head motion events and dialogue act functions than between head motion and prosodic features.

Based on the analysis results for one speaker, in the present work we extended the analysis to spontaneous dialogue speech data of multiple speakers, and investigated the effects of speaker variability on the relationship between head motion and speech. The present manuscript is an extension of our previous works on head motion analysis (Ishi et al., 2007, Ishi et al., 2008, Ishi et al., 2010).

The rest of the paper is organized as follows. Section 2 describes the speech, motion, and annotation data used in the analysis. In Section 3, the relationship between head motion and dialogue acts, and speaker variability in head motion are analyzed for multiple speaker data. In Section 4 the main conclusions are summarized.

Section snippets

Speech and motion data

Multi-modal dialogue speech data were collected for several pairs of speakers. Data of seven speakers (four male and three female speakers) were used for the analysis of the present section. Fig. 1 shows the IDs (and ages) of the speakers, and the relationship between them. The IDs beginning with F and M are female and male speakers, respectively. The lines linking two IDs indicate the pairs of speakers whose dialogue data were recorded. The dashed lines indicate the dialogues where the

Relationship between head motion and dialogue act functions

Firstly, the general trends of the head motion types for each dialogue act function were analyzed, disregarding speaker variability. Fig. 4 shows the overall distributions of the head motion types for each dialogue act type. The total number of occurrences for each dialogue act type is shown on the right side of each corresponding bar, for reference. A chi-squared test was conducted on the cross-distributions between head motion and dialogue act tags (chi-square(77) = 2355.8, p < 0.01). The head
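
The chi-squared test mentioned above can be reproduced in form (not with the actual counts) using a standard contingency-table test; the sketch below applies scipy.stats.chi2_contingency to a small hypothetical table of head motion types versus dialogue act types. The labels and counts are purely illustrative, not the paper’s data.

    import numpy as np
    from scipy.stats import chi2_contingency

    # Hypothetical counts: rows = head motion types, columns = dialogue act types.
    counts = np.array([
        # backchannel  agreement  strong boundary  weak boundary
        [120,          80,        60,              10],    # nod
        [  5,           2,         3,               4],    # head shake
        [  8,           4,        10,              25],    # head tilt
        [ 40,          30,        90,             120],    # no head motion
    ])
    chi2, p, dof, expected = chi2_contingency(counts)
    print(f"chi-square({dof}) = {chi2:.1f}, p = {p:.3g}")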

Conclusion

Analyses were conducted on the relationship between head motion and speech, in spontaneous dialogue data of multiple speakers.

Among the head motion types, nods were the most frequent, appearing not only for expressing dialogue acts such as agreement or affirmation (in about 80% of backchannels), but also as indicative of syntactic or semantic units, appearing in about 30–40% of phrases with strong boundaries, including both turn-keeping and turn-giving dialogue act functions. In contrast, in

Acknowledgments

This work is partly supported by the Ministry of Internal Affairs and Communication. We thank Kyoko Nakanishi, Maiko Hirano, Chaoran Liu, Hiroaki Hatano and Mika Morita for their contributions in data annotation and analysis. We also thank Freerk Wilbers and Judith Haas for their contributions in the collection and processing of motion data.

References (22)

  • H.C. Yehia et al. Linking facial animation, head motion and speech acoustics. J. Phonetics (2002)
  • L.-P. Morency et al. Head gestures for perceptual interfaces: the role of context in improving recognition. Artif. Intell. (2007)
  • M. Swerts et al. Facial expression and prosodic prominence: effects of modality and facial area. J. Phonetics (2008)
  • K.G. Munhall et al. Visual prosody and speech intelligibility – head movement improves auditory speech perception. Psychol. Sci. (2004)
  • T. Watanabe et al. InterActor: speech-driven embodied interactive actor. Int. J. Hum. Comput. Interact. (2004)
  • Sargin, M.E., Aran, O., Karpov, A., Ofli, F., Yasinnik, Y., Wilson, S., Erzin, E., Yemez, Y., Tekalp, A.M., 2006. ...
  • Beskow, J., Granstrom, B., House, D., 2006. Visual correlates to prominence in several expressive modes. In: ...
  • C. Busso et al. Rigid head motion in expressive speech animation: analysis and synthesis. IEEE Trans. Audio Speech Lang. Process. (2007)
  • M.E. Foster et al. Corpus-based generation of head and eyebrow motion for an embodied conversational agent. Lang. Resour. Eval. (2007)
  • Hofer, G., Shimodaira, H., 2007. Automatic head motion prediction from speech data. In: Proceedings of Interspeech ...
  • Iwano, Y., Kageyama, S., Morikawa, E., Nakazato, S., Shirai, K., 1996. Analysis of head movements and its role in ...