Analysis of relationship between head motion events and speech in dialogue conversations
Introduction
Head motion naturally occurs in synchrony with speech utterances, and may carry paralinguistic information related to intentions, attitudes or emotions in dialogue communication. A better understanding of the relationship between head motion events and speech utterances is therefore important for applications in multi-modal human–agent and human–robot interaction.
Head motion analyses can focus on two problems from the application viewpoint: one is how to generate the head motion of CG (computer graphics) agents or robots in synchrony with their speech utterances (e.g., Yehia et al., 2002, Munhall et al., 2004, Watanabe et al., 2004, Sargin et al., 2006, Beskow et al., 2006, Busso et al., 2007, Foster and Oberlander, 2007, Hofer and Shimodaira, 2007); the other is how to recognize the user’s head motion and interpret its role in communication (e.g., Iwano et al., 1996, Watanuki et al., 2000, Graf et al., 2002, Dohen et al., 2006, Sidner et al., 2006, Morency et al., 2007, Burnham et al., 2007). The generation of natural head motion is not only useful for improving human–agent or human–robot interaction, but can also improve intelligibility in noisy environments. For example, in an experiment with animations, better perception of syllables in a speech-in-noise task was reported when natural head motion was shown, compared with speech presented without head motion or with the auditory stimulus alone (Munhall et al., 2004). In tonal languages, the intelligibility of tones may also be improved by head motion (Burnham et al., 2007).
Many works in the literature have analyzed the correlation between head motion and prosodic features, such as fundamental frequency (F0) contours, which represent pitch movements (Yehia et al., 2002, Munhall et al., 2004, Busso et al., 2007).
For example, Yehia et al. tried to associate head motion with speech through the fundamental frequency (F0) (Yehia et al., 2002). Experiments using read speech utterances of one American English speaker (ES) and one Japanese speaker (JS) showed the following results for estimating head motion from F0 and vice versa. Estimating F0 from head motion, the average correlation was 0.73 for JS and 0.88 for ES. The opposite estimation, from F0 to head motion, showed weaker correlations (0.25 for JS and 0.50 for ES, on average). In addition, the correlations between F0 and the 6 DOF (degrees of freedom) of head motion (3 DOF for rotation and 3 DOF for translation) were between 0.39 and 0.52 for ES, and between 0.22 and 0.30 for JS, i.e., below 0.50 on average. Munhall et al. (2004) report that head motion is correlated with the pitch and amplitude of the talker’s voice in Japanese read speech utterances, across all 6 DOF. For several sentences, correlations were almost always above 0.50, about 0.63 on average.
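For readers who wish to reproduce this kind of analysis, the per-DOF figures above are plain Pearson correlations between the F0 contour and each head motion channel. The following sketch illustrates the computation on synthetic data; the channel names, coupling strengths and signals are all invented for illustration (with real recordings, the F0 contour would first be interpolated through unvoiced regions and time-aligned with the motion-capture frames):

```python
import numpy as np

# Illustrative sketch only: synthetic F0 contour and six synthetic
# head motion channels (3 rotations + 3 translations), each weakly
# coupled to F0 so that the per-DOF correlations differ.
rng = np.random.default_rng(0)
n_frames = 500

f0 = np.sin(np.linspace(0, 8 * np.pi, n_frames)) + 0.1 * rng.standard_normal(n_frames)

# Hypothetical coupling strengths between F0 and each DOF.
coupling = np.array([0.5, 0.4, 0.3, 0.2, 0.2, 0.1])
head = coupling[:, None] * f0 + rng.standard_normal((6, n_frames))

dof_names = ["rot_x", "rot_y", "rot_z", "trans_x", "trans_y", "trans_z"]
# Pearson correlation between F0 and each head motion channel.
corrs = {name: float(np.corrcoef(f0, ch)[0, 1]) for name, ch in zip(dof_names, head)}
for name, r in corrs.items():
    print(f"{name}: r = {r:+.2f}")
```

With the stronger coupling on the rotation channels, the sketch reproduces the qualitative pattern of the reported results: moderate correlations for some DOF, weak ones for others.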
The results above imply that the correspondence between head motion and prosodic features is language dependent, and that the use of only prosodic information might not be enough to generate natural head motion.
Other works show that head motion may differ according to emotional state (Beskow et al., 2006, Busso et al., 2007). Beskow et al. analyzed the relation between facial parameters (including head motion) and several expressive modes for short read Swedish utterances in which focal accent was systematically varied (Beskow et al., 2006). Results indicated that in all expressive modes, words with focal accent were accompanied by greater variation of the facial parameters than words in non-focal positions. Regarding head motion, head pitch showed larger variations in the “certain”, “angry” and “confirming” modes, head yaw in the “angry” and “happy” modes, and head roll in the “certain” mode. Busso et al. compared the head motion of neutral and emotional speech for synthesis purposes (Busso et al., 2007). They investigated head motion for four emotional states (neutral, sadness, happiness and anger). As prosodic features, they used pitch (F0), RMS (root mean square) energy, and their first and second derivatives. For head motion, they took into account the 3 DOF of head rotation. Canonical correlation analysis (which provides a measure of the correlation between two streams of data of equal or different dimensionality) was applied to the streams of prosodic features and head motion, resulting in correlations around 0.7 for all expressive modes. However, as the prosodic features are implicit in the synthesis models, it is not clear which features are related to a specific emotion.
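Canonical correlation analysis of two such streams can be computed from the singular values of the product of their orthonormal bases (the cosines of the principal angles between the centered data spaces). The sketch below illustrates this on synthetic data; the dimensionalities follow the description above (6-D prosodic stream, 3-D head rotation stream), but the data, the shared latent signal and the noise levels are invented for illustration and are not those of the cited study:

```python
import numpy as np

def first_canonical_correlation(X, Y):
    """First canonical correlation between two data streams
    (rows = frames, columns = features). The canonical correlations
    are the singular values of Ux.T @ Uy, where Ux and Uy are
    orthonormal bases of the centered streams."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Ux, _, _ = np.linalg.svd(X, full_matrices=False)
    Uy, _, _ = np.linalg.svd(Y, full_matrices=False)
    s = np.linalg.svd(Ux.T @ Uy, compute_uv=False)
    return float(s[0])

# Synthetic example: a shared latent signal drives both streams.
rng = np.random.default_rng(1)
n_frames = 400
latent = rng.standard_normal(n_frames)

# Stream 1: 6-D prosodic features (e.g., F0, RMS energy and derivatives).
prosody = np.outer(latent, rng.standard_normal(6)) + 0.5 * rng.standard_normal((n_frames, 6))
# Stream 2: 3-D head rotation (pitch, yaw, roll).
head_rot = np.outer(latent, rng.standard_normal(3)) + 0.5 * rng.standard_normal((n_frames, 3))

r = first_canonical_correlation(prosody, head_rot)
print(f"first canonical correlation = {r:.2f}")
```

Because CCA finds the best-correlated linear projections of each stream, it can report a high correlation even when no single prosodic feature correlates strongly with any single rotation axis, which is one reason the feature-to-emotion mapping remains implicit.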
Most works focusing on head motion synthesis, such as those cited above, associate the acoustic features directly with the raw head motion measurements through models. However, to better understand the roles of head motion in speech communication, it may be more appropriate to analyze head motion events, such as nods, head shakes and head tilts.
For example, head nods are reported to be related to emphasis (or focus) (Sargin et al., 2006, Foster and Oberlander, 2007, Dohen et al., 2006). Variations in speech emphasis depending on head movement were observed by Graf et al. for English sentences (Graf et al., 2002). They reported that emphasis of a word often goes along with head nodding, and that a rise of the head can correspond to a rise in the voice. They call these movements ‘visual prosody’. A talking head that included ‘visual prosody’ through head motion was reported to look more natural even when this motion was not really connected with the content of the spoken text. Sargin et al. also reported correlations between head motion events (nods and head tilts) and speech prominences marked as pitch accents for English (Sargin et al., 2006). They carried out experiments with one native speaker of Canadian English to investigate the correlation between keyword speech (such as “left”, “right” and “straight”) and gestures (including hand and head gestures). Focusing on head nods and tilts, a correspondence of about 64% between these two head motion types and pitch accents was reported. Dohen et al. also reported that eyebrow raising and/or head nods signal focus in French (Dohen et al., 2006).
As can be observed from the past works described above, most of them focus on the relationship between head motion and prosodic features. However, we consider that this relationship might be language-dependent, since the function of the prosodic features differs depending on whether the language is a tonal language (such as Chinese or Thai), a lexical pitch-accent language (such as Japanese), or a stress-accent language (such as English and other European languages). The present work focuses on Japanese, for which the correlation between head motion and prosodic features has been reported to be lower than in English (Yehia et al., 2002). Furthermore, most of the past works described above analyzed read speech or acted emotional speech data from only a few speakers. Thus, the results reported in past works may not generalize to other languages or speaking styles.
In the present work, the analysis of the relationship between head motion and speech focuses on Japanese spontaneous speech. For Japanese, there are works reporting that head motion might be related to turn-taking and speech act functions in spontaneous dialogue speech. For example, Iwano et al. analyzed relations between head motion and the semantics of utterances in Japanese spoken dialogue, with the purpose of improving spoken dialogue understanding by also using visual information (Iwano et al., 1996). They also considered speaking turns and speech act functions. Their main findings were that affirmation, agreement and giving responses involve vertical movement of the head; that when the speaker wants a response from the listener, the speaker often looks up at his or her partner; and that when the listener moves his or her head vertically, he or she is giving an affirmative response to the speaker. Watanuki et al. analyzed relations between turn-taking and gestures (including head, hand and upper-body motions) in Japanese spontaneous dialogue data (Watanuki et al., 2000). They found that vertical head movements occur around utterance beginnings and before utterance ends, in both turn-change and turn-hold cases. However, they did not consider the direction of the vertical movements.
In our preliminary analysis, the relationships between head motion, dialogue acts (including turn-taking functions), and prosodic features (including voice quality) were investigated in spontaneous dialogue speech data of one Japanese female speaker (Ishi et al., 2007). As prosodic features, phrase-final tones were analyzed instead of the global F0 contours used in conventional works, since low correlations between F0 and head motion have been reported in past works for Japanese. Nonetheless, a better correspondence was found between head motion events and dialogue act functions than between head motion and prosodic features.
Based on the analysis results for one speaker, in the present work we extended the analysis to spontaneous dialogue speech data of multiple speakers, and investigated the effects of speaker variability on the relationship between head motion and speech. The present manuscript is an extension of our previous works on head motion analysis (Ishi et al., 2007, Ishi et al., 2008, Ishi et al., 2010).
The rest of the paper is organized as follows. Section 2 describes the speech, motion, and annotation data used in the analysis. In Section 3, the relationship between head motion and dialogue acts, and speaker variability in head motion are analyzed for multiple speaker data. In Section 4 the main conclusions are summarized.
Section snippets
Speech and motion data
Multi-modal dialogue speech data were collected for several pairs of speakers. Data from seven speakers (four male and three female) were used for the analysis in the present section. Fig. 1 shows the IDs (and ages) of the speakers and the relationships between them. IDs beginning with F and M denote female and male speakers, respectively. The lines linking two IDs indicate the pairs of speakers whose dialogue data were recorded. The dashed lines indicate the dialogues where the
Relationship between head motion and dialogue act functions
Firstly, the general trends of the head motion types for each dialogue act function were analyzed, disregarding speaker variability. Fig. 4 shows the overall distributions of the head motion types for each dialogue act type. The total number of occurrences for each dialogue act type is shown on the right side of each corresponding bar, for reference. A chi-squared test was conducted on the cross-distributions between head motion and dialogue act tags (chi-square(77) = 2355.8, p < 0.01). The head
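The chi-squared test of independence applied to the head motion by dialogue act cross-distributions can be sketched as follows. The contingency table below is invented and much smaller than the actual one (the actual analysis yielded chi-square(77) = 2355.8); the row and column labels are hypothetical:

```python
import numpy as np

# Illustrative contingency table of head motion types (rows) by
# dialogue act types (columns). Counts are invented for illustration.
observed = np.array([
    [120, 30, 10],   # nod
    [ 15, 60,  5],   # shake
    [ 20, 10, 80],   # tilt
])

row_tot = observed.sum(axis=1, keepdims=True)
col_tot = observed.sum(axis=0, keepdims=True)
total = observed.sum()

# Expected counts under independence of head motion and dialogue act.
expected = row_tot @ col_tot / total
chi2 = float(((observed - expected) ** 2 / expected).sum())
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
print(f"chi2({dof}) = {chi2:.1f}")
```

A statistic far above the critical value for the given degrees of freedom (here, 9.49 at p = 0.05 for 4 degrees of freedom) indicates that head motion type and dialogue act type are not independent.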
Conclusion
Analyses were conducted on the relationship between head motion and speech in spontaneous dialogue data of multiple speakers.
Among the head motion types, nods were the most frequent, appearing not only for expressing dialogue acts such as agreement or affirmation (in about 80% of backchannels), but also as indicative of syntactic or semantic units, appearing in about 30–40% of phrases with strong boundaries, including both turn-keeping and turn-giving dialogue act functions. In contrast, in
Acknowledgments
This work is partly supported by the Ministry of Internal Affairs and Communications. We thank Kyoko Nakanishi, Maiko Hirano, Chaoran Liu, Hiroaki Hatano and Mika Morita for their contributions to data annotation and analysis. We also thank Freerk Wilbers and Judith Haas for their contributions to the collection and processing of motion data.
References (22)
- Yehia et al., 2002. Linking facial animation, head motion and speech acoustics. J. Phonetics.
- Morency et al., 2007. Head gestures for perceptual interfaces: the role of context in improving recognition. Artif. Intell.
- et al., 2008. Facial expression and prosodic prominence: effects of modality and facial area. J. Phonetics.
- Munhall et al., 2004. Visual prosody and speech intelligibility – head movement improves auditory speech perception. Psychol. Sci.
- Watanabe et al., 2004. InterActor: speech-driven embodied interactive actor. Int. J. Hum. Comput. Interact.
- Sargin, M.E., Aran, O., Karpov, A., Ofli, F., Yasinnik, Y., Wilson, S., Erzin, E., Yemez, Y., Tekalp, A.M., 2006. …
- Beskow, J., Granstrom, B., House, D., 2006. Visual correlates to prominence in several expressive modes. In: …
- Busso et al., 2007. Rigid head motion in expressive speech animation: analysis and synthesis. IEEE Trans. Audio Speech Lang. Process.
- Foster and Oberlander, 2007. Corpus-based generation of head and eyebrow motion for an embodied conversational agent. Lang. Resour. Eval.
- Hofer, G., Shimodaira, H., 2007. Automatic head motion prediction from speech data. In: Proceedings of Interspeech …