
1 Introduction

We have developed a communication robot, “CATARO”, for elderly people in nursing facilities [1]. The robot can move its head, although it cannot move its hands or legs. Because a one-on-one conversation ends when the elderly person stops speaking, we plan to adopt two robots that talk with the elderly person according to a pre-scheduled conversation [2]. Our aim is for the elderly person to enjoy the conversation and feel familiarity with the robots. Chartrand and Bargh found that subjects whose movements were mirrored by a confederate liked that partner more (the chameleon effect [3]). Therefore, we presume that if the robot’s head movements synchronize with those of the elderly person, the elderly person will react favorably towards the robot.

Generally, people move their heads without premeditated thought during a conversation. They may also intentionally move their heads to match their partner’s position [4]. We think that the timing of head movements differs between individuals: one person does not move his/her head until the other finishes speaking, while another moves his/her head whenever the other’s speech breaks off. Therefore, an individual model of head movement is required so that the robot can move its head in synchronization with each person.

Busso et al. aimed to quantify the differences in head motion patterns displayed during expressive utterances. They used a hidden Markov model (HMM) to estimate a discrete representation of head poses from prosodic features [5]. Munhall et al. studied the impact of a talker’s head movement on speech perception; the head movement was correlated with the fundamental frequency and amplitude of the voice during speech [6].

In our research, the robot moves its head in real time in response to the speech of the other robot or the person. Therefore, we employ volume and pitch data from the speech of the person or the robot to estimate the appropriate timing of head movements.

In this paper, we investigate which method is appropriate for creating a learning model that estimates the timing of head movements in response to speech. We collect volume and pitch data from a radio program and record when each subject (human) moves their head. Then, we build three classifier models for each subject using three machine-learning methods, support vector machine (SVM) [7, 8], K-neighbors classifier [9, 10], and Random Forest classifier [11], implemented with scikit-learn [12], an open-source library for machine learning.

In the next section, we introduce the communication robot “CATARO”. Section 3 describes an experiment to construct classifier models that estimate the timing of head motion. In Sect. 4 we discuss which machine-learning method creates suitable classifier models for estimating the timing of head movements, and we conclude our research in Sect. 5.

2 CATARO

Figure 1 shows “CATARO [1]” (Care and Therapy Assistant RObot), a communication robot. The main body is 391 mm in height, 283 mm in width, and 200 mm in depth. A smartphone is mounted at the position of CATARO’s eyes, and its facial expressions are displayed on the smartphone screen. CATARO can learn and recognize the faces of patients through the mounted smartphone. Furthermore, the direction of the robot’s face is automatically adjustable over 180 degrees both horizontally and vertically [1].

Fig. 1. Framework (left side) and CATARO covered with a cloth (right side).

In nursing facilities for elderly people (care receivers), caregivers are very busy caring for the care receivers: toileting, eating, bathing, and dressing. Generally, the caregivers cannot have long conversations with the care receivers. On the other hand, some care receivers want to talk with someone about their old days, their family, today’s weather, and so on. Other care receivers cannot talk for a long time because they often run out of topics and become tired. However, the care receivers feel lonely when nobody talks to them.

Therefore, as shown in Fig. 2, we plan to adopt two CATAROs that talk with the elderly person according to a pre-scheduled conversation [2]. Even if the elderly person runs out of topics and stops speaking, he/she does not feel lonely because the CATAROs keep talking. Moreover, our aim is for the elderly person to enjoy the conversation and feel familiarity with the CATAROs. One solution is for CATARO’s head movements to synchronize with those of the elderly person. The elderly person may then feel that he/she and CATARO share the same values, and his/her closeness to CATARO may increase.

Fig. 2. Two CATAROs talk with an elderly person.

3 Experiment

3.1 Aim

We conducted an experiment to construct classifier models that estimate the timing of a nod, based on the volume and pitch of speech. We employed three machine-learning methods: support vector machine (SVM) [7, 8], K-neighbors classifier [9, 10], and Random Forest classifier [11]. We then determined the appropriate method by comparing the models produced by these three.

3.2 Method

In the experiment, we employed about 10 min of speech by a male radio personality. Because the radio personality spoke alone, speech recognition accuracy was relatively high. Each of the nine university students (S1–S9) listened to the speech, pushing a button on an application whenever they moved their head in response.

3.3 Volume and Pitch Data

Figure 3 shows the processing flow for creating a classifier model from the speech data. First, the audio data was input into speech conversion software supporting Audio Stream Input/Output (ASIO) through a Quad-Capture audio interface (Roland). The software calculated the volume and pitch of the audio data by Fourier transformation, acquiring about 200 values per second for each of volume and pitch. The data was continuously output to comma-separated value (CSV) files. Classifier models were then built using the three kinds of machine-learning models.

Fig. 3. Processing flow for making a classifier model from the speech data.
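
The speech conversion software itself is not reproduced here, but the following minimal Python sketch illustrates the same idea under stated assumptions: the audio is split into about 200 frames per second, the volume is taken as the RMS of each frame, the pitch is approximated by the FFT peak frequency, and the values are written to a CSV file. The function name, the RMS measure, and the peak-based pitch estimate are our assumptions, not the behavior of the original ASIO software.

```python
# Hypothetical sketch: compute ~200 volume/pitch values per second from a WAV file.
# This approximates the ASIO-based software in Fig. 3; it is not the original tool.
import csv
import numpy as np
from scipy.io import wavfile

def extract_volume_and_pitch(wav_path, csv_path, frames_per_second=200):
    rate, audio = wavfile.read(wav_path)          # rate: samples/s, audio: int array
    if audio.ndim > 1:                            # mix stereo down to mono
        audio = audio.mean(axis=1)
    audio = audio.astype(np.float64)
    hop = rate // frames_per_second               # samples per analysis frame
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["time_s", "volume", "pitch_hz"])
        for start in range(0, len(audio) - hop, hop):
            frame = audio[start:start + hop]
            volume = float(np.sqrt(np.mean(frame ** 2)))        # RMS volume
            spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
            freqs = np.fft.rfftfreq(len(frame), d=1.0 / rate)
            pitch = float(freqs[np.argmax(spectrum[1:]) + 1])   # crude peak-frequency pitch
            writer.writerow([start / rate, volume, pitch])
```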

3.4 Data Set

Figure 4 shows how a data set was calculated for each head movement (response). The volume and pitch data were written at 200 lines per second. A window of 400 lines (400 volume values and 400 pitch values, i.e., two seconds) was treated as one data set, and two arrays were prepared for each data set: one to store the volume data and the other to store the pitch data. The yellow part in Fig. 4 shows that these arrays (“volume array” and “pitch array”) were shifted one line at a time over the interval from three seconds before the head movement up to the movement itself, and the volume and pitch data were stored in the arrays at each shift. Since the three-second interval contains 600 lines and each window spans 400 lines, we obtained 200 data sets per response. These data sets were labeled “1”.

Fig. 4. The way of collecting data for the learning model.

On the other hand, 400-value windows of volume and pitch data taken from the beginning of the audio file, outside the spans before and after the timing of the head movements, were treated as no-response data. These data sets were labeled “0”.
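
A minimal sketch of this data-set construction is given below, assuming the volume and pitch arrays from the CSV described in Sect. 3.3 and a list of button-press times; the array layout (400 volume values followed by 400 pitch values per data set) and the helper names are our assumptions.

```python
# Hypothetical sketch of the data-set construction in Sect. 3.4.
# volume, pitch: arrays with 200 values per second; response_times: button-press times (s).
import numpy as np

LINES_PER_SECOND = 200
WINDOW = 400          # 400 volume values + 400 pitch values per data set (2 s)
SHIFTS = 200          # one-line shifts covering the 3 s before each head movement

def make_response_sets(volume, pitch, response_times):
    X, y = [], []
    for t in response_times:
        end = int(t * LINES_PER_SECOND)              # line index of the head movement
        span_start = end - 3 * LINES_PER_SECOND
        for shift in range(SHIFTS):                  # slide the 400-line window one line at a time
            s = span_start + shift
            if s < 0 or s + WINDOW > end:
                continue
            X.append(np.concatenate([volume[s:s + WINDOW], pitch[s:s + WINDOW]]))
            y.append(1)                              # label 1: response data
    return np.array(X), np.array(y)

def make_no_response_sets(volume, pitch, response_times, n_sets):
    # Take 400-line windows from the start of the file that do not overlap
    # the spans around head movements; label them 0.
    blocked = set()
    for t in response_times:
        end = int(t * LINES_PER_SECOND)
        blocked.update(range(end - 3 * LINES_PER_SECOND, end + WINDOW))
    X, y, s = [], [], 0
    while len(X) < n_sets and s + WINDOW <= len(volume):
        if not any(i in blocked for i in range(s, s + WINDOW)):
            X.append(np.concatenate([volume[s:s + WINDOW], pitch[s:s + WINDOW]]))
            y.append(0)
        s += WINDOW
    return np.array(X), np.array(y)
```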

3.5 Scikit-Learn

Scikit-learn [12] is an open-source Python library for machine learning. It provides various algorithms, such as clustering, regression, and dimensionality reduction. Figure 5 shows the scikit-learn algorithm cheat-sheet [12], which we consulted when selecting the methods to build classifier models for the timing of head movements. Scikit-learn also has a grid search function that automatically optimizes the parameters of a machine-learning model; we used grid search for the Random Forest classifier.

Fig. 5. Scikit-learn algorithm cheat-sheet [12].

3.6 Building a Classifier Model Based on SVC

In scikit-learn, support vector classification (SVC) is the implementation of the support vector machine (SVM) [7, 8]. The SVM is a supervised learning technique used for classification, regression, and outlier detection. SVC finds the boundary between label 0 and label 1 from the training data and predicts the label of sample data. Scikit-learn provides two kinds of SVC, kernel-based and linear; in this experiment, we used only the linear form.
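
As a minimal sketch, a linear SVC model can be built in scikit-learn as follows. X and y stand for the feature matrix and labels from the data sets in Sect. 3.4, the 9:1 split matches the ratio described in Sect. 3.9, and all other parameters are left at their defaults because they are not reported here.

```python
# Minimal sketch: linear SVC on the volume/pitch data sets (default parameters assumed).
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)  # 9:1 split
svc_model = LinearSVC()
svc_model.fit(X_train, y_train)
print("accuracy:", svc_model.score(X_test, y_test))
```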

3.7 Building a Classifier Model Based on the K-Neighbors Classifier

The K-neighbors classifier [9, 10] determines the label of sample data by a majority decision among the k training samples nearest to the sample. In the scikit-learn K-neighbors classifier, we set two parameters: weights and n_neighbors. We selected “distance” for weights, so that closer neighbors of a query point carry more weight than distant neighbors. The n_neighbors parameter sets the number of neighbors to use; if it is too small or too large, the learning model cannot correctly predict the label of the sample data. Therefore, we selected the default value of 5.
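
A corresponding sketch for the K-neighbors classifier with the parameters described above, reusing the hypothetical X_train and y_train from the SVC sketch:

```python
# Minimal sketch: K-neighbors classifier with weights="distance" and n_neighbors=5.
from sklearn.neighbors import KNeighborsClassifier

knn_model = KNeighborsClassifier(n_neighbors=5, weights="distance")
knn_model.fit(X_train, y_train)        # X_train, y_train as in the SVC sketch
print("accuracy:", knn_model.score(X_test, y_test))
```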

3.8 Building a Classifier Model Based on the Random Forest Classifier

The Random Forest classifier [11] is a kind of ensemble classifier in scikit-learn. The learning model predicts the label of the sample data by combining multiple decision trees, each of which is built from a bootstrap sample of the training data. A bootstrap sample is obtained by resampling the dataset with replacement. Three parameters, n_estimators, max_features, and max_depth, were set using grid search, a function of scikit-learn that automatically sets machine-learning parameters to optimal values. The parameter class_weight was set to “balanced”. The candidates for n_estimators were seven values: 5, 10, 20, 30, 50, 100, and 300; for max_features, five values: 3, 5, 10, 15, and 20; and for max_depth, ten values: 3, 5, 10, 15, 20, 25, 30, 40, 50, and 100. Table 1 shows the parameters selected for each subject by the grid search.

Table 1. The parameters selected for each subject by the grid search.
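
The grid described above maps directly onto scikit-learn's GridSearchCV. The following is a minimal sketch, assuming the feature matrix X_train and labels y_train from the earlier sketches; the cv=10 setting is our assumption, mirroring the 10-fold cross-validation of Sect. 3.9, and is not stated for the grid search itself.

```python
# Sketch of the grid search over the Random Forest parameters listed above.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [5, 10, 20, 30, 50, 100, 300],
    "max_features": [3, 5, 10, 15, 20],
    "max_depth": [3, 5, 10, 15, 20, 25, 30, 40, 50, 100],
}
grid = GridSearchCV(
    RandomForestClassifier(class_weight="balanced"),
    param_grid,
    cv=10,                     # assumption: same 10-fold scheme as in Sect. 3.9
)
grid.fit(X_train, y_train)
print("best parameters:", grid.best_params_)      # reported per subject in Table 1
rf_model = grid.best_estimator_
```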

3.9 Analysis

The classifier models generated by the three methods were evaluated for each subject in terms of accuracy, precision, recall, and F-score, calculated using k-fold cross-validation (k = 10). The volume and pitch data of label 1 and label 0 were divided into training and test data at a ratio of 9:1.

Then, we recreated the classifier models using one subject’s data as the training data, to examine whether the timing of head movements while listening to a partner’s speech can be estimated across subjects.
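
A minimal sketch of this evaluation with scikit-learn's cross_validate is shown below, assuming the rf_model and the data sets X and y from the earlier sketches; the exact scoring setup used in the paper is not reported, so this only illustrates the 10-fold scheme.

```python
# Sketch of the evaluation in Sect. 3.9: 10-fold cross-validation of accuracy,
# precision, recall, and F-score for one subject's classifier model.
from sklearn.model_selection import cross_validate

scores = cross_validate(
    rf_model, X, y, cv=10,
    scoring=("accuracy", "precision", "recall", "f1"),
)
for metric in ("accuracy", "precision", "recall", "f1"):
    print(metric, scores["test_" + metric].mean())
```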

3.10 Result

Tables 2, 3, and 4 show the recall, precision, F-score, and accuracy of the three kinds of classifier models built by linear SVC, the K-neighbors classifier, and the Random Forest classifier for each subject. The average accuracy rates of the three classifier models were 0.61, 0.81, and 0.95, respectively, and the average F-scores were 0.39, 0.65, and 0.86, respectively. These results show that the Random Forest classifier is the most suitable method for modeling head movements in response to speech.

Table 2. The results of classifier models using linear SVC.
Table 3. The results of classifier models using the K-neighbors classifier.
Table 4. The results of classifier models using the Random Forest classifier.

Table 5 shows the average precision ratio, recall ratio, F-score, and accuracy rate of the classifier models built with the Random Forest classifier using one subject’s data as the training data. The average F-scores range from 0.45 to 0.57 and the accuracy rates from 0.64 to 0.82, whereas the corresponding averages for the individual classifier models are 0.86 and 0.95, respectively (see Table 4).

Table 5. The results of classifier models built with the Random Forest classifier using one subject’s data as the training data.

4 Discussion

Even if the robot cannot reply to the elderly person appropriately, the elderly person is not discouraged when two robots continue to talk in front of him/her [2]. Moreover, if the robot moves its head in synchronization with the elderly person, he/she may consider the robot friendly. The results of the experiment demonstrated that the volume and pitch of speech are useful for estimating the timing of head movements when the classifier model is built for an individual. The results also showed that the timing of head movements differs among the subjects.

In this experiment, the Random Forest classifier was the most appropriate of the three methods for creating the classifier model. The Random Forest algorithm is based on ensemble learning: it creates many decision trees from randomly selected data samples, and the final label is decided by aggregating the votes of the trees, which makes the classifier more accurate. If an individual classifier model for head movement is to be built, we suggest the Random Forest classifier.

5 Conclusion

In this paper, we investigated whether the timing of head movements can be estimated from the volume and pitch of speech, and which of three learning methods (SVM, K-neighbors classifier, and Random Forest classifier) is most useful for building a classifier model that estimates the timing of head movements. In the experiment, each of nine university students listened to about 10 min of speech by a male radio personality and pushed a button on an application whenever they moved their head in response. The experimental results showed that the volume and pitch were useful for estimating the timing of head movements, and that the Random Forest classifier is the most effective method for building an individual classifier model.

In future work, we will construct two robots that move their heads in synchronization with an elderly person, based on the individual classifier models.