Abstract
Our research aims to create two friendly communication robots that talk with elderly people in nursing facilities. If the robots synchronize their head movements in response to the elderly person, the elderly person may react favorably to the robots and enjoy talking with them. In this paper, we investigated whether the volume and pitch of speech are useful data for estimating the timing of head movements. Because the robots need to move their heads in real time while one of the robots or the person is talking, we focus on the volume and pitch of the speech, not its content. Moreover, we clarified which machine learning method creates suitable classifier models for estimating the timing of head movements. The experimental results showed that the Random Forest classifier was the most suitable method.
1 Introduction
We have developed a communication robot, “CATARO,” for elderly people in nursing facilities [1]. The robot can move its head, although it cannot move its hands or legs. We plan to adopt two robots that talk with the elderly person according to a pre-scheduled conversation [2], because otherwise the conversation ends when the elderly person stops speaking. Our aim is for the elderly person to enjoy the conversation and feel familiarity with the robots. Chartrand and Bargh found that subjects whose movements were mirrored by a confederate liked that partner more (the chameleon effect [3]). Therefore, we presume that if the robot’s head movements synchronize with the elderly person’s, the elderly person will react favorably towards the robot.
Generally, people move their heads without premeditated thought during a conversation. They may also intentionally move their heads to match their partners’ movements [4]. We think that the timing of head movements differs between people: one person does not move his/her head until the other finishes speaking, while another moves his/her head whenever the other’s speech breaks off. Therefore, an individual model of head movement is required so that the robot can move its head in synchronization with each person.
Busso et al. aimed to quantify differences in the head motion patterns displayed under expressive utterances. They used a hidden Markov model (HMM) to estimate a discrete representation of head poses from prosodic features [5]. Munhall et al. studied the impact of a talker’s head movement on speech perception; the head movement was correlated with the fundamental frequency and amplitude of the voice during speech [6].
In our research, the robot moves its head in real time in response to the speech of the other robot or the person. Therefore, we employ volume and pitch data from the speech of the person or the robot to estimate the appropriate timing of head movements.
In this paper, we investigate which method is appropriate for creating a learning model that estimates the timing of head movements in response to speech. We collect volume and pitch data from a radio program and record when each subject (human) moves their head. Then, we build three kinds of classifier models for each subject using three machine learning methods: support vector machine (SVM) [7, 8], K-neighbors classifier [9, 10], and Random Forest classifier [11], implemented with scikit-learn [12], an open-source machine learning library.
In the next section, we introduce a communication robot, “CATARO”. Section 3 shows an experiment to construct a classifier model estimating the head motion timing. We discuss which machine learning method creates suitable classifier models for estimating the timing of head movements in Sect. 4, and we conclude our research in Sect. 5.
2 CATARO
Figure 1 shows “CATARO” (Care and Therapy Assistant RObot), a communication robot [1]. The main body is 391 mm in height, 283 mm in width, and 200 mm in depth. A smartphone is mounted at the position of CATARO’s eyes, and the robot’s facial expressions are displayed on the smartphone’s screen. CATARO can learn and recognize the faces of patients through the mounted smartphone. Further, the direction of the robot’s face is automatically adjustable (180 degrees horizontally and vertically) [1].
In nursing facilities, caregivers are very busy caring for the elderly residents (care receivers): toileting, eating, bathing, and dressing. Generally, the caregivers cannot have long conversations with the care receivers. On the other hand, some care receivers want to talk with someone about their old days, their family, today’s weather, and so on. Other care receivers cannot talk for long because they often run out of topics and get tired. However, the care receivers feel lonely when nobody talks to them.
Therefore, as shown in Fig. 2, we plan to adopt two CATAROs that talk with the elderly person according to a pre-scheduled conversation [2]. Even if the elderly person runs out of topics and stops speaking, he/she does not feel lonely because the CATAROs keep talking. Moreover, our aim is for the elderly person to enjoy the conversation and feel familiarity with the CATAROs. One approach is to synchronize CATARO’s head movements with the elderly person’s. The elderly person may then feel that he/she and the CATARO share the same values, and his/her closeness to the CATARO may increase.
3 Experiment
3.1 Aim
We conducted an experiment to construct a classifier model estimating the head motion timing of a nod, based on the volume and pitch of speech. We employed three methods of machine learning: support vector machine (SVM) [7, 8], K-neighbors classifier [9, 10], and Random Forest classifier [11]. Then, an appropriate method was determined by comparing these three.
3.2 Method
In the experiment, we employed about 10 min of speech by a male radio personality. Because the radio personality spoke alone, speech recognition accuracy was relatively high. Each of nine university students (S1–S9) listened to the speech and pushed a button in an application whenever they moved their head in response.
3.3 Volume and Pitch Data
Figure 3 shows the processing flow for creating a classifier model from the speech data. First, the audio data was input, through a QUAD-CAPTURE audio interface (Roland), into speech conversion software supporting Audio Stream Input/Output (ASIO). The software then calculates the volume and pitch of the audio data by Fourier transformation. About 200 values each of volume and pitch were acquired per second and continuously written to comma-separated values (CSV) files. Finally, classifier models were built using the three machine learning methods.
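This per-frame extraction can be sketched as follows. Note this is a minimal NumPy illustration, not the ASIO-based software the authors used: volume is estimated as RMS amplitude and pitch as the dominant frequency in the FFT magnitude spectrum, and the 5 ms frame length (about 200 frames per second) is an assumption based on the reported data rate.

```python
import numpy as np

def volume_and_pitch(frame, sample_rate):
    """Estimate volume (RMS amplitude) and pitch (dominant frequency in the
    FFT magnitude spectrum) for one short audio frame."""
    volume = np.sqrt(np.mean(frame ** 2))        # RMS amplitude
    spectrum = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    pitch = freqs[np.argmax(spectrum[1:]) + 1]   # skip the DC bin
    return volume, pitch

# A 5 ms frame (about 200 frames per second) of a 440 Hz tone at 44.1 kHz
sr = 44100
t = np.arange(int(sr * 0.005)) / sr
frame = np.sin(2 * np.pi * 440.0 * t)
vol, pit = volume_and_pitch(frame, sr)
```

With such short frames the frequency resolution is coarse (about 200 Hz here), so a real pitch tracker would use longer windows or autocorrelation; the sketch only shows the kind of values written to the CSV files.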
3.4 Data Set
Figure 4 shows how a data set is calculated for each head movement (response). The volume and pitch data were written at 200 lines per second. 400 consecutive lines (400 volume values and 400 pitch values, i.e., two seconds) were considered one data set, and two arrays were prepared for each data set: one to store the volume data and the other to store the pitch data. The yellow part in Fig. 4 shows that these arrays (“volume array” and “pitch array”) are shifted one line at a time within the interval from zero to three seconds before the head movement, and the volume and pitch data are stored in each array at every shift. In this way, we obtained 200 data sets per response. These data sets are labeled “1”.
On the other hand, 400-line windows of volume and pitch data taken from portions of the audio file outside the intervals around the timing of head movements are considered no-response data. These data sets are labeled “0”.
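The sliding-window labeling described above can be sketched as follows. The 200-per-second rate and the 400-line (2 s) window come from the paper; the function name, the flat concatenation of the two arrays, and the exact alignment of the 3 s interval are illustrative assumptions.

```python
import numpy as np

RATE = 200      # volume/pitch values per second
WINDOW = 400    # one data set = 400 consecutive lines (2 s)

def response_windows(volume, pitch, response_idx):
    """Return the 200 labeled windows obtained by sliding a 400-line window
    one line at a time through the 3 s before the response at response_idx."""
    sets = []
    earliest = response_idx - 3 * RATE           # 3 s before the head movement
    for start in range(earliest, earliest + RATE):
        features = np.concatenate([volume[start:start + WINDOW],
                                   pitch[start:start + WINDOW]])
        sets.append((features, 1))               # label "1": response window
    return sets

# Synthetic example: 10 s of data with a head movement at t = 5 s
volume = np.random.rand(10 * RATE)
pitch = np.random.rand(10 * RATE)
data_sets = response_windows(volume, pitch, response_idx=5 * RATE)
```

Label-0 windows would be cut the same way from regions away from any response, giving feature vectors of identical shape for the classifiers.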
3.5 Scikit-Learn
Scikit-learn [12] is an open-source Python library for machine learning. It provides various algorithms, such as clustering, regression, and dimensionality reduction. Figure 5 shows the scikit-learn algorithm cheat-sheet [12], which we consulted when selecting our methods for building classifier models for the timing of head movements. Scikit-learn also has a grid search function that automatically optimizes the parameters of a machine learning model; we used grid search for the Random Forest classifier.
3.6 For Building Classifier Model Based on SVC
In scikit-learn, support vector classification (SVC) implements the support vector machine (SVM) [7, 8]. The SVM is a supervised learning technique used for classification, regression, and outlier detection. SVC detects the boundary between label 0 and label 1 using training data and predicts the label of sample data. Scikit-learn provides two kinds of SVC, kernel-based and linear, with different parameters to set. In this experiment, we used only the linear one.
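A minimal sketch of the linear SVC step is shown below. The toy feature matrix is an assumption standing in for the 800-value (volume + pitch) windows; only the choice of a linear SVC comes from the paper.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy stand-in for the labeled windows: label-1 vectors have higher values
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.2, 0.05, (50, 800)),
               rng.normal(0.8, 0.05, (50, 800))])
y = np.array([0] * 50 + [1] * 50)

clf = LinearSVC(max_iter=10000)   # linear SVC, as used in the experiment
clf.fit(X, y)
```

After fitting, `clf.predict` labels a new window as response (1) or no-response (0) according to which side of the learned boundary it falls on.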
3.7 For Building Classifier Model Based on K-Neighbors Classifier
The K-neighbors classifier [9, 10] determines the label of sample data through a majority decision among the k training samples nearest to that sample. In scikit-learn’s K-neighbors classifier, we can set two parameters: weights and n_neighbors. We selected “distance” for weights; in this case, closer neighbors of a query point carry more weight than distant neighbors. The n_neighbors parameter sets the number of neighbors to use. If n_neighbors is too small or too large, the learning model will not be able to correctly predict the label of the sample data. Therefore, we selected “5”, which is the default.
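The corresponding scikit-learn call can be sketched as follows; the parameter settings (n_neighbors=5, weights="distance") are those stated above, while the toy data is an assumption.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy stand-in for the labeled (volume + pitch) windows
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.2, 0.05, (50, 800)),
               rng.normal(0.8, 0.05, (50, 800))])
y = np.array([0] * 50 + [1] * 50)

# weights="distance": closer neighbors carry more weight than distant ones
knn = KNeighborsClassifier(n_neighbors=5, weights="distance")
knn.fit(X, y)
```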
3.8 For Building Classifier Model Based on Random Forest Classifier
The Random Forest classifier [11] is a kind of ensemble classifier in scikit-learn. The model predicts the label of the sample data by combining the votes of multiple decision trees. Each tree is built from a bootstrap sample of the training data; a bootstrap sample is a resampling technique that samples a dataset with replacement. Three parameters, n_estimators, max_features, and max_depth, were set using grid search, one of scikit-learn’s functions, which automatically sets machine learning parameters to their optimum values. The parameter class_weight was set to “balanced”. n_estimators was searched over 7 values: 5, 10, 20, 30, 50, 100, and 300. max_features was searched over 5 values: 3, 5, 10, 15, and 20. max_depth was searched over 10 values: 3, 5, 10, 15, 20, 25, 30, 40, 50, and 100. Table 1 shows the parameters selected for each subject by the grid search.
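The grid search over these parameters can be sketched with scikit-learn's GridSearchCV as below. The parameter names, the values shown, and class_weight="balanced" come from the paper; the grid is deliberately shortened here, and the toy 20-feature data and cv=3 are assumptions to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy stand-in for one subject's labeled windows
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.2, 0.05, (50, 20)),
               rng.normal(0.8, 0.05, (50, 20))])
y = np.array([0] * 50 + [1] * 50)

# Subset of the paper's grid, shortened to keep the example fast
param_grid = {
    "n_estimators": [5, 10, 20],
    "max_features": [3, 5, 10],
    "max_depth": [3, 5, 10],
}
search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=0),
    param_grid, cv=3)
search.fit(X, y)
best_params = search.best_params_   # per-subject optimum, as in Table 1
```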
3.9 Analysis
The classifier models generated by the three methods were evaluated for each subject based on accuracy, precision, recall, and F-score, calculated using k-fold cross-validation (k = 10). The volume and pitch data, label 1 and label 0, were divided into training and test data in a ratio of 9:1.
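This evaluation can be sketched with scikit-learn's cross_validate, which performs the 10-fold split (9:1 training to test in each fold) and computes all four metrics; the Random Forest settings and the toy data are stand-ins for one subject's model and windows.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Toy stand-in for one subject's labeled windows
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.2, 0.05, (100, 20)),
               rng.normal(0.8, 0.05, (100, 20))])
y = np.array([0] * 100 + [1] * 100)

# 10-fold cross-validation with the four metrics used in the paper
scores = cross_validate(
    RandomForestClassifier(class_weight="balanced", random_state=0),
    X, y, cv=10,
    scoring=("accuracy", "precision", "recall", "f1"))
mean_f1 = scores["test_f1"].mean()
```

Each entry of `scores` holds one value per fold; averaging them gives the per-subject figures of the kind reported in Tables 2, 3, and 4.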
Then, we recreated the classifier models using one subject’s data as training data, to examine whether the timing of head movements while listening to a partner’s speech can be estimated for the other subjects.
3.10 Result
Tables 2, 3 and 4 show the recall, precision, F-score, and accuracy of the three kinds of classifier models built by linear SVC, K-neighbors classifier, and Random Forest classifier for each subject. The average accuracy rates were 0.61, 0.81, and 0.95 for the three classifier models, respectively, and the average F-scores were 0.39, 0.65, and 0.86, respectively. These results show that the Random Forest classifier is the most suitable method for modeling head movements in response to speech.
Table 5 shows the average precision ratio, recall ratio, F-score, and accuracy rate of each classifier model trained on one subject’s data by the Random Forest classifier. The average F-scores range from 0.45 to 0.57 and the accuracy rates from 0.64 to 0.82, whereas the individual classifier models achieved 0.86 (F-score) and 0.95 (accuracy), respectively (see Table 4).
4 Discussion
Even if the robot cannot reply to the elderly person appropriately, the elderly person is not discouraged when two robots continue to talk in front of them [2]. Moreover, if the robot synchronizes its head movements in response to the elderly person, they may consider the robot friendly. The results of the experiment demonstrated that the volume and pitch of speech are useful data for estimating the timing of head movements when the classifier model is based on an individual. The results also showed that the timing of head movements differs among subjects.
In this experiment, the Random Forest classifier was the most appropriate of the three methods for creating the classifier model. The Random Forest algorithm is based on ensemble learning: it creates various decision trees on randomly selected data samples, and the final label is decided by the proportion of trees voting for it, which raises the precision of the classifier. If an individual classifier model for head movement is to be built, we suggest the Random Forest classifier.
5 Conclusion
In this paper, we investigated whether the timing of head movements can be estimated based on the volume and pitch of speech, and which of three learning methods (SVM, K-neighbors classifier, and Random Forest classifier) is most useful for a classifier model estimating that timing. In the experiment, each of nine university students listened to about 10 min of speech by a male radio personality and pushed a button in an application whenever they moved their head in response. The experimental results showed that the volume and pitch were useful for estimating the timing of head movements, and that the Random Forest classifier is the most effective method for building the individual classifier model.
In future work, we will construct two robots that move their heads in synchronization with an elderly person, based on the individual classifier model.
References
Hock, P., Oshima, C., Nakayama, K.: CATARO: a robot that tells caregivers a patient’s current non-critical condition indirectly. In: Proceedings of the Genetic and Evolutionary Computation Conference, GECCO 2018, pp. 1841–1844. ACM (2018)
Iio, T., Yoshikawa, Y., Ishiguro, H.: Pre-scheduled turn-taking between robots to make conversation coherent. In: Proceedings of the Fourth International Conference on Human Agent Interaction, pp. 19–25. ACM (2016)
Chartrand, T.L., Bargh, J.A.: The chameleon effect: the perception–behavior link and social interaction. J. Pers. Soc. Psychol. 76(6), 893 (1999)
Stamenov, M., Gallese, V. (eds.): Mirror Neurons and the Evolution of Brain and Language, vol. 42. John Benjamins Publishing, Amsterdam (2002)
Busso, C., Deng, Z., Neumann, U., Narayanan, S.: Learning expressive human-like head motion sequences from speech. In: Deng, Z., Neumann, U. (eds.) Data-Driven 3D Facial Animation, pp. 113–131. Springer, London (2008). https://doi.org/10.1007/978-1-84628-907-1_6
Munhall, K.G., Jones, J.A., Callan, D.E., Kuratate, T., Vatikiotis-Bateson, E.: Visual prosody and speech intelligibility: head movement improves auditory speech perception. Psychol. Sci. 15(2), 133–137 (2004)
Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)
Burges, C.J.: A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Disc. 2(2), 121–167 (1998)
Weill, P.: The relationship between investment in information technology and firm performance: a study of the valve manufacturing sector. Inform. Syst. Res. 3(4), 307–333 (1992)
Suguna, N., Thanushkodi, K.: An improved k-nearest neighbor classification using genetic algorithm. Int. J. Comput. Sci. Issues 7(2), 18–21 (2010)
Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
scikit-learn. https://scikit-learn.org/stable/index.html
Acknowledgment
This work was supported by JSPS KAKENHI Grant Number 17K20011.
© 2019 Springer Nature Switzerland AG
Yanagi, H., Oshima, C., Nakayama, K. (2019). Estimating Timing of Head Movements Based on the Volume and Pitch of Speech. In: Yamamoto, S., Mori, H. (eds) Human Interface and the Management of Information. Information in Intelligent Systems. HCII 2019. Lecture Notes in Computer Science(), vol 11570. Springer, Cham. https://doi.org/10.1007/978-3-030-22649-7_26