
1 Introduction

We have developed a communication robot, “CATARO”, for elderly people in nursing facilities [1]. The robot can move its head, although it cannot move its hands or legs. Because a one-on-one conversation ends when the elderly person stops speaking, we plan to adopt two robots that talk with the elderly person according to a pre-scheduled conversation [2]. Our aim is for the elderly person to enjoy the conversation and feel familiarity with the robots. Chartrand and Bargh found that subjects whose movements were mirrored by a confederate liked that partner more (the chameleon effect [3]). Therefore, we presume that if the robot’s head movements synchronize with those of the elderly person, the elderly person will react favorably towards the robot.

Generally, people move their heads without premeditated thought during a conversation. They may also intentionally move their heads to match their partner’s position [4]. We think that the timing of head movements differs between individuals: one person does not move his/her head until the other finishes speaking, while another moves his/her head whenever the other’s speech breaks off. Therefore, an individual model of head movement is required so that the robot can move its head in synchronization with each person.

Busso et al. aimed to quantify the differences in head motion patterns displayed during expressive utterances. They used a hidden Markov model (HMM) to estimate a discrete representation of head poses from prosodic features [5]. Munhall et al. studied the impact of a talker’s head movement on speech perception; the head movement was correlated with the fundamental frequency and amplitude of the voice during speech [6].

In our research, the robot moves its head in real time in response to the speech of the other robot or the person. Therefore, we employ volume and pitch data from the speech of the person or the robot to estimate the appropriate timing of head movements.

In this paper, we investigate which method is appropriate for creating a learning model that estimates the timing of head movements in response to speech. We collect volume and pitch data from a radio program and record when each subject (human) moves their head. Then, we build three classifier models for each subject using three machine-learning methods, support vector machine (SVM) [7, 8], K-neighbors classifier [9, 10], and Random Forest classifier [11], implemented with scikit-learn [12], an open-source library for machine learning.

In the next section, we introduce the communication robot “CATARO”. Section 3 describes an experiment to construct classifier models that estimate the timing of head motion. In Sect. 4 we discuss which machine-learning method creates suitable classifier models for estimating the timing of head movements, and we conclude our research in Sect. 5.

2 CATARO

Figure 1 shows “CATARO [1]” (Care and Therapy Assistant RObot), a communication robot. The main body is 391 mm in height, 283 mm in width, and 200 mm in depth. A smartphone is mounted at the position of CATARO’s eyes, and its facial expressions are displayed on the smartphone screen. CATARO can learn and recognize the faces of patients through the mounted smartphone. Furthermore, the direction of the robot’s face is automatically adjustable over 180 degrees both horizontally and vertically [1].

Fig. 1. Framework (left side) and CATARO covered with a cloth (right side).

In nursing facilities for elderly people (care receivers), caregivers are very busy caring for the care receivers: toileting, eating, bathing, and dressing. Generally, the caregivers cannot have long conversations with the care receivers. On the other hand, some care receivers want to talk with someone about their old days, their family, today’s weather, and so on. Other care receivers cannot talk for a long time because they often run out of topics and become tired. However, the care receivers feel lonely when nobody talks to them.

Therefore, as shown in Fig. 2, we plan to adopt two CATAROs that talk with the elderly person according to a pre-scheduled conversation [2]. Even if the elderly person runs out of topics and stops speaking, he/she does not feel lonely because the CATAROs keep talking. Moreover, our aim is for the elderly person to enjoy the conversation and feel familiarity with the CATAROs. One solution is for CATARO’s head movements to synchronize with those of the elderly person. The elderly person may then feel that he/she and CATARO share the same values, and his/her closeness to CATARO may increase.

Fig. 2. Two CATAROs talk with an elderly person.

3 Experiment

3.1 Aim

We conducted an experiment to construct classifier models that estimate the timing of a nod, based on the volume and pitch of speech. We employed three machine-learning methods: support vector machine (SVM) [7, 8], K-neighbors classifier [9, 10], and Random Forest classifier [11]. We then determined the appropriate method by comparing the models produced by these three.

3.2 Method

In the experiment, we employed about 10 min of speech by a male radio personality. Because the radio personality spoke alone, speech recognition accuracy was relatively high. Each of the nine university students (S1–S9) listened to the speech, pushing a button on an application whenever they moved their head in response.

3.3 Volume and Pitch Data

Figure 3 shows the processing flow for creating a classifier model from the speech data. First, the audio data was input into speech conversion software supporting Audio Stream Input/Output (ASIO) through a Quad-Capture audio interface (Roland). The software calculated the volume and pitch of the audio data by Fourier transformation, acquiring about 200 values per second for each of volume and pitch. The data was continuously output to comma-separated value (CSV) files. Classifier models were then built using the three kinds of machine-learning models.

Fig. 3. Processing flow for making a classifier model from the speech data.
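
The speech conversion software itself is not reproduced here, but the following minimal Python sketch illustrates the same idea under stated assumptions: the audio is split into about 200 frames per second, the volume is taken as the RMS of each frame, the pitch is approximated by the FFT peak frequency, and the values are written to a CSV file. The function name, the RMS measure, and the peak-based pitch estimate are our assumptions, not the behavior of the original ASIO software.

```python
# Hypothetical sketch: compute ~200 volume/pitch values per second from a WAV file.
# This approximates the ASIO-based software in Fig. 3; it is not the original tool.
import csv
import numpy as np
from scipy.io import wavfile

def extract_volume_and_pitch(wav_path, csv_path, frames_per_second=200):
    rate, audio = wavfile.read(wav_path)          # rate: samples/s, audio: int array
    if audio.ndim > 1:                            # mix stereo down to mono
        audio = audio.mean(axis=1)
    audio = audio.astype(np.float64)
    hop = rate // frames_per_second               # samples per analysis frame
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["time_s", "volume", "pitch_hz"])
        for start in range(0, len(audio) - hop, hop):
            frame = audio[start:start + hop]
            volume = float(np.sqrt(np.mean(frame ** 2)))        # RMS volume
            spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
            freqs = np.fft.rfftfreq(len(frame), d=1.0 / rate)
            pitch = float(freqs[np.argmax(spectrum[1:]) + 1])   # crude peak-frequency pitch
            writer.writerow([start / rate, volume, pitch])
```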

3.4 Data Set

Figure 4 shows how a data set was calculated for each head movement (response). The volume and pitch data were written at 200 lines per second. A window of 400 lines (400 volume values and 400 pitch values, i.e., two seconds) was treated as one data set, and two arrays were prepared for each data set: one to store the volume data and the other to store the pitch data. The yellow part in Fig. 4 shows that these arrays (“volume array” and “pitch array”) were shifted one line at a time over the interval from three seconds before the head movement up to the movement itself, and the volume and pitch data were stored in the arrays at each shift. Since the three-second interval contains 600 lines and each window spans 400 lines, we obtained 200 data sets per response. These data sets were labeled “1”.

Fig. 4. The way of collecting data for the learning model.

On the other hand, 400-value windows of volume and pitch data taken from the beginning of the audio file, outside the spans before and after the timing of the head movements, were treated as no-response data. These data sets were labeled “0”.
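
A minimal sketch of this data-set construction is given below, assuming the volume and pitch arrays from the CSV described in Sect. 3.3 and a list of button-press times; the array layout (400 volume values followed by 400 pitch values per data set) and the helper names are our assumptions.

```python
# Hypothetical sketch of the data-set construction in Sect. 3.4.
# volume, pitch: arrays with 200 values per second; response_times: button-press times (s).
import numpy as np

LINES_PER_SECOND = 200
WINDOW = 400          # 400 volume values + 400 pitch values per data set (2 s)
SHIFTS = 200          # one-line shifts covering the 3 s before each head movement

def make_response_sets(volume, pitch, response_times):
    X, y = [], []
    for t in response_times:
        end = int(t * LINES_PER_SECOND)              # line index of the head movement
        span_start = end - 3 * LINES_PER_SECOND
        for shift in range(SHIFTS):                  # slide the 400-line window one line at a time
            s = span_start + shift
            if s < 0 or s + WINDOW > end:
                continue
            X.append(np.concatenate([volume[s:s + WINDOW], pitch[s:s + WINDOW]]))
            y.append(1)                              # label 1: response data
    return np.array(X), np.array(y)

def make_no_response_sets(volume, pitch, response_times, n_sets):
    # Take 400-line windows from the start of the file that do not overlap
    # the spans around head movements; label them 0.
    blocked = set()
    for t in response_times:
        end = int(t * LINES_PER_SECOND)
        blocked.update(range(end - 3 * LINES_PER_SECOND, end + WINDOW))
    X, y, s = [], [], 0
    while len(X) < n_sets and s + WINDOW <= len(volume):
        if not any(i in blocked for i in range(s, s + WINDOW)):
            X.append(np.concatenate([volume[s:s + WINDOW], pitch[s:s + WINDOW]]))
            y.append(0)
        s += WINDOW
    return np.array(X), np.array(y)
```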

3.5 Scikit-Learn

Scikit-learn [12] is an open-source Python library for machine learning. It provides various algorithms, such as clustering, regression, and dimensionality reduction. Figure 5 shows the scikit-learn algorithm cheat-sheet [12], which we consulted when selecting the methods to build classifier models for the timing of head movements. Scikit-learn also has a grid search function that automatically optimizes the parameters of a machine-learning model; we used grid search for the Random Forest classifier.

Fig. 5. Scikit-learn algorithm cheat-sheet [12].

3.6 Building a Classifier Model Based on SVC

In scikit-learn, support vector classification (SVC) is the implementation of the support vector machine (SVM) [7, 8]. The SVM is a supervised learning technique used for classification, regression, and outlier detection. SVC finds the boundary between label 0 and label 1 from the training data and predicts the label of sample data. Scikit-learn provides two kinds of SVC, kernel-based and linear; in this experiment, we used only the linear form.
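
As a minimal sketch, a linear SVC model can be built in scikit-learn as follows. X and y stand for the feature matrix and labels from the data sets in Sect. 3.4, the 9:1 split matches the ratio described in Sect. 3.9, and all other parameters are left at their defaults because they are not reported here.

```python
# Minimal sketch: linear SVC on the volume/pitch data sets (default parameters assumed).
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)  # 9:1 split
svc_model = LinearSVC()
svc_model.fit(X_train, y_train)
print("accuracy:", svc_model.score(X_test, y_test))
```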

3.7 Building a Classifier Model Based on the K-Neighbors Classifier

The K-neighbors classifier [9, 10] determines the label of sample data by a majority decision among the k training samples nearest to the sample. In the scikit-learn K-neighbors classifier, we set two parameters: weights and n_neighbors. We selected “distance” for weights, so that closer neighbors of a query point carry more weight than distant neighbors. The n_neighbors parameter sets the number of neighbors to use; if it is too small or too large, the learning model cannot correctly predict the label of the sample data. Therefore, we selected the default value of 5.
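
A corresponding sketch for the K-neighbors classifier with the parameters described above, reusing the hypothetical X_train and y_train from the SVC sketch:

```python
# Minimal sketch: K-neighbors classifier with weights="distance" and n_neighbors=5.
from sklearn.neighbors import KNeighborsClassifier

knn_model = KNeighborsClassifier(n_neighbors=5, weights="distance")
knn_model.fit(X_train, y_train)        # X_train, y_train as in the SVC sketch
print("accuracy:", knn_model.score(X_test, y_test))
```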

3.8 Building a Classifier Model Based on the Random Forest Classifier

The Random Forest classifier [11] is a kind of ensemble classifier in scikit-learn. The learning model predicts the label of the sample data by combining multiple decision trees, each of which is built from a bootstrap sample of the training data. A bootstrap sample is obtained by resampling the dataset with replacement. Three parameters, n_estimators, max_features, and max_depth, were set using grid search, a function of scikit-learn that automatically sets machine-learning parameters to optimal values. The parameter class_weight was set to “balanced”. The candidates for n_estimators were seven values: 5, 10, 20, 30, 50, 100, and 300; for max_features, five values: 3, 5, 10, 15, and 20; and for max_depth, ten values: 3, 5, 10, 15, 20, 25, 30, 40, 50, and 100. Table 1 shows the parameters selected for each subject by the grid search.

Table 1. The parameters selected for each subject by the grid search.
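
The grid described above maps directly onto scikit-learn's GridSearchCV. The following is a minimal sketch, assuming the feature matrix X_train and labels y_train from the earlier sketches; the cv=10 setting is our assumption, mirroring the 10-fold cross-validation of Sect. 3.9, and is not stated for the grid search itself.

```python
# Sketch of the grid search over the Random Forest parameters listed above.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [5, 10, 20, 30, 50, 100, 300],
    "max_features": [3, 5, 10, 15, 20],
    "max_depth": [3, 5, 10, 15, 20, 25, 30, 40, 50, 100],
}
grid = GridSearchCV(
    RandomForestClassifier(class_weight="balanced"),
    param_grid,
    cv=10,                     # assumption: same 10-fold scheme as in Sect. 3.9
)
grid.fit(X_train, y_train)
print("best parameters:", grid.best_params_)      # reported per subject in Table 1
rf_model = grid.best_estimator_
```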

3.9 Analysis

The classifier models generated by the three methods were evaluated for each subject in terms of accuracy, precision, recall, and F-score, calculated using k-fold cross-validation (k = 10). The volume and pitch data of label 1 and label 0 were divided into training and test data at a ratio of 9:1.

Then, we recreated the classifier models using one subject’s data as the training data, to examine whether the timing of head movements while listening to a partner’s speech can be estimated across subjects.
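
A minimal sketch of this evaluation with scikit-learn's cross_validate is shown below, assuming the rf_model and the data sets X and y from the earlier sketches; the exact scoring setup used in the paper is not reported, so this only illustrates the 10-fold scheme.

```python
# Sketch of the evaluation in Sect. 3.9: 10-fold cross-validation of accuracy,
# precision, recall, and F-score for one subject's classifier model.
from sklearn.model_selection import cross_validate

scores = cross_validate(
    rf_model, X, y, cv=10,
    scoring=("accuracy", "precision", "recall", "f1"),
)
for metric in ("accuracy", "precision", "recall", "f1"):
    print(metric, scores["test_" + metric].mean())
```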

3.10 Result

Tables 2, 3, and 4 show the recall, precision, F-score, and accuracy of the three kinds of classifier models built by linear SVC, the K-neighbors classifier, and the Random Forest classifier for each subject. The average accuracy rates of the three classifier models were 0.61, 0.81, and 0.95, respectively, and the average F-scores were 0.39, 0.65, and 0.86, respectively. These results show that the Random Forest classifier is the most suitable method for modeling head movements in response to speech.

Table 2. The results of classifier models using linear SVC.
Table 3. The results of classifier models using the K-neighbors classifier.
Table 4. The results of classifier models using the Random Forest classifier.

Table 5 shows the average precision ratio, recall ratio, F-score, and accuracy rate of the classifier models built with the Random Forest classifier using one subject’s data as the training data. The average F-scores range from 0.45 to 0.57 and the accuracy rates from 0.64 to 0.82, whereas the corresponding averages for the individual classifier models are 0.86 and 0.95, respectively (see Table 4).

Table 5. The results of classifier models built with the Random Forest classifier using one subject’s data as the training data.

4 Discussion

Even if the robot cannot reply to the elderly person appropriately, the elderly person is not discouraged when two robots continue to talk in front of him/her [2]. Moreover, if the robot moves its head in synchronization with the elderly person, he/she may consider the robot friendly. The results of the experiment demonstrated that the volume and pitch of speech are useful for estimating the timing of head movements when the classifier model is built for an individual. The results also showed that the timing of head movements differs among the subjects.

In this experiment, the Random Forest classifier was the most appropriate of the three methods for creating the classifier model. The Random Forest algorithm is based on ensemble learning: it creates many decision trees from randomly selected data samples, and the final label is decided by aggregating the votes of the trees, which makes the classifier more accurate. If an individual classifier model for head movement is to be built, we suggest the Random Forest classifier.

5 Conclusion

In this paper, we investigated whether the timing of head movements can be estimated from the volume and pitch of speech, and which of three learning methods (SVM, K-neighbors classifier, and Random Forest classifier) is most useful for building a classifier model that estimates the timing of head movements. In the experiment, each of nine university students listened to about 10 min of speech by a male radio personality and pushed a button on an application whenever they moved their head in response. The experimental results showed that the volume and pitch were useful for estimating the timing of head movements, and that the Random Forest classifier is the most effective method for building an individual classifier model.

In future work, we will construct two robots that move their heads in synchronization with an elderly person, based on the individual classifier models.