
1 Introduction

With advancements in the field of information technology, it is now becoming possible for humans to communicate in a 3D virtual space over a network using CG characters called avatars. Many studies on supporting remote communication with CG characters such as avatars and agents have been conducted [1]. However, because current systems express the CG characters' nonverbal behavior based on key commands, they do not reproduce the embodied sharing that arises from the synchrony of embodied rhythms, such as nodding and body movements, in human face-to-face communication. In human face-to-face communication, not only verbal messages but also nonverbal behaviors such as nodding, body movement, line-of-sight, and facial expression are rhythmically related and mutually synchronized between talkers [2]. This synchrony of embodied rhythms is called entrainment; it unconsciously enhances the sharing of embodiment and empathy in human interaction, and accelerates activated communication, in which nonverbal behaviors such as body movements and speech activity increase and the embodied interaction is activated [3].

In our previous work, we analyzed the entrainment between a speaker's speech and a listener's nodding motion in face-to-face communication, and developed iRT (InterRobot Technology), which generates a variety of communicative actions and movements, such as nodding, blinking, and movements of the head, arms, and waist, that are coherently related to voice input [4]. In addition, we developed an interactive CG character called "InterActor," which has the functions of both speaker and listener, and demonstrated that InterActor can effectively support human interaction and communication [4]. Moreover, we developed an estimation model of interaction-activated communication based on the heat conduction equation and demonstrated its effectiveness through an evaluation experiment [5].

Body movements as well as line-of-sight behaviors such as eye contact and gaze duration also play an important role in smooth human face-to-face communication [6]. Moreover, it has been reported that expressing an avatar's gaze leads to smoother avatar-mediated communication. For example, Ishii et al. developed a communication system that controls an avatar's gaze based on an estimated line-of-sight model and demonstrated that this model facilitates utterances between talkers in avatar-mediated communication [7]. We also analyzed human eyeball movement through avatars by using an embodied virtual communication system with a line-of-sight measurement device, and proposed an eyeball movement model consisting of an eyeball delay movement model and a gaze withdrawal model [8]. In addition, we developed an advanced avatar-mediated communication system by applying the proposed eyeball movement model to InterActors, and demonstrated that it effectively supports embodied interaction and communication. These systems generate the avatar's eyeball movement using statistical models based on the characteristics of face-to-face communication. However, it is difficult for them to promote line-of-sight interaction, because the dynamic characteristics of human line-of-sight in activated communication have not yet been incorporated into their design. Therefore, in our previous research, we analyzed the interaction between activated communication and human gaze behavior by using a line-of-sight measurement device [8]. On the basis of this analysis, we proposed an eye gaze model consisting of an eyeball delay movement model and a look away model.

In this paper, we develop an advanced avatar-mediated communication system by applying the proposed eye gaze model to InterActors. Using only speech input, the system generates the avatar's eyeball movements, such as gazing and looking away, based on the proposed model, and provides a communication environment in which embodied interaction is promoted. The effectiveness of the proposed model and communication system is demonstrated by means of sensory evaluations in avatar-mediated communication.

2 A Speech-Driven Embodied Communication System Based on an Eye Gaze Model

2.1 InterActor

In order to support human interaction and communication, we developed a speech-driven embodied entrainment character called InterActor, which has the functions of both speaker and listener [4]. The configuration of InterActor is shown in Fig. 1. InterActor has a virtual skeleton structure consisting of the head, eyes, mouth, neck, shoulders, elbows, and hands (Fig. 1(a)). A texture is applied to the 3D surface model built on this virtual skeleton structure (Fig. 1(b)). In addition, various facial expressions are realized by applying the smile model developed in our previous research (Fig. 1(c)) [9, 10].

Fig. 1. InterActor: speech-driven embodied entrainment character.

The listener's interaction model includes a nodding reaction model, which estimates the nodding timing from the speech ON-OFF pattern, and a body reaction model linked to the nodding reaction model [4]. The timing of nodding is predicted using a hierarchical model consisting of two stages: macro and micro (Fig. 2). The macro stage estimates whether a nodding response exists in a duration unit, which consists of a talkspurt episode T(i) and the following silence episode S(i) with a hangover value of 4/30 s. The estimator M_u(i) is a moving-average (MA) model, expressed as the weighted sum of the unit speech activity R(i) in Eqs. (1) and (2). When M_u(i) exceeds a threshold value, the nodding timing M(i) is estimated by another MA model, expressed as the weighted sum of the binary speech signal V(i) in Eq. (3). A minimal implementation sketch of this two-stage estimation is given after Eq. (3).

$$ M_{u} (i) = \sum\limits_{j = 1}^{J} {a(j)R(i - j) + u(i)} $$
(1)
$$ R(i) = \frac{T(i)}{T(i) + S(i)} $$
(2)
  • a(j): linear prediction coefficient

  • T(i): talkspurt duration in the i-th duration unit

  • S(i): silence duration in the i-th duration unit

  • u(i): noise

  • i: frame number

$$ M(i) = \sum\limits_{j = 1}^{K} {b(j)V(i - j) + w(i)} $$
(3)
  • b(j): linear prediction coefficient

  • V(i): binary speech signal (voice ON-OFF)

  • w(i): noise
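As a reference, the following Python sketch illustrates one way the two-stage estimation of Eqs. (1)–(3) could be implemented. The coefficient vectors and threshold values are placeholders for illustration; the identified values used by InterActor are not reproduced here.

```python
import numpy as np

class NoddingEstimator:
    """Minimal sketch of the two-stage MA nodding model (Eqs. 1-3).

    The coefficients and thresholds below are illustrative placeholders,
    not the identified values from the original work.
    """

    def __init__(self, a, b, macro_threshold=0.5, micro_threshold=0.5):
        self.a = np.asarray(a)          # a(j): macro-stage MA coefficients
        self.b = np.asarray(b)          # b(j): micro-stage MA coefficients
        self.macro_threshold = macro_threshold
        self.micro_threshold = micro_threshold

    @staticmethod
    def speech_activity(talkspurt, silence):
        """R(i) = T(i) / (T(i) + S(i)) for one duration unit (Eq. 2)."""
        return talkspurt / (talkspurt + silence)

    def macro_estimate(self, R_history):
        """M_u(i): weighted sum of past unit speech activities (Eq. 1)."""
        recent = R_history[-len(self.a):][::-1]   # R(i-1), R(i-2), ..., R(i-J)
        return float(np.dot(self.a[:len(recent)], recent))

    def micro_estimate(self, V_history):
        """M(i): weighted sum of past binary speech samples (Eq. 3)."""
        recent = V_history[-len(self.b):][::-1]   # V(i-1), ..., V(i-K)
        return float(np.dot(self.b[:len(recent)], recent))

    def nod_now(self, R_history, V_history):
        """Nod only when the macro stage admits a nod and the micro stage exceeds its threshold."""
        if self.macro_estimate(R_history) <= self.macro_threshold:
            return False
        return self.micro_estimate(V_history) > self.micro_threshold
```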

Fig. 2. Interaction model.

The body movements are related to the speech input in that the neck and one of the wrists, elbows, arms, or waist are moved when the body threshold is exceeded. This threshold is set lower than that of the nodding prediction in the MA model, which is expressed as the weighted sum of the binary speech signal. In other words, when InterActor functions as a listener and generates body movements, the relationship between nodding and the other movements depends on the threshold values of the nodding estimation.
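As a rough illustration of this threshold arrangement, a listener-side decision might look like the following sketch; the body-part selection and the threshold values are assumptions for illustration only.

```python
import random

def listener_actions(micro_estimate, nod_threshold=0.5, body_threshold=0.3):
    """Select listener movements from the micro-stage MA estimate M(i).

    The body threshold is set lower than the nodding threshold, so body
    movements occur more often than nods; the values here are illustrative.
    """
    actions = []
    if micro_estimate > nod_threshold:
        actions.append("nod")
    if micro_estimate > body_threshold:
        # Move the neck together with one of the wrists, elbows, arms, or waist.
        actions.append(random.choice(["wrist", "elbow", "arm", "waist"]))
    return actions
```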

2.2 Eye Gaze Model

We proposed an eye gaze model that generates gaze and looking-away movements to enhance embodied communication, based on the characteristics obtained from the analysis of human eyeball movement. The proposed model consists of the previously proposed eyeball delay movement model [8] and a look away model. The outline of the proposed model is as follows:

(1) Eyeball Delay Movement Model

The eyeball delay movement model introduces a delay of 0.13 s with respect to the avatar's head movement. First, the angle of the avatar's gaze direction toward the viewpoint in virtual space is calculated using Eq. (4) (Fig. 3(a)). Then, the avatar's gaze is generated by adding the angle of the avatar's head movement to the angle of the avatar's gaze direction four frames earlier, at a frame rate of 30 fps (Eq. (5)). Figure 3(b) shows an example of the eyeball delay movement model in an avatar. When the avatar's head moves, the eyeballs move in the opposite direction with a delay of 0.13 s with respect to the head movement. A minimal sketch of this calculation is given after Eq. (5).

$$ \theta_{AG} = \tan^{ - 1} \frac{{A_{Ex} - P_{x} }}{{A_{Ey} - P_{y} }} $$
(4)
  • θ_AG: rotation angle of the gaze direction

  • A_Ex, A_Ey: eyeball position of InterActor

  • P_x, P_y: position of the viewpoint in virtual space

$$ \theta_{G} (i) = \theta_{AH} (i) + \theta_{AG} (i - 4) $$
(5)
  • θ_G(i): rotation angle of eyeball movement

  • θ_AH(i): rotation angle of InterActor's head movement

  • i: frame number
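A minimal Python sketch of Eqs. (4) and (5) at 30 fps is shown below; the class and variable names, as well as the start-up behavior before four frames have elapsed, are assumptions made for illustration.

```python
import math
from collections import deque

class EyeballDelayModel:
    """Sketch of the eyeball delay movement model (Eqs. 4 and 5):
    the gaze-direction angle is delayed by four frames (about 0.13 s
    at 30 fps) and added to the current head rotation angle."""

    def __init__(self, delay_frames=4):
        self.gaze_history = deque(maxlen=delay_frames + 1)

    @staticmethod
    def gaze_direction(eye_x, eye_y, view_x, view_y):
        """theta_AG: rotation angle of the gaze direction toward the viewpoint (Eq. 4)."""
        return math.atan2(eye_x - view_x, eye_y - view_y)

    def eyeball_angle(self, head_angle, eye_x, eye_y, view_x, view_y):
        """theta_G(i) = theta_AH(i) + theta_AG(i - 4) (Eq. 5)."""
        self.gaze_history.append(self.gaze_direction(eye_x, eye_y, view_x, view_y))
        delayed_gaze = self.gaze_history[0]   # oldest stored angle, up to 4 frames old
        return head_angle + delayed_gaze
```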

Fig. 3. Eyeball delay movement model.

(2) Look Away Model

Our previous analysis of human eyeball movement indicates that direct gaze accounts for only about 80% of the total conversation time [8]. Therefore, the look away model in this study generates eyeball movements other than direct gaze, such as gaze withdrawal and blinking, based on that analysis. In the looking-away movement, the avatar's eyeballs are shifted largely in the horizontal direction (Fig. 4); the effectiveness of this movement was confirmed in a preliminary experiment. When the estimated degree of interaction-activated communication falls below a threshold value, the looking-away movement is generated by the proposed model (Fig. 5). The avatar's gaze is thereby modulated so that staring is prevented and impressions of the conversation, such as unification and vividness, are enhanced.
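The estimate of interaction-activated communication itself comes from the heat-conduction-based model of our previous work [5] and is not reproduced here; the sketch below therefore treats it as an input, and the threshold value is an assumption for illustration.

```python
def select_gaze_state(activation_estimate, activation_threshold=0.5):
    """Sketch of the look away model: when the estimated degree of
    interaction-activated communication falls below the threshold, a
    looking-away movement (a large horizontal eyeball shift) is generated;
    otherwise the eyeball delay movement model keeps the direct gaze.
    The threshold value here is illustrative only."""
    return "look_away" if activation_estimate < activation_threshold else "direct_gaze"
```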

Fig. 4. Looking away movement.

Fig. 5. Look away model.

2.3 Developed System

We developed an advanced communication system in which the proposed model is used with InterActors (Fig. 6). The virtual space was generated using the Microsoft DirectX 9.0 SDK (June 2010) on a Windows 7 workstation (CPU: Core i7, 2.93 GHz; memory: 8 GB; graphics: NVIDIA GeForce GTS 250). The voice was sampled at 11 kHz with 16-bit resolution via a headset (Logicool H330). InterActors were rendered at a frame rate of 30 fps.

Fig. 6. System setup.

When Talker1 speaks to Talker2, InterActor2 responds to Talker1's utterance with appropriate timing through body movements, including nodding, blinking, and other actions, in a manner similar to the body motions of a listener. A nodding movement is defined as a falling-rising movement in the front-back direction at a speed of 0.15 rad/frame. In addition, InterActor2 generates eyeball movements based on the proposed model. Here, a looking-away movement is defined as a left-right motion of the eyeballs at a speed of 0.15 rad/frame, based on the preliminary experiment. InterActor1 likewise generates communicative actions and movements, as well as the avatar's eyeball movements, as a speaker by using the MA model and the eye gaze model. In this manner, two remote talkers can enjoy a conversation via InterActors within a communication environment in which a sense of unity is shared through embodied entrainment.

3 Communication Experiment

In order to evaluate the developed system, a communication experiment was carried out.

3.1 Experimental Method

The experiment was performed with pairs of talkers engaged in free conversation. The following three modes were compared: mode (A), with neither eyeball movement nor facial expression; mode (B), with the smile model only; and mode (C), with the combined smile model and eye gaze model. We recorded the communication scenes using two video cameras and the screens, as shown in Fig. 7. The subjects were 12 pairs of talkers (12 males and 12 females).

Fig. 7. Example of a communication scene using the system.

The experimental procedure was as follows. First, the subjects used the system for around 3 min. Next, they were instructed to perform a paired comparison of the modes, selecting the better mode in each pair based on their preferences. Finally, they engaged in a free conversation for 3 min in each mode and evaluated it using a questionnaire. The questionnaire used a seven-point bipolar rating scale from −3 (not at all) to 3 (extremely), where a score of 0 denotes "moderately." The conversational topics were not specified in either evaluation. In each comparison, the two modes were presented to each pair of talkers in a random order.

3.2 Result

The results of the paired comparison are summarized in Table 1, which shows the number of wins for each mode. For example, mode (A) won six comparisons against mode (B), and nine comparisons in total. Figure 8 shows the preference values calculated from the evaluation in Table 1 based on the Bradley-Terry model given in Eqs. (6) and (7) [11].

Table 1. Result of paired comparison.
Fig. 8. Comparison of the preference \( \pi \) based on the Bradley-Terry model.

$$ p_{ij} = \frac{{\pi_{i} }}{{\pi_{i} + \pi_{j} }} $$
(6)
$$ \sum\limits_{i} {\pi_{i} } = const.( = 100) $$
(7)
  • π_i: intensity of preference for mode i

  • p_ij: probability that mode i is judged better than mode j
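As a reference for how the preference intensities π_i in Fig. 8 can be obtained from the win counts in Table 1, the following sketch fits the Bradley-Terry model with a standard iterative (minorization-maximization) update; the win matrix at the bottom contains placeholder values, not the experimental data.

```python
import numpy as np

def fit_bradley_terry(wins, iterations=100):
    """Fit Bradley-Terry intensities pi_i from a pairwise win matrix.

    wins[i][j] is the number of times mode i was preferred over mode j.
    The intensities are scaled so that they sum to 100 (Eq. 7).
    """
    wins = np.asarray(wins, dtype=float)
    n = wins.shape[0]
    pi = np.ones(n)
    for _ in range(iterations):
        for i in range(n):
            total_wins = wins[i].sum()
            denom = sum((wins[i, j] + wins[j, i]) / (pi[i] + pi[j])
                        for j in range(n) if j != i)   # comparisons weighted by Eq. (6)
            pi[i] = total_wins / denom
        pi *= 100.0 / pi.sum()                          # normalization of Eq. (7)
    return pi

# Placeholder win matrix for three modes (illustrative values only).
example_wins = [[0, 6, 3],
                [6, 0, 4],
                [9, 8, 0]]
print(fit_bradley_terry(example_wins))
```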

The consistency of the mode matching was confirmed by a goodness-of-fit test \( (\chi^{2}(1, 0.05) = 3.84 > \chi_{0}^{2} = 0.28) \) and a likelihood ratio test \( (\chi^{2}(1, 0.05) = 3.84 > \chi_{0}^{2} = 0.27) \). The proposed mode (C), with both the smile model and the eye gaze model, was evaluated as the best, followed by mode (B), with the smile model only, and mode (A), with no movement.

The questionnaire results are shown in Fig. 9. The Friedman test and the Wilcoxon signed-rank test showed a significance level of 1% for all categories among modes (A), (B), and (C). In addition, "Enjoyment," "Interaction-activated communication," "Vividness," and "Natural line-of-sight" showed a significance level of 5% between modes (B) and (C).

Fig. 9. Seven-point bipolar rating.

In both evaluations, mode (C), with the proposed eye gaze model, was rated the best for avatar-mediated communication. These results demonstrate the effectiveness of the proposed eye gaze model and of the combined model.

4 Conclusion

In this paper, we developed an advanced avatar-mediated communication system in which our proposed eye gaze model is used by speech-driven embodied entrainment characters called InterActors. The proposed model consists of an eyeball delay movement model and a look away model. Using only speech input, the communication system generates the entrained head and body motions of the InterActors as well as their eyeball movements based on this model. Sensory evaluations in avatar-mediated communication demonstrated the effectiveness of the proposed eye gaze model and communication system.