1 Introduction

Virtual humans are increasingly being researched and developed for applications in which they must engage and interact with real humans. Some of these applications include virtual tour guides for museums [40, 41], adjuncts to therapy [5, 13], support for training [33], and instruments of social science research [38]. While there are many ways to display virtual humans, one important technique presents them as life-sized images on large monitors or large projection screens, viewed by one or more individuals. However, such screens have limitations that can hamper important aspects of social interaction.

A large 2D display screen, whether a large monitor or a large projection screen, is typically limited to displaying imagery rendered from a single viewpoint at a time. This means that if observers are not standing at the viewpoint location for which the imagery was rendered, the perspective of that imagery will be incorrect for them. This perspective mismatch occurs whenever there are multiple observers, or when a single observer moves around without providing the rendering system with location updates.

Fig. 1.

An over-the-shoulder view of a user interacting with a virtual human, presented with REFLCT, a near axis, head mounted, retro-reflective projection display. The character and wooden wall pattern are projected. The corrugated materials are real props mounted around the retro-reflective screen.

This perspective mismatch can lead to ambiguous social cues when observers try to determine where a virtual human is pointing or looking. The “Mona Lisa” effect is one example of problems that can arise. This effect, named after the painting by Leonardo da Vinci, is a perceptual illusion in which the 2D image of a character appears to be looking or pointing at an observer, regardless of where the observer is standing in relation to the character. There appears to be a human perceptual process that realigns the perceived gaze direction of characters who are depicted as gazing straight out of the image surface. This yields the unsettling effect of characters in paintings who appear to follow observers with their eyes. This perceptual effect and related invariant perception issues have been reported and studied by a number of researchers from the 17th century to current times [12, 16, 29, 34, 36]. Some work suggests that the Mona Lisa effect is a consequence of the perceptual system estimating and compensating for the average local slant of the display surface behind the imagery [8, 44]. The effect also appears to have a neurological basis [11]. Our research interest is the development of display technologies that help to ameliorate the limitations of displays that yield such mismatched perspective cues, particularly when presenting virtual characters for training.

Some display systems employ multiplexing over time, frequency, or polarization to display more than one viewpoint. For example, a stereoscopic display can alternately present left- and right-eye viewpoints to a single individual wearing shutter glasses. Agrawala et al. extended this type of time-multiplexed stereoscopy to create a head-tracked virtual reality display for multiple users, with trade-offs of reduced brightness and increased flicker as users are added, and the requirement that users wear shutter glasses and tracking sensors [2]. Leveraging techniques analogous to spatial multiplexing, light field displays, using either a single projector with a spinning mirror or a large array of projectors, can also provide perspective-correct imagery to multiple viewers. However, such systems have complex hardware requirements. The spinning mirror approach requires a very high frame rate projector and a rapidly rotating mirror, for example 4,800 frames per second and 1,200 revolutions per minute [20, 21]. The projector array approach can require multiple computers driving an array of 216 closely spaced projectors and an anisotropic light shaping diffuser [22].

Another approach to rendering projected imagery with correct perspective for multiple users led to the development of REFLCT (Retroreflective Environments For Learner-Centered Training), a near axis, head mounted projective display that utilizes retroreflective screens [10, 27]. With REFLCT, each user wears a tracked head mounted projector. Because the retroreflective screens reflect the light from each projector straight back toward its source, each user can see only the image generated by their own projector. The system can thus render imagery of a virtual human with perspective-correct and consistent gaze for each user: if the virtual character is looking at user A, every user will see the character looking at user A.

A unique aspect of the REFLCT system is how it leverages the imperfect performance characteristics of retroreflective materials. Light is not reflected purely on-axis, thus offering substantial energy at slightly off-axis angles. By mounting a pico-projector near a user’s eyes, this reflected energy can be seen by the user. Previous head mounted projection systems used earlier projection technologies that provided less light and larger form factors. These systems required the projection to be aligned with the same optical axis as the user’s eyes, typically facilitated by employing an optical combiner in front of the user’s eyes. Recent pico-projectors offer enough brightness such that this is not an issue.
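As a rough illustration of why a near-axis mount works, the angle between the projection axis and the wearer's line of sight can be estimated from the projector-to-eye offset and the screen distance. The specific offset and distance below are assumptions for illustration, not published REFLCT specifications:

```python
import math

def observation_angle_deg(eye_offset_m, screen_distance_m):
    """Angle between the projection axis and the wearer's line of sight
    for an eye mounted eye_offset_m from the projector aperture.
    Retroreflective gain falls off as this angle grows, so a small
    angle keeps the reflected energy visible to the wearer."""
    return math.degrees(math.atan2(eye_offset_m, screen_distance_m))

# Hypothetical values: ~6 cm projector-to-eye offset, 3 m to the screen.
angle = observation_angle_deg(0.06, 3.0)
```

At a few centimeters of offset and a few meters of range, the observation angle stays near one degree, the small-angle regime in which retroreflective sheeting is typically characterized.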

Retroreflective screen material is placed wherever virtual elements are to be displayed (see Fig. 1). A number of props, such as simulated cinderblock walls, sandbags, and camouflage netting, can be used to create a military themed stage and blend the screens into the environment. Other props could be used for alternative training settings. Retroreflective coatings can also be added to props, in the form of retroreflective cloth, retroreflective tape, or even a coating of fine retroreflective glass beads, allowing an image to be applied to arbitrary surfaces or even a sculpted human form.

In this paper, we discuss our evaluation of REFLCT and its effect on participants performing a multi-party social task.

2 Related Work

Mutual gaze, i.e. looking someone in the eye, is an important social signal for demonstrating interest and attention. Through the use of gaze, conversing parties are able to perform a number of functions to both express and control information flow, such as regulating the flow of conversation, conveying emotions, describing relationships, and constraining the amount of visual information received in order to avoid distraction [3, 4, 26]. The amount of gaze presented by an individual can influence social perceptions of that individual. Individuals who provide more gaze to an interviewer can receive higher socio-emotional evaluations [15]. Additionally, greater levels of eye contact can be associated with greater perceived dynamism, likability, and believability [9].

Much of the research to examine and enhance the communication of gaze in the human-computer interaction field has sprung from efforts to improve video conferencing systems or to improve interactions with virtual humans. Some of these systems have included mechanical proxies for conversants and may also employ multiple cameras [31, 37]. Other systems have used multi-camera configurations with large displays showing multiple conversants [32] or “video tunnels” [1], which utilize half-silvered mirrors to align cameras and displays and thus the sight lines of video conferencing participants. GAZE-2, developed by Vertegaal et al., is a hybrid approach, using half-silvered mirrors, a small array of cameras, and eye tracking to select and present the video imagery best representing eye contact to other participants in a multi-party video conference [43].

Virtual humans who demonstrate effective gaze can have positive impacts on social interactions. However, gaze may only provide a portion of the social signals needed to facilitate social interactions. Work by Wang et al. demonstrated that a virtual human conveying maximal attention, using continuous gaze or staring, was less effective in establishing rapport with human counterparts than virtual humans who combined continuous gaze with postural mimicry and head nods as additional positive social feedback signals [45].

Displays for improved rendering, particularly for presenting individualized, perspective-correct imagery to multiple users, have followed three main approaches: projector arrays, head-mounted displays (HMDs), and head-mounted projective displays (HMPDs). Projector arrays coupled with asymmetrically diffusing screens [7, 22, 30] can create individualized perspective-correct views, but are expensive in terms of hardware and calibration effort. They require more projectors than users, with projectors positioned everywhere that a user might be. In most cases, they are configured to offer only horizontal image isolation. Eye-tracked autostereoscopic systems may be used to reduce the number of projectors required, but commercial systems are limited in size and viewing angle [39].

Head mounted displays (HMDs) can provide perspective correct mixed reality imagery for multiple users, using either a video or optical overlay upon a real view of the world. Video overlays mix synthetic imagery with a live camera view of the world using standard opaque head-mounted displays. Video overlays exhibit some artifacts, such as video frame lag (typically 1/30th s) and the downsampling of the world to video resolution. Optical overlays use translucent displays, allowing the real world to be seen through the display. Unfortunately, this often causes the virtual imagery to be translucent. Optical overlays can also make tracker lag and noise more apparent as the virtual imagery is compared to the real world. With either type of overlay, HMDs add bulky optical elements in front of the user’s eyes. These elements make it difficult for trainees to see each other’s eyes and facial expressions. They can also interfere with sighting down a weapon.

Head mounted projective displays (HMPDs) can also be used for individualized virtual and augmented reality imagery. The previous generation of HMPDs differs from REFLCT in several ways. Chief among these is the use of projectors that shine onto an optical combiner, a semi-transparent mirror surface, in front of a user’s eyes to create a projection path aligned with a user’s optical path [14, 17,18,19, 35]. The partially reflective surface in front of the eyes can interfere with eye contact and with head movements such as sighting down a rifle. The approach of REFLCT can be compared to that of Karitsuka and Sato [25] in that there is no optical combiner; however, REFLCT employs more compact components and a more favorable optical configuration, maintaining a small, fixed distance between the projector and the user’s eyes.

3 Method

A user study was employed to examine the ability of the REFLCT system to accurately portray the gaze of a virtual human character in a social situation to multiple viewers at a time. This study engaged participants in a “Twenty Questions” game led by a virtual character. Three individuals asked yes/no questions of the virtual character in order to determine secret objects previously selected by the virtual character. The virtual character was presented using the REFLCT system, which could be used in a normal Head Mounted Projection (HMP) mode, delivering perspective correct imagery to each participant. It could also deliver imagery to each user that was rendered from a single viewpoint, in a Simulated Traditional Projection (STP) mode. Various measures of rapport and social response to the character were recorded using standardized surveys.

Fig. 2.

A REFLCT head mounted projector unit incorporates a pico-projector, a mirror, and LED tracking markers onto a helmet. An optional USB video camera shown here can record or convey what the user sees.

3.1 Apparatus

Study participants wore helmets fitted with a REFLCT projection unit (see Fig. 2). Each unit was fashioned out of a high density fiberboard framework, which can support a number of active LED markers for motion capture, a DLP based pico-projector, and an optional USB video camera for monitoring and recording the user’s view. The USB cameras were not used in this study. The pico-projector is vertically mounted and projects down upon a small mirror oriented at 45\({}^\circ \), which reflects the light forward. This places the optical axis of the projection closer to the user’s eyes. A PhaseSpace Impulse active LED motion capture system determines the position and orientation of each REFLCT projection unit and distributes this information, via VRPN, to a corresponding PC. Each PC then renders the proper perspective view using the Unity game engine. Each REFLCT projector is connected by a DVI cable to the corresponding PC.
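The paper does not give implementation details for the per-user rendering, but the standard way to render perspective-correct imagery for a tracked head position and a fixed planar screen is a generalized (off-axis) perspective projection. The sketch below computes the asymmetric frustum bounds at the near plane; the function name and all values are illustrative, not taken from REFLCT:

```python
import numpy as np

def offaxis_frustum(pa, pb, pc, pe, near=0.1):
    """Off-axis frustum bounds (left, right, bottom, top at the near
    plane) for an eye at pe viewing a planar screen with corners
    pa (lower-left), pb (lower-right), and pc (upper-left)."""
    pa, pb, pc, pe = (np.asarray(p, dtype=float) for p in (pa, pb, pc, pe))
    vr = pb - pa; vr /= np.linalg.norm(vr)   # screen right axis
    vu = pc - pa; vu /= np.linalg.norm(vu)   # screen up axis
    vn = np.cross(vr, vu)                    # screen normal, toward the eye
    va, vb, vc = pa - pe, pb - pe, pc - pe   # eye-to-corner vectors
    d = -np.dot(va, vn)                      # eye-to-screen distance
    left = np.dot(vr, va) * near / d
    right = np.dot(vr, vb) * near / d
    bottom = np.dot(vu, va) * near / d
    top = np.dot(vu, vc) * near / d
    return left, right, bottom, top
```

An eye centered in front of the screen yields a symmetric frustum; as the tracked head moves off-center, the frustum skews so the rendered perspective stays locked to the physical screen.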

The software development environment was centered around the Unity game engine with scripting written in C#. The virtual human character was animated using the Smartbody character animation platform [23, 42]. The character’s voice consisted of spoken dialogue that was pre-recorded and then processed to determine visemes (mouth movements) and timing for character animation. Most character behaviors, such as head and mouth movements, were automated. Specific gaze behaviors could be set according to the needs of the interaction or experimental condition, described later in this paper. Appropriate vocal responses and special gestures, such as affirmative nods, could be triggered by an experimenter operating a Wizard of Oz control panel. This experimenter and the control panel were hidden from the view of the participants during the experiment.

3.2 Design

The experiment used a 2\(\,\times \,\)3 factorial between-subjects design with two factors: (i) two different types of projection display: Simulated Traditional Projection display (STP) and Head Mounted Projection display (HMP); and (ii) three different types of eye gaze behavior: extensive gaze, real-life gaze, and random gaze. The HMP condition reflects the true design intent of the REFLCT system. The simulation aspect of the Simulated Traditional Projection condition was employed to enable a cleaner comparison of display styles, with more consistent resolution and brightness than could be achieved by introducing a separate traditional projection display.

Participants. A total of 107 users were recruited via Craigslist and emails to our institute staff mailing list. The users (52% men, 48% women; mean age of 37 years) were randomly assigned to one of the 6 conditions to interact with the virtual character (standing position was also varied between subjects): HMP with extensive gaze (N = 21), HMP with real-life gaze (N = 17), HMP with random gaze (N = 18), STP with extensive gaze (N = 16), STP with real-life gaze (N = 16), and STP with random gaze (N = 19).

Stimuli. Study participants viewed the virtual character performing one of three gaze conditions. The general qualities of the gaze conditions are listed in this paragraph, and a description of how they were implemented under the two projection display conditions is given in the following paragraphs. The extensive gaze condition presented continuous mutual gaze between the virtual human and each participant throughout an interaction. The real-life gaze condition portrayed the virtual human with a gaze direction that shifted toward the participant who was currently playing his/her turn in the game. The random gaze condition presented the virtual human with a gaze that shifted between random points within the virtual human’s visual field, at random intervals between 2 and 10 s.
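The random gaze condition, as described, amounts to a simple scheduler: pick a random point in the character's visual field, dwell for a random interval between 2 and 10 s, then repeat. A minimal sketch follows; the field-of-view limits and the polling interface are illustrative assumptions, not the study's actual implementation:

```python
import random

class RandomGazeController:
    """Shifts gaze to a random point in the character's visual field
    at random intervals between 2 and 10 seconds, matching the random
    gaze condition. Yaw/pitch limits here are illustrative."""

    def __init__(self, yaw_range=(-40.0, 40.0), pitch_range=(-15.0, 15.0)):
        self.yaw_range = yaw_range
        self.pitch_range = pitch_range
        self.next_shift = 0.0          # time at which to pick a new target
        self.target = (0.0, 0.0)       # current (yaw, pitch) gaze target

    def update(self, t):
        # When the current dwell expires, pick a new random target
        # and schedule the next shift 2-10 s later.
        if t >= self.next_shift:
            self.target = (random.uniform(*self.yaw_range),
                           random.uniform(*self.pitch_range))
            self.next_shift = t + random.uniform(2.0, 10.0)
        return self.target
```

Calling `update` each frame with the current time returns a stable target between shifts, so the character holds each random gaze point for its full dwell interval.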

In the HMP with Extensive gaze condition, each user was presented with full and direct gaze from the virtual human, which would be inconsistent and physically impossible in real life. This is possible in the HMP condition since REFLCT provides each user with a personalized view of the character. In the STP with Extensive gaze condition, users experienced the full gaze of the character as if it were presented on a traditional projection display: the character was rendered from the center position, and thus was constantly staring straight out of the screen. Participants at the left and right positions would experience incorrect perspectives and also the “Mona Lisa” effect.

In the HMP with real-life gaze condition, users experienced consistent and correct perspective rendering of the virtual human’s gaze as it shifted between users located at the left, center, and right locations. In the STP with real-life gaze condition, users experienced the same behavior, but with perspective distortion of the character’s gaze, since the imagery was rendered from a single central viewpoint. A user at the left or right location would see the character rendered as if it were gazing at too great an angle, i.e. turning too sharply and gazing past the user’s location.

In the HMP with random gaze condition, each user was presented with consistent and perspective correct rendering of the character shifting gaze between random points at a random interval as previously described. In the STP with random gaze condition, each user was presented with this same gaze behavior, but as rendered from a single central location, introducing perspective distortions.

To evaluate the social impact on the participants’ experiences, we used existing measurements including Virtual Rapport [24] to measure the users’ feelings of being connected and together, as well as PANAS (Positive And Negative Affect Schedule) [46], Person Perception [28], questions concerning the “Twenty Questions” game [6], questions related to the virtual human’s eye gaze [6], and additional questions concerning the amount of eye contact from the virtual human (e.g. “What percentage of time do you think the virtual character was making eye contact with you?”).

3.3 Procedure

Participation required a total of less than 60 min on an individual basis. Upon arrival, study participants were first provided with the informed consent form, which described the study in detail. Participants were asked to read through the document, and were given the opportunity to ask questions about the study.

Participants were then given an online questionnaire to record demographic information, ratings of experiences with video games, prior experience with virtual reality, prior interactions with virtual characters, as well as personality related information.

Participants were then led to the space where the experimental apparatus was located. They were each assigned a position in which to stand. The positions were located approximately 3 meters from the projection screen: one on the screen’s perpendicular normal, and the others 20 degrees to its left and right. Participants were also joined by confederates, posing as additional participants, to form a group of three game players (1 or 2 participants randomly assigned to the left and/or right positions, with confederates standing in the center and filling any vacant positions). As the perspective-correct HMP display and the perspective-distorted STP display conditions do not differ much when viewed from the center position, only confederates were assigned to that position. The participants and confederates were then assisted in donning the head mounted projection displays and given instructions for the upcoming game play.

The participants were then asked to perform one 10 min experimental trial randomly selected from the 6 possible conditions (2 display types \(\times \) 3 gaze behaviors). In each trial, the participants and confederates played guessing games with the virtual character. The games consisted of the virtual character secretly “selecting” an object, such as a frog, tree, or ocean, and responding to yes/no questions posed by each participant or confederate in turn. The virtual character would respond with a variety of affirmative or negative responses, or advise the players to ask a yes/no question, as appropriate. Each game continued until 20 questions had been asked or until the players correctly guessed the secret object. Additional games were run until the 10 min trial was completed. The confederates were trained to perform consistent actions throughout all of the interaction sessions and to allow the participants the opportunity to play a significant role in questioning and guessing.
Participants completed questionnaires after the experimental trial.
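The station layout described above (about 3 m from the screen, at 0 degrees and 20 degrees to either side of the screen's perpendicular normal) can be sketched with basic trigonometry; the coordinate convention below is an assumption for illustration:

```python
import math

def station_position(angle_deg, distance_m=3.0):
    """Floor position of a player station relative to the screen center:
    angle_deg is the offset from the screen's perpendicular normal
    (negative = left, positive = right). Returns (x across the screen,
    z out along the screen normal), in meters."""
    a = math.radians(angle_deg)
    return (distance_m * math.sin(a), distance_m * math.cos(a))

stations = {name: station_position(a)
            for name, a in (("left", -20.0), ("center", 0.0), ("right", 20.0))}
```

This places the left and right stations roughly a meter to either side of the center station, close enough that the character's gaze shifts between players are clearly distinguishable angles.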

4 Results

We conducted a 2-way ANOVA to investigate the effects of projection type and gaze pattern on users’ responses to the experience. We also performed a 2-way ANOVA to evaluate the effects of condition and gender on users’ responses.

Fig. 3.

Difference between user perception of enough and appropriate gaze from the virtual human across 3 gaze conditions (*p < .05).

For users’ perceptions of enough and appropriate gaze from the virtual human, the gaze pattern affected the users’ perceptions significantly [F(2, 101) = 4.03, p = .021] (see Fig. 3). Users felt they received enough and appropriate levels of gaze when they interacted with a virtual human that displayed the Real-life gaze (M = 5.38), but less so when interacting with a virtual human that presented Extensive gaze (M = 5.06) or Random gaze (M = 4.50). This implies that some users might have felt the extensive gaze to be too much and thus socially inappropriate. A Tukey HSD test shows that there was a statistically significant difference between the Real-life gaze and the Random gaze conditions (p = .018). There was no interaction effect with projection type for appropriate gaze level.
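For readers who want to reproduce this style of analysis, a minimal F computation is shown below. This is a one-way simplification of the paper's two-way design, using made-up numbers rather than the study's data:

```python
import numpy as np

def one_way_anova(groups):
    """Classic one-way ANOVA F statistic: between-group mean square
    divided by within-group mean square. groups is a list of 1-D arrays,
    one per condition (e.g. the three gaze patterns)."""
    all_x = np.concatenate(groups)
    grand = all_x.mean()
    k, n = len(groups), len(all_x)
    ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_between, df_within = k - 1, n - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within
```

With three gaze groups and 104 scored participants, the degrees of freedom come out to (2, 101), matching the form of the reported F(2, 101) statistics.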

Fig. 4.

Differences between user reports concerning the percentage of eye contact from the virtual human across 3 gaze conditions (*p < .05).

Regarding the users’ reports of the percentage of eye contact from the virtual human, the results show a statistically significant difference among the three gaze patterns [F(2, 101) = 3.97, p = .022] (see Fig. 4). Users reported the greatest amount of eye contact from a virtual human that displayed the Real-life gaze (M = 57.21%), compared to interacting with a virtual human that presented Extensive gaze (M = 54.76%) or Random gaze (M = 39.02%). A Tukey HSD test showed a statistically significant difference between the Real-life gaze and the Random gaze conditions (p = .029). There was no interaction effect between projection type and gaze pattern.

A deeper look at the percentage of eye contact from the virtual human also shows a statistically significant difference when testing the 6 conditions formed by the combination of 2 projection types and 3 gaze patterns [F(5, 93) = 2.86, p = .019] (see Fig. 5). According to the results of a Tukey HSD test, there is no statistically significant pairwise difference between any two conditions, consistent with the absence of an interaction effect between projection type and gaze pattern noted above. However, there is a trend that users experienced the greatest eye contact from a virtual human displayed by HMP with extensive gaze (M = 60.14%) and less eye contact in the HMP with real-life gaze (M = 57.65%), STP with real-life gaze (M = 56.75%), and STP with extensive gaze (M = 47.69%) conditions.

Fig. 5.

Differences between user reports concerning the percentage of eye contact from a virtual human across 6 conditions.

Users also reported a similar amount of eye contact for the real-life gaze behavior in both the HMP with real-life gaze (M = 57.65%) and STP with real-life gaze (M = 56.75%) conditions. In contrast, they reported much less eye contact for STP with Extensive gaze (M = 47.69%) than for HMP with Extensive gaze (M = 60.14%). This is an interesting inconsistency: disregarding the random gaze conditions, which correspond poorly with natural behavior (HMP with random gaze, M = 36.18%; STP with random gaze, M = 41.63%), STP with Extensive gaze was the worst scoring condition.

This inconsistency might be due to some interaction between Extensive gaze and the “Mona Lisa” effect, which can realign the apparent gaze direction in the STP projection condition; here, the “Mona Lisa” effect appears to break down. Perhaps the Extensive gaze condition magnifies differences between the STP and HMP projection styles, creating an increased perception of eye contact in the HMP condition. Perhaps the STP form of Extensive gaze is eventually perceived as slightly off, and thus as socially inappropriate and not true eye contact.

For users’ feelings of rapport, we ran a factor analysis and obtained four sub-scales of the rapport scale. The factor analysis was a Principal Components Analysis with Varimax rotation (Kaiser-Meyer-Olkin Measure of Sampling Adequacy = .806, Bartlett’s Test of Sphericity < .001). The first factor, Engagement, explains 31.65% of the variance (Cronbach’s Alpha = .91). The second factor, Attachment, explains 14.21% of the variance (Cronbach’s Alpha = .79). The third factor, Closeness, explains 6.59% of the variance (Cronbach’s Alpha = .65). The fourth factor, Connection, explains 6.00% of the variance (Cronbach’s Alpha = .75). As there are only low correlations among the sub-scales, we ran a 2-way ANOVA using condition and gender as independent variables for each sub-scale separately.

The results demonstrate a statistically significant difference among the 6 conditions [F(5, 93) = 2.64, p = .028] (see Fig. 6) and between genders [F(2, 93) = 7.52, p = .001] for Closeness, but no statistically significant results were seen for the other sub-scales. A post-hoc test shows that HMP with extensive gaze (M = 3.71) scored significantly higher than HMP with random gaze (M = 2.39). Overall, there is a trend that users felt greater closeness to the virtual human in the HMP with extensive gaze condition (M = 3.71) than in the HMP with real-life gaze (M = 3.53), STP with extensive gaze (M = 2.94), or STP with real-life gaze (M = 2.92) conditions. This implies that users might have felt more closeness to the virtual human when they received constant mutual gaze via an HMP display, even though such extensive gaze could also be perceived as overwhelming and thus socially inappropriate. Users felt the least closeness to the virtual human in the STP with random gaze condition (M = 2.70). Male users (M = 3.24) reported a greater feeling of closeness to the virtual human than female users (M = 2.69).
There were no other statistically significant results for projection type and gaze pattern nor was there an interaction effect for the two variables.

Fig. 6.

Differences between users’ feelings of closeness to the virtual human across 6 conditions (*p < .05).

5 Discussion

With regard to the percentage of gaze perceived by users, the results show an interesting discrepancy between extensive gaze in the HMP and STP conditions. The STP with Extensive gaze condition provided a lower perceived level of gaze than the HMP with Extensive gaze condition. The Extensive gaze condition might be magnifying small but perceivable differences in the delivery of gaze between the two projection conditions. The trends in the Closeness subscale of the rapport measure suggest that HMP might facilitate social feelings of closeness, particularly with regard to extensive gaze.

Interestingly, participants appear to overestimate the percentage of eye contact provided to them by the virtual character. In the Real-life gaze condition, the virtual character gazes at each game player in turn; with three players, participants could be expected to respond with estimates near 33%, yet the average estimates were over 50%. The Random gaze estimates were near 40%, and should be much lower given the character’s randomly directed gaze. Conversely, participants seemed to discount the continuous gaze provided in the Extensive gaze condition, responding with just over 50% when a reasonable estimate might approach 100%.

The results suggest that behavior of the character is a strong cue for social interactions. For example, the gaze condition was the key factor in users deciding if the character provided the correct and appropriate level of mutual gaze. However, there were some small indications that the HMP condition, when highlighted by extensive gaze, could produce some measurable social effects.

Some participants reported that the character’s eyes appeared to move even in the Extensive gaze condition. This may reveal a limitation of the current REFLCT system that reduced the effectiveness of the gaze enhancement provided by personalized perspective-correct rendering. A human observer estimates the relative pose of an eyeball partly by comparing the brightness of the white sclera regions that surround the iris and are framed by the eye socket; if the relative brightness of the left and right sclera regions varies, the eyeball is perceived as moving left and right. The pico-projectors used in the current REFLCT display take a 640\(\,\times \,\)480 image that is downscaled to 480\(\,\times \,\)320. This low resolution, together with the downscaling approximations, could introduce artifacts: small movements of the user might shift pixels left and right, substantially changing the small number of pixels available to render the left and right sclera regions, causing variations in brightness and apparent eye motion. The current REFLCT resolution probably introduced some uncertainty in judging gaze direction.
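A back-of-envelope calculation suggests how few pixels are available for the eye region. The projector field of view, viewing distance, and rendered eye width below are illustrative assumptions, not REFLCT measurements; only the 480-pixel horizontal resolution comes from the text:

```python
import math

def pixels_across(feature_m, fov_deg=40.0, distance_m=3.0, h_res=480):
    """Horizontal pixels covering a feature of width feature_m on a
    screen distance_m away, for a projector with the given horizontal
    field of view and resolution. FOV, distance, and feature width
    are assumed values for illustration."""
    screen_w = 2.0 * distance_m * math.tan(math.radians(fov_deg / 2.0))
    return feature_m * h_res / screen_w

# A roughly life-sized rendered eye, ~3 cm wide, at 3 m.
eye_px = pixels_across(0.03)
```

Under these assumptions an entire eye spans only about half a dozen pixels, leaving one or two pixels per sclera region, so sub-pixel shifts from head motion plausibly modulate sclera brightness enough to suggest eye movement.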

6 Conclusion and Future Work

Using a multi-party task involving interactions with a virtual human, this study was able to demonstrate some measurable social effects and trends when using the REFLCT system. The REFLCT system provides personalized perspective correct rendering for multiple users using head worn projectors and retroreflective screens. This personalized perspective correct imagery can be used to enhance the portrayal of gaze provided by a virtual character. This is preliminary work that identifies future improvements needed in such systems and the subtlety required to measure the social effects involved.

While the gaze behaviors of the virtual character appeared to be strong factors in determining the social response, the projection condition had some influence on social measures. Differences appear to be most apparent with extensive gaze as rendered in REFLCT’s head mounted projection display and when compared to a simulated traditional projection display.

Future versions of REFLCT and other head mounted projection systems must provide enough resolution to accurately portray eye behavior, especially to convey a steady, direct gaze without apparent movement.