How we trust, perceive, and learn from virtual humans: The influence of voice quality

https://doi.org/10.1016/j.compedu.2019.103756

Highlights

  • Voice quality of a virtual human pedagogical agent can impact its trust ratings.

  • Voice quality did not influence learning outcomes.

  • Trust is related to the perceived virtual human qualities of facilitating learning, credibility, and engagement.

  • Trust in virtual humans may be an important factor to consider in future research.

Abstract

Research has shown that creating environments in which social cues are present (social agency) benefits learning. One way to create these environments is to incorporate a virtual human as a pedagogical agent in computer-based learning environments. However, essential questions remain about virtual human design, such as what voice the virtual human should use to communicate. Furthermore, to date, research in the education literature around virtual humans has largely ignored one potentially salient construct – trust. This study examines how the quality of a virtual human's voice influences learning, perceptions, and trust in the virtual human. Results of an online study show that voice quality did not significantly influence learning, but it did influence trust and learners' other perceptions of the virtual human. Consistent with recent work, this study questions the efficacy of the voice effect and highlights areas of research around trust to further extend social agency theory in virtual human based learning environments.

Introduction

As artificial intelligence becomes more prevalent in everyday technology use, understanding what impacts people's trust in technology becomes increasingly important for technology design. Trust guides people's reliance on and compliance with technology. Therefore, trust has important implications for technology acceptance, adoption, and appropriate use (Davis, 1985; Muir, 1987). This may be particularly true in relation to learning technologies, where there are multiple stakeholders who must maintain a level of trust in those technologies to realize their effective and sustained use.

Past research has investigated myriad factors that influence trust in technology, including trust disposition, past experiences, task characteristics, work environment factors, and technology characteristics (Hoff & Bashir, 2014; Lee & See, 2004). Technology characteristics are of particular interest because they are often what designers have the most control over. One increasingly important design component of learning technologies is the virtual human. Virtual humans, which can take the role of pedagogical agents (Schroeder, Adesope, & Gilbert, 2013) or conversational agents (Graesser, Cai, Morgan, & Wang, 2017) depending on their specific implementation, have been posited as a way to socially engage the learner with the learning system because they can add social presence, dynamic interactions, and feedback to a wide range of computer-mediated tasks (Craig & Schroeder, 2018; Park & Catrambone, 2007). As digital embodiments of artificially intelligent agents, virtual humans may be used to mediate interactions between people and computers to more effectively accomplish a goal, whether it is coordinating schedules (Horvitz, 1999), walking a customer through a sales process (Cassell & Bickmore, 2000), or learning about a topic (Craig & Schroeder, 2018; Schroeder et al., 2013).

The design of virtual humans is complex because they can influence the social interaction a learner has with the system (Bailenson & Yee, 2005; Nass & Moon, 2000). Heidig and Clarebout (2011) proposed a multi-level framework of design considerations for creating a virtual human or other pedagogical agent. Beyond the visual element, interactive structure, timing, and content that make up the design of a virtual human, voice is another prominent characteristic that has been of particular interest. Using a human-recorded voice for a wide range of customized applications is difficult to achieve at scale, so researchers and designers have turned to text-to-speech technology, or computer-generated voice. Computer-generated voice has been found to increase cognitive (analytic) and emotional (affective) trust in a virtual human compared to a virtual human without voice (Qiu & Benbasat, 2005). Others have investigated the effects of vocal pitch (Elkins & Derrick, 2013), personality (Nass & Lee, 2001a, 2001b), and gender in voice (Lee, Nass, & Brave, 2000), all of which have been found to influence trust. However, while these studies investigated how various qualities of an agent's voice affect trust, they have not done so in an educational context.

The current study investigates the voice effect, or voice principle (Mayer, 2014) of virtual humans. The voice effect states that narration should be provided by “a standard-accented human voice” (Mayer, 2014, p. 358). However, recent work using new technology has called this effect into question (Craig & Schroeder, 2017) and also asked if voice quality is the only factor influencing the voice effect (Craig & Schroeder, 2017; Davis, Vincent, & Park, 2019; Finkelstein, Yarzebinski, Vaughn, Ogan, & Cassell, 2013). This study examines how a virtual human's voice influences learning, perceptions, and trust.

Virtual humans are human-like representations of software agents that appear on-screen to facilitate some sort of interaction between the learner and the computer (Craig & Schroeder, 2017, 2018; Badler, Phillips, & Webber, 1993; Yee & Bailenson, 2007). When virtual humans are integrated into learning technologies, researchers often leverage social agency theory as their theoretical framework. In short, social agency theory suggests that learners will learn more if there are social cues integrated into the learning environment, because these social cues encourage them to put forth more effort (Mayer, Sobko, & Mautone, 2003). To date, there have been limited extensions of social agency theory in the virtual human literature. More research exists in the realm of pedagogical agents (which can be non-humanlike representations of software agents), although it is still sparse. For example, in their systematic review, Heidig and Clarebout (2011) found only 24 studies examining the design of the agent. While it is known that the valence of social cues can influence learning from these types of systems (Domagk, 2010), clearly more systematic research is needed. This study aims to extend an existing, systematic line of research (Craig & Schroeder, 2017; Davis et al., 2019) examining the influence of one particular social cue, the virtual human's voice.

The voice effect states that in order to promote deep learning, narration should be provided using recorded human voices of standard accent rather than machine-synthesized voices, such as those created through text-to-speech engines (Mayer, 2014). Studies in the mid-2000s seemed to affirm this effect, as results from different studies largely showed that the recorded human voice was superior to text-to-speech generated voices (Atkinson, Mayer, & Merrill, 2005; Mayer et al., 2003).

However, there has been a recent resurgence of scholarly interest in the voice effect, particularly in relation to virtual humans (Craig & Schroeder, 2017; Davis et al., 2019). For example, Craig and Schroeder (2017, 2019) posited that the voice effect could have been due to the technologies available at the time of the research in the mid-2000s, and so they sought to re-examine the effect. Craig and Schroeder (2017) compared the effects of a recorded human voice, a modern text-to-speech engine, and the text-to-speech engine used in the studies that informed the voice effect during the 2000s. They concluded that the type of voice used by a virtual human, when comparing modern text-to-speech or recorded human voice, may not be as important for learning outcomes as once postulated, and that “modern voice engines may be just as effective … as a recorded human voice” (p. 201). Craig and Schroeder (2019) tested the same voice conditions, but without a virtual human present, and found similar results. These results suggest that learners' perceptions of voice and of virtual humans can be of critical importance.

Researchers frequently use the Agent Persona Instrument (API; Ryu & Baylor, 2005) to measure learners' perceptions of a virtual human. Specifically, the API has four subscales that measure perceptions of how humanlike, credible, and engaging the virtual human was, and how well it facilitated learning. Craig and Schroeder (2017) found that the human voice and the modern text-to-speech engine both outperformed the older text-to-speech engine on the measures of credibility and facilitating learning. There was not a significant difference between the modern text-to-speech voice and the human voice in terms of credibility and facilitating learning. However, as may be expected, the human voice outperformed the two text-to-speech voices on the measures of human-likeness and engagement.

Recently, Schroeder, Romine, and Craig (2017) conducted a path analysis examining the influence of the API subscales on each other and on learning outcomes. Their results showed that while the API subscales significantly influenced each other, they did not significantly influence learning. Moreover, a confirmatory factor analysis and Rasch analysis of the API showed that while the instrument performed well overall, a few items might benefit from modification. Accordingly, Schroeder, Yang, Banerjee, Romine, and Craig (2018) revised the API into an instrument they called the API-R. Their confirmatory factor analysis confirmed the factor structure of the instrument; subsequent cluster analyses and regression showed that the API-R subscales had a small effect on learning. They concluded that in order to understand what perceptions may influence learning with virtual humans, researchers may need to look outside of the API-R. Their conclusion, in part, helped inspire this study's investigation of learners' trust in the virtual human.

Trust has long been recognized as an important facilitator of interactions, transactions, agreements, and exchanges between people, and in the past four decades it has become a well-studied concept in human-technology interaction. In the late 1980s and 1990s, trust gained traction in the applied sciences alongside advances in automation and was investigated more directly in the context of human-technology interaction with the goal of improving technology design (Muir, 1987, 1994; Muir & Moray, 1996).

Several trust scales were subsequently developed to measure a person's trust in technology through self-report (to name a few: Komiak & Benbasat, 2004; Jian, Bisantz, & Drury, 2000; Madsen & Gregor, 2000; Moore & Benbasat, 1991). The expectation was that these scales could capture people's perceptions of technology, perceptions that might predict behaviors with technology (e.g., technology acceptance, adoption, and use) and, in many cases, human-technology performance.

Because self-report measures of trust are reflective perceptions, such measures may not always capture or predict in-the-moment behavior, which is critical for interactions with intelligent agents (Takayama, 2009). Therefore, studies investigating trust in agents commonly use a behavioral or performance indicator of trust in addition to a trust scale. Beyond the established self-report scales (e.g., ratings of how reliable, dependable, or familiar the technology feels), binary behaviors such as adoption or use of the technology (Venkatesh, Morris, & Davis, 2003) and reliance and compliance (Meyer & Lee, 2013) have also been used as proxy indicators of trust, although it is generally understood that these behavioral indicators may also occur in the absence of trust.

To distinguish trust from an intention or behavior, which has the potential to confuse the effects of trust with the effects of other factors that can influence performance (e.g., workload, self-confidence, social pressure), trust is defined to be “the attitude that an agent will help achieve an individual's goals in a situation characterized by uncertainty and vulnerability” (Lee & See, 2004, p. 54). This definition of trust has been used in several subsequent studies of trust in automation, especially in high criticality domains in which risk is apparent, such as medical devices (Montague, Kleiner, & Winchester, 2009), emergency options in commercial aviation (Lyons et al., 2016), or automated vehicles (Lee & Kolodge, 2019) to name a few.

However, current models of trust in automation might be extended to include new performance indicators of trust, such as learning outcomes. Automated agents as learning facilitators qualitatively differ from the agents in function-allocation roles (Fitts et al., 1951) with people in high-risk industrial or commercial settings (Sheridan, 2002). A learning context in which agents could be peers or mentors, rather than tools that need supervision, softens the lines of function-allocation and broadens the view of trust in automation from decisions to rely or not rely on what the automation has to offer, to a more nuanced view of how social qualities such as paralinguistic behavior may affect learners' perceptions of the virtual human.

Although previous studies have addressed various design qualities of virtual human agents used for learning and their impact on trust (Liew, Tan, & Jayothisa, 2013; Roselyn Lee et al., 2007), and previous studies have also looked into the effects of voice quality of pedagogical agents on learning outcomes (Atkinson et al., 2005; Domagk, 2010), few if any studies have looked into the effects of voice quality on trust in a pedagogical virtual human, in the context of learning outcomes.

It is clear that the voice effect remains an important component in virtual human design. Furthermore, it seems plausible that a virtual human's voice can influence a learner's trust in the virtual human. This study examines how the virtual human's voice influences learning, perceptions (via the API-R), and trust.

The study was divided into two parts based on conceptual objectives. Part one examined how a virtual human's voice influences learning outcomes, perception outcomes, and trust. The research questions for part one are as follows:

  • 1. How does a virtual human's voice influence learning outcomes?

  • 2. How does a virtual human's voice influence learners' trust in the virtual human?

  • 3. How does a virtual human's voice influence perceptions of the virtual human as measured by the API-R?

As noted above, little research exists examining the influence of virtual human voice on learner trust specifically. Accordingly, part two of this study explores the relationships between the constructs measured. The specific question for part two was:

  • 1. What are the relationships between the API-R subscales, trust, and learning outcomes?

Section snippets

Design

This study implemented a randomized alternative treatments design with a pretest (Shadish, Cook, & Campbell, 2002). After recruitment, the experiment was implemented using the Qualtrics survey system, including the trickle-process randomization procedure. Participants were randomized into one of three treatments: a low-quality text-to-speech (TTS) engine female voice (Microsoft speech engine), a high-quality TTS engine female voice (Neospeech voice engine), and a human voice (native female English
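For illustration, below is a minimal sketch of how participants might be randomly assigned to the three voice conditions. It stands in for, rather than reproduces, the Qualtrics trickle-process randomizer used in the study; the condition labels, participant identifiers, and seed are hypothetical.

```python
# Minimal sketch of simple random assignment to the three voice conditions.
# This is a hypothetical stand-in, not the Qualtrics trickle-process
# randomizer actually used in the study.
import random

CONDITIONS = ["low_quality_tts", "high_quality_tts", "human_voice"]

def assign_conditions(participant_ids, seed=42):
    """Randomly assign each participant to one of the three voice conditions."""
    rng = random.Random(seed)
    return {pid: rng.choice(CONDITIONS) for pid in participant_ids}

if __name__ == "__main__":
    assignments = assign_conditions([f"P{i:03d}" for i in range(1, 11)])
    for pid, condition in assignments.items():
        print(pid, condition)
```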

Part 1 results and discussion: group differences

The randomized pretest-posttest design with one independent variable (three groups) and continuous dependent variables suggests that analysis of covariance (ANCOVA) would be appropriate to address the research question concerning learning outcomes. Also, analysis of variance (ANOVA) would be appropriate to address the research questions concerning participant perceptions. However, because unequal sample sizes could produce differences in the homogeneity of variance between groups, a Levene's
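For illustration, the following sketch shows how this analysis plan could be carried out in Python on hypothetical data; the variable names, group sizes, and simulated scores are assumptions, not the study's data.

```python
# Minimal sketch of the analysis plan described above, run on hypothetical
# data (not the study's dataset): Levene's test for homogeneity of variance,
# ANCOVA on posttest scores with pretest as covariate, and a one-way ANOVA
# on a perception measure (trust).
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 30  # hypothetical participants per condition
df = pd.DataFrame({
    "condition": np.repeat(["low_tts", "high_tts", "human"], n),
    "pretest": rng.normal(10, 2, 3 * n),
    "posttest": rng.normal(12, 2, 3 * n),
    "trust": rng.normal(5, 1, 3 * n),
})

# Levene's test: check homogeneity of variance across the three groups.
groups = [g["posttest"].to_numpy() for _, g in df.groupby("condition")]
print(stats.levene(*groups))

# ANCOVA: posttest by voice condition, controlling for pretest scores.
ancova = smf.ols("posttest ~ C(condition) + pretest", data=df).fit()
print(sm.stats.anova_lm(ancova, typ=2))

# One-way ANOVA: trust ratings by voice condition.
anova = smf.ols("trust ~ C(condition)", data=df).fit()
print(sm.stats.anova_lm(anova, typ=2))
```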

What are the relationships between the API-R subscales, trust, and learning outcomes?

Correlational analyses were used to investigate the relationships between trust, the API-R subscales, and learning outcomes. Cohen's (1988) recommended thresholds (0.1 to 0.3 is small, 0.3 to 0.5 is moderate, and 0.5 and above is large) were used to interpret the magnitude of the correlations.
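For illustration, the sketch below shows how such correlations could be computed and labeled with these thresholds; the scores and subscale names are hypothetical, not the study's data.

```python
# Minimal sketch of the correlational analysis on hypothetical scores:
# Pearson correlations between trust, a posttest measure, and an API-R
# subscale, with magnitudes labeled using Cohen's (1988) thresholds.
import numpy as np
from scipy import stats

def cohen_label(r):
    """Label correlation magnitude per Cohen (1988)."""
    r = abs(r)
    if r >= 0.5:
        return "large"
    if r >= 0.3:
        return "moderate"
    if r >= 0.1:
        return "small"
    return "negligible"

rng = np.random.default_rng(1)
trust = rng.normal(5, 1, 100)                # hypothetical trust ratings
posttest = rng.normal(12, 2, 100)            # hypothetical posttest scores
credibility = trust + rng.normal(0, 1, 100)  # hypothetical API-R subscale

for name, scores in [("posttest", posttest), ("credibility", credibility)]:
    r, p = stats.pearsonr(trust, scores)
    print(f"trust vs {name}: r = {r:.2f} ({cohen_label(r)}), p = {p:.3f}")
```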

A series of correlations were performed to explore the relationships between trust, learning measures (posttest knowledge measures), and the four API-R subscales. Correlations between trust and learning measures

General discussion

The results show that voice quality did not significantly influence learning, calling into question the voice effect, but voice quality did influence trust and learners' other perceptions of the virtual human. Overall, the findings indicate that the human recorded voice is superior to the computerized voices when it comes to positive perceptions of the virtual human. However, the high-quality voice engine may not be far behind when it comes to maintaining positive perceptions. Although the

Conclusion

This study manipulated and tested the influence of one specific social cue – voice quality – of a virtual human agent in a pedagogical role and its impact on learning outcomes, trust, and other perceptions of the agent. Echoing recent research, the results from this study question the efficacy of the voice effect on learning, but also raise new insights about the potential effect of trust on learning. Although voice quality was observed to impact trust in the virtual human, the implications of

Acknowledgements

The authors thank Algelia Burton and Ameera Patel for assistance with data coding.

References

  • Azevedo, R.F., et al. Using conversational agents to explain medication instructions to older adults.
  • Badler, N.I., et al. (1993). Simulating humans: Computer graphics animation and control.
  • Bailenson, J.N., et al. (2005). Digital chameleons: Automatic assimilation of nonverbal gestures in immersive virtual environments. Psychological Science.
  • Carrell, S.E., et al. (2010). Does professor quality matter? Evidence from random assignment of students to professors. Journal of Political Economy.
  • Cassell, J., et al. (2000). External manifestations of trustworthiness in the interface. Communications of the ACM.
  • Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement.
  • Cohen, J. (1988). Statistical power analysis for the behavioral sciences.
  • Craig, S.D., et al. (2002). Animated pedagogical agents in multimedia educational environments: Effects of agent properties, picture features, and redundancy. Journal of Educational Psychology.
  • Craig, S.D., et al. Design principles for virtual humans in educational technology environments.
  • Craig, S.D., et al. (2019). Text to speech software and learning: Investigating the relevancy of the voice effect. Journal of Educational Computing Research.
  • Davis, F. (1985). A technology acceptance model for empirically testing new end-user information systems: Theory and results.
  • Davis, F., et al. (1989). User acceptance of computer technology: A comparison of two theoretical models. Management Science.
  • DeVault, D., et al. SimSensei Kiosk: A virtual human interviewer for healthcare decision support.
  • Domagk, S. (2010). Do pedagogical agents facilitate learner motivation and learning outcomes? The role of the appeal of agent's appearance and voice. Journal of Media Psychology.
  • Elkins, A.C., et al. (2013). The sound of trust: Voice as a measurement of trust during interactions with embodied conversational agents. Group Decision and Negotiation.
  • Finkelstein, S., et al. The effects of culturally congruent educational technologies on student achievement.
  • Fitts, P.M.P., et al.
  • Hauser, D.J., et al. (2016). Attentive turkers: MTurk participants perform better on online attention checks than do subject pool participants. Behavior Research Methods.
  • Hoff, K., et al. (2014). Trust in automation: Integrating empirical evidence on factors that influence trust. Human Factors: The Journal of the Human Factors and Ergonomics Society.
  • Horvitz, E. Principles of mixed-initiative user interfaces.
  • Jian, J.-Y., et al. (2000). Foundations for an empirically determined scale of trust in automated systems. International Journal of Cognitive Ergonomics.
  • Komiak, S.X., et al. (2004). Understanding customer trust in agent-mediated electronic commerce, web-mediated electronic commerce, and traditional commerce. Information and Technology Management.
  • Lee, J.D., et al. (2019). Exploring trust in self-driving vehicles with text analysis. Human Factors.
  • Lee, J.D., et al. (1992). Trust, control strategies and allocation of function in human-machine systems. Ergonomics.
  • Lee, E.J., et al. (2000).