Elsevier

Computers & Education

Volume 140, October 2019, 103605

Reconsidering the Voice Principle with Non-native Language Speakers

https://doi.org/10.1016/j.compedu.2019.103605

Highlights

  • The voice principle is more complex than human or computer voice.

  • Technological advancements in computer voice make it comparable to human voice.

  • Designers should consider the target population when recording vocal elements.

  • Weak-prosodic human voice significantly increases germane cognitive load.

  • Fewer prosodic elements increase agent persona with non-native speakers.

Abstract

Researchers have suggested that use of pedagogical agents speaking with a human voice increases social perception and enables deeper learning when compared against computer-generated voice. However, recent research (Craig & Schroeder, 2017) found modern computer voice was as effective as human voice in certain social measures, and can outperform human voice in particular learning outcomes. This research aimed to study whether two human voice conditions (strong-prosodic and weak-prosodic) produced consistent measures when compared against modern computer voice and each other in social perception and retention measures with non-native speakers. The human weak-prosodic voice was rated significantly higher on four of seven scale items compared to modern computer voice. However, no significant differences were found in the retention of information. These results show that non-native speakers prefer human voice with fewer prosodic elements, and that the factors behind voice are more complicated than simply categorizing it as either human or computer.

Introduction

It is widely believed that pedagogical agents (PAs) speaking with a human voice provide users with a better learning experience than PAs speaking with computer-generated voices. Compared to computer-generated voice, natural human voice has intrinsic features of prosody that convey a significant amount of lexical, semantic, syntactic, and discourse information (Akker & Cutler, 2003; Cutler, Dahan, & Van Donselaar, 1997). These prosodic features can be very difficult for non-native speakers to comprehend, whereas they are easy for native speakers to understand (Vanlancker-Sidtis, 2003). With PA design principles such as social agency theory and the voice principle used to support such voice claims, human voice has been advanced as the superior choice when designing a PA. However, such claims rest on a limited amount of research comparing the two voice conditions in an experimental setting. Therefore, with the continuous advancement of technology, claims of voice benefits need to be further examined and routinely assessed, especially with regard to different populations such as non-native speakers.

One design principle that supports the use of human voice is social agency theory (Mayer, Sobko, & Mautone, 2003). Social agency theory suggests that PA social cues within multimedia presentations activate the user's conversation scheme. If the PA is viewed as a social actor, then users will apply the same social rules found in human-to-human communication. Mayer (2014, p. 345) suggests that three of the most important social cues for PA design are conversational language, human voice, and human-like gestures. These components create a social partnership between the user and the PA that encourages the user to exert more effort during the learning process (Mayer, 2017). However, social agency is more complicated than conversational language, human voice, and human-like gestures. The image of the PA signals to the user that someone is present, which requires a social stance (Nam, Shu, & Chung, 2008). This creates other potential variables to social agency such as gender, age, ethnicity, visual appeal, and dynamism (Van der Meij, Van der Meij, & Harmsen, 2015). Thus, human voice within social agency theory is one component of a larger complex system that contributes to social perception and learning with PAs.

Even though voice is seen as a priming factor within social agency theory, the voice principle (Atkinson, Mayer, & Merrill, 2005) suggests participants learn better from human voice than from computer synthesized voice. Mayer (2017) examined five experiments over three studies (Atkinson et al., 2005; Mayer & DaPra, 2012; Mayer et al., 2003) that directly compared human voice versus computer synthesized voice, and found people learned better when presented with human voice (d = 0.74). A more detailed examination of the experiments comparing learning outcomes between human voice and computer synthesized voice found that participants listening to the human voice condition had significantly higher retention scores (Mayer et al., 2003, Expt. 2), and significantly higher near transfer and far transfer scores (Atkinson et al., 2005, Expts. 1 & 2). However, it must be noted the technology used to create the computer synthesized voice in these experiments is vastly different than the text-to-speech technology currently available (Craig & Schroeder, 2017). In later experiments with advancing technology, Mayer and DaPra (2012) compared human voice and computer synthesized voice with the extra variable of embodiment (low embodiment versus high embodiment) that is measured on the production of human-like gestures, lip synchronization, facial expression, and eye and body movements (Basori & Ali, 2013; Mayer & DaPra, 2012; Ochs, Niewiadomski, & Pelachaud, 2015). From a pure comparison of voice conditions, differences in transfer and retention scores were not significant. However, the level of embodiment combined with the human voice was significant with the transfer of knowledge. As for embodiment and machine voice, no significance was found between the conditions. The authors suggest that embodiment helps participants learn more deeply, but negative social cues like machine voice compromise the potential benefits.
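The d values reported in this literature are standardized mean differences, presumably Cohen's d. As a minimal sketch of how such a value is commonly computed for two independent groups, the following assumes a pooled standard deviation; the cited studies may have used a different variant, and the function name and scores below are hypothetical, invented purely for illustration.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference using a pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    # statistics.variance uses the sample (n - 1) denominator
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical transfer scores (not data from any cited study)
human_voice = [7, 8, 6, 9, 7, 8]
computer_voice = [6, 7, 5, 8, 6, 7]

d = cohens_d(human_voice, computer_voice)
print(f"d = {d:.2f}")
```

A positive d here means the first group scored higher on average; by the usual rule of thumb, values near 0.5 are read as medium effects and values near 0.8 as large.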

Recently, Craig and Schroeder (2017) revisited the issue of voice and accounted for the advancements of technology. In their experiment, the authors compared human voice against two forms of computer-generated voice: modern computer voice and classic computer voice. Modern computer voice was created with the Neospeech voice engine, which integrates today's advanced methods of text-to-speech to sound more natural, and the classic computer voice was created with the Microsoft speech engine, which mirrored the capabilities of text-to-speech software of the early 2000s. Results from the learning outcome measures showed that while retention was not significant across the conditions, transfer of learning was significant. Participants in the modern computer voice condition scored significantly higher than those in the human voice (d = 0.54) and classic computer voice conditions (d = 0.41). The authors propose that voice may not be as important for learning now as it was in the past, and that modern text-to-speech software performs as well as a human recording. In this way, technology has advanced enough in the field of speech production to perform as well as, if not better in some instances than, the human voice. The purpose of this study is to evaluate how human voice with prosody and human voice without prosody compare to modern computer voice in measured outcomes (cognitive load, agent persona, and retention) with non-native speakers of English.

Multimedia environment researchers have routinely been concerned with social perception and whether the persona of the agent is beneficial to the learning process. One of the earliest foundations for agent persona comes from media equation theory (Reeves & Nass, 1996), which suggests humans will apply social rules to media and perceive media agents as real people. However, a specific meaning of “agent persona” has been difficult to establish based on the variety of definitions found in the literature (Schroeder & Adesope, 2014). Persona has been described as the ability to influence the perception of a system (Lester et al., 1997), as an agent that is real and authentic (van Mulken, Andre, & Muller, 1998), as engaging, human-like, credible, and facilitating learning (Baylor & Ryu, 2003), or as anthropomorphized due to life-like features such as facial expressions, body movements, and gestures (Woo, 2008). Although persona might be vaguely defined, research has found that PAs can simulate instructional roles such as motivator, mentor, and expert (Baylor & Kim, 2005). Participants also expect the PA to have a personality (Kim, Baylor, & Shin, 2007) and visually attend to PAs as they would to a human when holding a conversation. However, the concept of agent persona has produced mixed results in the literature. Heidig and Clarebout (2011), in a comprehensive review, attribute these mixed results to the highly complex nature of researching PAs, and of understanding both how PAs need to be designed and the conditions in which they are effective.

Even though evaluating agent persona is a complex process when examining learning outcomes, motivation, and other concepts such as engagement, isolating individual variables such as voice has made it easier to understand the elemental constructs of persona. Early experiments found evidence that human participants applied social rules to computer-based voices, which indicates that voice strongly boosts the perception of social presence (Nass & Steuer, 1993). Thus, social perception of the PA can help researchers in the evaluation of the persona of the agent. However, measuring agent persona in relation to voice has been labeled differently across experiments. Three experiments (Mayer et al., 2003, Expt. 2; Atkinson et al., 2005, Expts. 1 & 2) measured agent persona as speaker rating, which rated human voice significantly higher than computer synthesized voice at effect sizes d = 1.45, d = 0.76, and d = 0.83 respectively. Later voice comparison studies measured agent persona according to the agent persona instrument (API, Baylor & Ryu, 2003), or the translated Korean agent persona instrument (KAPI, Ryu, 2012). The API and KAPI both measure agent facilitation, credibility, human-likeness, and engagement. Using the API, Craig and Schroeder (2017) compared the agent persona with PAs using human voice, modern synthesized voice, and classic computer voice. Both human voice and modern synthesized voice scored significantly higher than classic computer voice in facilitation and credibility, but human voice scored significantly higher than modern synthesized voice and classic computer voice in the categories of human-likeness and engagement. These results indicate that human voice provides better persona than classic computer voice in all measures, but modern synthesized voice may not be as human-like or engaging as human voice, though there is no measurable difference in facilitation and credibility. 
Also, Ryu and Fengfeng (2018) conducted experiments with the KAPI on human voice and the same modern computer voice software (Neospeech) used in the Craig and Schroeder (2017) study. Across two experiments comparing voice type with image, and voice type with screen/presentation size, the results were similar in both studies. Human voice was rated significantly higher than modern computer voice in human-likeness and engagement, but the other subscales failed to reach significance. These recent studies indicate modern computer voice is similar to human voice in some instances, but fails to match the human-likeness and engagement that human voice creates in most experiments.

In addition to agent persona, researchers are concerned with mental processing demands placed upon the cognitive architecture of the learner during the learning process. Since the intrinsic features of prosody in the human voice can convey lexical, semantic, syntactic, and discourse information (Akker & Cutler, 2003; Cutler, Dahan, & Van Donselaar, 1997), the prosodic features can cause difficulty with comprehension for non-native speakers, while native speakers have no such difficulties (Vanlancker-Sidtis, 2003). Thus, the benefits and hindrances of voice processing and comprehension could be viewed through the different aspects of cognitive load theory. Cognitive load theory differentiates the type of processing demands as intrinsic, extraneous, and germane (Paas, Tuovinen, Tabbers, & Van Gerven, 2003). Intrinsic cognitive load is the inherent amount of processing required for an individual topic (Van Merriënboer & Sweller, 2005). Extraneous cognitive load could waste the individual's effort and time on irrelevant or insignificant cognitive processing that does not contribute to schema construction or automation (Paas, Renkl, & Sweller, 2004), while germane cognitive load benefits cognitive processing by accessing knowledge that has already been acquired and automatized (Kalyuga, 2011). However, the interaction between cognitive load and mental processing is complicated: processing in working memory is limited, but intrinsic and extraneous cognitive load are additive by nature, which means both are fluid depending on certain variables (Leppink, van Gog, Paas, & Sweller, 2015). Since intrinsic cognitive load is connected with the inherent difficulty of the topic, a learner's prior knowledge of the content can dramatically influence how much intrinsic cognitive load is present in working memory.
In other words, a multimedia presentation on statistics would, as a whole, cause less intrinsic cognitive load for participants who majored in math education than for participants who majored in history. Likewise, the design of multimedia presentations can increase or decrease extraneous cognitive load within the participant. One such design strategy is the temporal congruity principle, which holds that narration and graphical information should be presented simultaneously rather than separately (Mayer, 2017). Separating voice and graphical representation causes undue mental processing, which increases the processing demand on a limited information system like working memory. Therefore, components linked to intrinsic cognitive load and extraneous cognitive load need to be carefully considered so that working memory is free to process the information being presented.

The element of voice is a key feature in theories examining cognitive load within multimedia presentations. The most prominent theory, the cognitive theory of multimedia learning (CTML), views learning as an interaction between people and computer-based environments using words and pictures. CTML assumes that (1) people use verbal and visual channels to process information; (2) each channel has a limited capacity to process information; and (3) the process of learning is cognitively demanding (Mayer & Moreno, 2003). If participants have to read on-screen text and view graphical representations simultaneously, then the visual channel is likely to become overloaded due to the amount of information that needs to be processed. Therefore, to avoid overloading particular channels, design strategies such as using narration instead of text with graphical representations have the potential to distribute the processing needed to understand the content over two channels instead of one. Overall, there are twelve principles related to CTML that help multimedia instructional designers design content that does not cognitively overload the participant (see Mayer, 2009, for a review).

While the literature comparing human voice versus computer synthesized voice is limited, direct comparisons of human and computer voice in relation to cognitive load are scarcer. Only two experiments (Atkinson et al., 2005, Expts. 1 & 2) attempted to measure cognitive load between voice types and found no significant difference. However, those experiments examined cognitive load as difficulty, which did not detail whether the measure was explicitly measuring intrinsic or extraneous cognitive load. This separation is important since it is possible that extraneous cognitive load may burden working memory and learning ability of listeners with irrelevant information that makes them feel bored or annoyed while listening to a synthesized voice (Wouters, Paas, & van Merriënboer, 2008). Therefore, more precise measures to assess the different types of cognitive load are needed to understand how different components of the PA are cognitively impacting the participant in the multimedia environment.

Although spoken language centers on the words that are articulated, one key feature of the human voice is its ability to alter meaning through prosody. Prosody is pitch, tempo, stress, intonation, melody, loudness, accent and pause (Kent, 1997; Ross, 2000), which helps listeners comprehend the intention of the speaker. Whenever listeners recognize the utterances of others, they are processing prosodic cues (Cutler, Dahan, & Van Donselaar, 1997). For example, stress can add several different meanings to this sentence, “I didn't tell you to do that.” If the speaker says, “I didn't tell you to do that,” the information being communicated suggests someone else was responsible to do something. However, if the speaker says, “I didn't tell you to do that,” the stress at the end signals the listener was to do something, but not what the listener did. For native speakers of the language, these nuances to prosody are easily recognized and understood during discourse.

However, meaning communicated through prosody is not as easy for non-native speakers to comprehend because the listening process is different than the process for a native speaker. One of the striking differences between native speakers and non-native speakers is how the language is processed. Non-native listeners process prosodic information for semantic structure less efficiently than native listeners (Akker & Cutler, 2003). Also, native speakers tend to use a top-down approach that focuses on meaning, whereas non-native speakers utilize a bottom-up approach focusing on words that commands more cognitive resources (Osada, 2001). In addition to the less efficient bottom-up approach, issues such as limited vocabulary, unclear pronunciation, rate of speech, and the ability to segment the speech can complicate the listening process (Goh, 2000; Hasan, 2000). Therefore, non-native speakers' ability to map the prosodic features of spoken language to the semantics of communicative intent is less efficient compared to the abilities of native speakers (Akker & Cutler, 2003). In order to map these prosodic features, Lindfield, Wingfield, and Goodglass (1999) suggest that individuals must have the communicative prosodic information in their lexicon and possess the ability to access these features if they want to produce or recognize prosodic characteristics of language. Because of this, more advanced users of the foreign language may have more cognitive resources available to benefit from prosodic elements than less advanced language learners who need to dedicate vast amounts of cognitive resources to process the language being heard. That is to say, these prosodic elements play a role as indicators of cognitive load in spoken language. Researchers have found that many prosodic changes are most often associated with heavy cognitive load on non-native listeners (Akker & Cutler, 2003), while gestures may help to reduce second language learners' cognitive load (McCafferty, 2004).

The research questions for this experiment sought to answer:

RQ1

To what extent does agent voice affect intrinsic cognitive load, extraneous cognitive load, and germane cognitive load with non-native speakers? (cognitive load)

H1a

Extraneous cognitive load will increase in the modern computer voice condition compared to the two human voice conditions. While Wouters and colleagues suggested that computer synthesized voice had the ability to increase extraneous cognitive load, Veletsianos (2012) provided qualitative evidence that participants found computer synthesized voice to be “obnoxious,” “distracting,” and “hard to listen to at times” (p. 280).

H1b

Germane cognitive load will increase in the human weak-prosodic voice condition due to less efficient prosodic feature mapping and potential decreased access to prosodic features of the language by non-native learners (Akker & Cutler, 2003; Lindfield et al., 1999).

RQ2

In what ways does voice affect non-native speakers' perceptions of the agent's persona? (perception of persona)

H2a

Human weak-prosodic voice will positively influence the agent persona subscale of facilitation against the human strong-prosodic voice due to the bottom-up approach non-native speakers use while listening (Osada, 2001).

H2b

Only human strong-prosodic voice will positively influence the agent persona subscales of credibility, human-like, and engagement, since prosodic features are expected and previous research has found human voice outperforms machine generated voice in these subscales (Mayer & DaPra, 2012; Craig & Schroeder, 2017; Ryu & Fengfeng, 2018).

RQ3

To what extent does voice type influence non-native speakers' retention of information? (retention of information)

H3

Voice will not be a significant factor in the retention of information (Mayer & DaPra, 2012; Craig & Schroeder, 2017).

RQ4

How do the two different human voices compare in cognitive load, agent persona, and retention measurements against each other and the modern computer voice?

H4a

Human voice conditions will show no significant differences when compared against each other.

Section snippets

Design and participants

A between-subjects experimental design was used to study whether PA voice alters perceptive and learning outcomes in computer-based environments with non-native speakers of English. To measure the independent variable of voice, participants were randomly assigned to one of three conditions: human strong-prosodic voice, human weak-prosodic voice, or modern computer voice. The dependent variables were participants' perceived cognitive load, evaluation of agent persona, and the retention of

Prior knowledge

Before using inferential statistics, Levene's test was performed to evaluate whether the data met the assumption of homogeneity of variance. The results indicated no evidence of heterogeneity, F(2, 169) = 2.00, p = 0.138. Thus, it was determined that the data were appropriate for inferential statistics. See Table 3 for prior knowledge scores and retention scores.
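For readers unfamiliar with the procedure, Levene's test can be viewed as a one-way ANOVA run on each score's absolute deviation from its group center. The sketch below is a minimal illustration under stated assumptions: deviations are taken from the group mean (the snippet does not say whether the mean or the median was used), the function names are hypothetical, and the scores are invented rather than taken from the study. Note that degrees of freedom of (2, 169) are what a one-way layout with three groups and 172 scores would give (k − 1 = 2, N − k = 169).

```python
from statistics import mean


def one_way_f(groups):
    """One-way ANOVA F statistic with its between and within degrees of freedom."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


def levene_f(groups):
    """Levene's statistic: ANOVA on absolute deviations from each group mean."""
    deviations = [[abs(x - mean(g)) for x in g] for g in groups]
    return one_way_f(deviations)


# Hypothetical prior-knowledge scores for three voice conditions
strong_prosodic = [4, 6, 5, 7, 3]
weak_prosodic = [5, 5, 6, 4, 5]
modern_computer = [2, 8, 5, 9, 1]

f_stat, df_between, df_within = levene_f(
    [strong_prosodic, weak_prosodic, modern_computer]
)
print(f"F({df_between}, {df_within}) = {f_stat:.3f}")
```

Turning the statistic into a p-value requires the F distribution, which the Python standard library does not provide; a statistics package would normally be used for that step.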

Cognitive load

To analyze how different PA voice types affect cognitive load in computer based environments, ANOVAs were performed on the cognitive load constructs of intrinsic cognitive load, extraneous cognitive load, and germane cognitive load. Intrinsic cognitive load and extraneous cognitive load were not significant with F(2, 169) = 1.843, p = 0.162 and F(2, 169) = 1.445, p = 0.239 respectively. However, germane cognitive load was significant at F(2, 169) = 3.076, p = 0.049. A Tukey's HSD post hoc test

Discussion

This experiment examined the voice principle (Atkinson et al., 2005) in relation to non-native speakers' assessment of cognitive load, evaluation of agent persona, and retention of information with human strong-prosodic voice, human weak-prosodic voice, and modern computer voice. The findings from this research show that understanding the voice principle and its relation to non-native speakers might be more complicated than previously thought. Some patterns were found with previous results

Conclusion

This study sought to explore topics regarding the voice principle with non-native speaking populations. The first objective was to replicate earlier findings that modern computer voice is comparable to human voice in terms of agent persona and retention of knowledge. As far as persona, being comparable depended on which human voice the modern computer voice was being measured against. The human strong-prosodic voice only scored significantly higher on one of the four persona subscales, while

Robert O. Davis is an assistant professor in the Department of English Linguistics and Language Technology at Hankuk University of Foreign Studies. His current research interests involve pedagogical agent gesturing, social acceptance of virtual characters, interaction with computer-based environments, and virtual reality in the foreign language classroom.

References (58)

  • C.S. Nam et al. (2008). The roles of sensory modalities in collaborative virtual environments (CVEs). Computers in Human Behavior.
  • N.L. Schroeder et al. (2017). Measuring pedagogical agent persona and the influence of agent persona on learning. Computers & Education.
  • G. Veletsianos (2012). How do learners respond to pedagogical agents that deliver social-oriented non-task messages? Impact on student learning, perceptions, and experiences. Computers in Human Behavior.
  • O.O. Adesope et al. (2012). Verbal redundancy in multimedia learning environments: A meta-analysis. Journal of Educational Psychology.
  • E. Akker et al. (2003). Prosodic cues to semantic structure in native and nonnative listening. Bilingualism: Language and Cognition.
  • A.L. Baylor et al. (2005). Simulating instructional roles through pedagogical agents. International Journal of Artificial Intelligence in Education.
  • A.L. Baylor et al. (2003). The effects of image and animation in enhancing pedagogical agent persona. Journal of Educational Computing Research.
  • J. Cohen (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement.
  • A. Cutler et al. (1997). Prosody in the comprehension of spoken language: A literature review. Language and Speech.
  • R.O. Davis et al. (2017). Effects of pedagogical agent gestures on social acceptance and learning: Virtual real relationships in an elementary foreign language classroom. Journal of Interactive Learning Research.
  • A.P. Gilakjani et al. (2011). The effect of multimodal learning models on language teaching and learning. Theory and Practice in Language Studies.
  • A.S. Hasan (2000). Learners' perceptions of listening comprehension problems. Language Culture and Curriculum.
  • M.J. Jannati et al. (2018). Speech naturalness improvement via ε-closed extended vectors sets in voice conversion systems. Multidimensional Systems and Signal Processing.
  • S. Kalyuga (2011). Cognitive load theory: How many types of load does it really need? Educational Psychology Review.
  • R.D. Kent. Gestural phonology: Basic concepts and applications in speech-language pathology.
  • Y. Kim et al. (2007). Pedagogical agents as learning companions: The impact of agent emotion and gender. Journal of Computer Assisted Learning.
  • J. Leppink et al. (2015). 18 cognitive load theory: researching and planning teaching to maximise learning. Res. Med. Educ.
  • J.C. Lester et al. The persona effect: Affective impact of animated pedagogical agents.
  • K.C. Lindfield et al. (1999). The contribution of prosody to spoken word recognition. Applied Psycholinguistics.

    Joseph Vincent is a professor in the Department of English for International Conferences and Communication at Hankuk University of Foreign Studies. His current research interests involve mixed media learning and teaching.

    Taejung Park is a research professor in Education Advancement Centre at Hankuk University of Foreign Studies in Seoul, South Korea. Her current research interests focus on instructional design, MOOCs, AR/VR/MR, software education, flipped learning, and future school.
