Investigation of golden speakers for second language learners from imitation preference perspective by voice modification

doi:10.1016/j.specom.2010.08.015

Speech Communication

Volume 53, Issue 2, February 2011, Pages 175-184

https://doi.org/10.1016/j.specom.2010.08.015

Abstract

This paper investigates what voice features (e.g., speech rate and pitch-formants) make a teacher’s voice preferable for second language learners to imitate, when they practice sentence pronunciation using Computer-Assisted Pronunciation Training (CAPT) systems. The CAPT system employed in our investigation uses a single teacher’s voice as the source to automatically resynthesize several sample voices with different voice features based on the features of a learner’s voice. Our approach is different from that in the study conducted by Probst et al. which uses multiple native speakers’ voices as sample voices [Probst, K., Ke, Y., Eskenazi, M., 2002. Enhancing foreign language tutors—in search of the golden speaker. Speech Communication 37 (3–4), 161–173]. Our approach can reduce the influence of characteristics of teachers’ voices (e.g., voice quality and clarity) on the investigation. Our experimental results show that a teacher’s voice, which has similar speech rate and pitch-formants to a learner’s voice, is not always the learner’s first imitation preference. Many factors can influence learners’ imitation preferences, e.g., background and proficiency of the language that they are learning. Also, a learner’s preferences may change at different learning stages. We thus advocate an automatic voice modification function in CAPT systems to provide speech learning material with a wide variety of voice features, e.g., different speech rates or different pitch-formants. Learners then can control the voice modifications according to their preferences.

Research highlights

► Voice modification is used to investigate golden speakers for ESL learners.
► Learners of English as a second language have different imitation preferences.
► Imitation preferences may change at different learning stages.
► An automatic voice modification function is advocated for CAPT systems.

Introduction

The importance of pronunciation in second language learning has been recognized by teachers and learners (Derwing, 2003), since verbal communication between people from different countries is becoming more frequent with the development of economic globalization. Good pronunciation makes speech easy for listeners to understand, while bad pronunciation may become a barrier to verbal communication, or even break down conversations. Thus, language learners are encouraged to improve their pronunciation at least to an intelligible level (Hişmanoğlu, 2006). In the traditional teacher-student-based language learning model, imitation is the most commonly used method to improve pronunciation, and is also considered one of the most effective methods (Ding, 2007).

With the development of speech processing technologies and the popularity of personal computers, Computer-Assisted Pronunciation Training (CAPT) is playing an increasingly important role in pronunciation learning (Eskenazi, 2009). CAPT can provide a private and stress-free learning environment, and allows learners to learn anytime and anywhere a computer is available. Moreover, CAPT can also provide individualized learning material and prompt feedback. Since CAPT can provide individualized learning material and give learners more autonomy, the question arises whether different voices producing the same learning material make a difference for pronunciation learning. In other words, what voices are suitable for language learners to imitate? Some previous research has attempted to answer this question.

Some studies have suggested that language learners can benefit from listening to their own voices producing native-like utterances since it may be easier for them to perceive differences between their own utterances and their native-like utterances (Sundström, 1998, Bissiri and Pfitzinger, 2009). Also, speech synthesis technologies have been developed to synthesize native-like utterances with learners’ voice characteristics (Nagano and Ozawa, 1990, Sundström, 1998, Hirose, 2004, Bissiri and Pfitzinger, 2009, Felps et al., 2009).

In order to correct prosodic errors of a learner’s voice, prosody conversion techniques have been used to transfer the prosodic features of a teacher’s voice to the learner’s voice (Nagano and Ozawa, 1990, Sundström, 1998, Hirose, 2004). However, this prosody transfer keeps the segmental errors (e.g., mispronounced phonemes) in the learner’s voice intact. The segmental errors, which are unavoidable in a learner’s speech, are then inherited by the prosody-modified learner’s voice. Because of the segmental errors, practicing with the prosody-modified learner’s voice goes against the objective of CAPT, which is to help learners produce more native-like utterances in a second language. Thus, utterances resynthesized by mapping the prosody of a teacher’s voice onto a learner’s voice are not suitable for learners to imitate.

The foreign accent conversion proposed in (Felps et al., 2009) is claimed to be able to correct both prosodic and segmental errors. However, due to the distortion generated in the conversion process, this foreign accent conversion lowered the voice quality to 2.67 on a 5-point scale, on which a score of 1 means bad voice quality and a score of 5 means excellent voice quality. Thus, the voice quality of the foreign accent conversion needs to be improved before it can be applied in CAPT systems.

Voice conversion techniques (e.g., Erro and Moreno, 2007), which transform a source speaker’s voice to a target speaker’s voice, can potentially be used to modify a teacher’s utterance to make it sound as if it were produced by a learner. However, the aim of voice conversion is to make a voice sound as if it were produced by the target speaker. Thus, the converted speech also preserves the accent of the target speaker, such as the foreign accent of a language learner. Moreover, voice conversion needs recordings of a set of the teacher’s utterances, as well as the learner’s utterances, which have to be fluent, without errors, and recorded in good quality (Black, 2007), e.g., in a studio-like environment with a high quality microphone. Recording a learner’s voice in such good quality is not an easy task, since not all learners can speak accurately and fluently, and not all learners’ learning environments can meet the studio-like requirements. Thus, more research needs to be done to make the learner’s voice more native-like through voice conversion techniques.

Apart from the immaturity of the speech synthesis technologies used to make a learner’s voice more native-like, there are also some negative opinions about the idea of “hearing your own voice speaking”. For example, Black (2007) claimed that it may be the novelty of this idea that impresses language learners and makes it useful, and moreover that not everyone likes to listen to his/her own voice. Also, to some learners, hearing their own voices could be distracting, and could hinder them from improving their pronunciation.

Some language educators and teachers advocate that CAPT systems should have a number of speakers’ voices for users to select, listen to and imitate. They should also cover different genders, and a wide range of pitch and speech rate (Probst et al., 2002, Dyck, 2002, Lee, 2008). By listening to and imitating their favorite voices, learners might have a better perception of pronunciation. Moreover, hearing multiple voices might also help learners to generalize pronunciation skills that they have gained. This can result in more robust learning.

Lee’s study (2008) shows that learners found it difficult to catch each word and imitate utterances when the speech rates of the utterances were high. Thus, the learners would like to control the speed of speech material. Hearing fast speech might increase learners’ cognitive load, thereby impeding their interpretation and production of speech in a second language. It is understandable that it may be difficult for novices to imitate utterances of fast speakers, as their efforts might be concentrated on how to speed up their speech rather than how to pronounce each word correctly (Lee, 2008).

Also, in Dyck’s (2002) review of “Tsi Karhakta: At the Edge of the Woods” (a CAPT system for the Mohawk language), Dyck indicated that a slow version of the pronunciation of longer words and sentences would be helpful to novices, and that the speech learning material in a system should be produced by at least one male and one female speaker, so that learners could be exposed to more variation in speech. Although slow speech might be beneficial to novices, it is worth noting that slow speech might be detrimental over a long-term course of second language learning, since the objective of second language learning is to perceive and produce natural speech at a regular speed.

However, providing multiple teachers’ voices multiplies the workload of recording speech learning material and the storage space required. Moreover, no matter how wide a range of prosodic features the teachers’ voices cover, they cannot always meet all learners’ needs. Also, the characteristics of the multiple teachers’ voices, such as voice quality and clarity, might have an impact on the learners’ performances.

Although some CAPT systems can provide multiple speakers’ voices, the question of which voice is the “golden voice” for a language learner to imitate is still an open research issue. The pioneering study intended to answer this question was conducted by Probst et al. (2002). Their survey shows that same gender, reasonable speed and clarity are the criteria most commonly mentioned by second language learners for selecting preferred learning utterances. Thus, Probst et al. suggested that CAPT systems should provide multiple teachers’ voices producing the same learning material in order to select the “golden speaker” for different learners. The study conducted by Probst et al. (2002) investigated the “golden speaker” from the pronunciation improvement perspective. In their study, the measurements used to evaluate the effectiveness of different teachers’ voices were the reductions of phone error and duration error from pretest to posttest. The subjects were randomly divided into three groups. Given six native speakers’ voices, Group 1 subjects were allowed to choose one speaker’s voice to imitate by themselves. Group 2 subjects imitated the voices that were the most similar to their own voices in terms of pitch and speed, which were automatically chosen by the CAPT system, FLUENCY (Eskenazi and Hansma, 1998). Group 3 subjects imitated the voices that were the least similar to their own voices, which were chosen by FLUENCY. Probst et al. (2002) found that Group 2 improved their pronunciation slightly more than Group 3, and more significantly than Group 1. In their experiment, learners could practice each sentence as many times as desired. It was noticed that on average Group 1 subjects practiced each sentence fewer times (3.5 times) than Group 2 subjects (4.5 times) and Group 3 subjects (4.8 times). Probst et al. (2002) argued that further testing was needed to determine whether the smaller amount of practice was one of the reasons for the poor performance of Group 1. They also claimed that it might be beneficial for CAPT systems to automatically choose the voice that is the most similar to a learner’s voice for the learner to imitate.
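
The automatic choice of the most and least similar voices in terms of pitch and speed can be illustrated with a small sketch. The text does not say how FLUENCY measures similarity, so the distance used here (equally weighted log ratios of mean pitch and speech rate) is purely an assumption, and the names VoiceProfile, voice_distance and rank_by_similarity, as well as all numeric values, are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceProfile:
    """Hypothetical summary of a voice: mean pitch and speech rate."""
    name: str
    mean_pitch_hz: float     # average fundamental frequency in Hz
    speech_rate_sps: float   # speech rate in syllables per second


def voice_distance(a: VoiceProfile, b: VoiceProfile) -> float:
    """Assumed similarity measure: Euclidean distance between log ratios.

    Log ratios make "twice as high" and "half as high" equally distant,
    and put pitch and rate on a comparable, unit-free scale.
    """
    pitch_term = math.log2(a.mean_pitch_hz / b.mean_pitch_hz)
    rate_term = math.log2(a.speech_rate_sps / b.speech_rate_sps)
    return math.hypot(pitch_term, rate_term)


def rank_by_similarity(learner: VoiceProfile, teachers: list) -> list:
    """Return the teacher voices ordered from most to least similar."""
    return sorted(teachers, key=lambda t: voice_distance(learner, t))


# Illustrative values only; they are not taken from any experiment.
teachers = [
    VoiceProfile("T1", 110.0, 4.5),
    VoiceProfile("T2", 205.0, 3.8),
    VoiceProfile("T3", 180.0, 5.2),
]
learner = VoiceProfile("learner", 200.0, 3.5)

ranked = rank_by_similarity(learner, teachers)
most_similar = ranked[0]    # the kind of voice a Group 2 subject would imitate
least_similar = ranked[-1]  # the kind of voice a Group 3 subject would imitate
print(most_similar.name, least_similar.name)
```

Under this assumed measure, a Group 1 subject would simply pick any entry of the list by hand, while the two automatic conditions correspond to the first and last entries of the ranking.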

The study conducted by Probst et al. (2002) investigated the “golden speaker” from the pronunciation improvement perspective. There is no doubt about the importance of pronunciation improvements since the ultimate goal of pronunciation learning is to improve pronunciation. However, pronunciation improvements can be influenced by many factors, such as learners’ learning ability and proficiency of the language that they are learning, not only the acoustic features of learning material. Also, these factors make it difficult to directly investigate the relationship between speech learning material and pronunciation improvements.

In this paper, we study the “golden speaker” from the learners’ imitation preference perspective. We investigate what voice features make a teacher’s voice preferable for language learners to imitate since learners’ preferred speech learning material may please them and increase their learning interests. As indicated by Arnett (1952), if a teacher speaks with a smooth, easy and pleasant voice, his/her students try to imitate his/her voice. Also, some learners may be more receptive to certain voices. For instance, as claimed by Jacob and Mythili (2008), children might be more receptive to their parents’ or teachers’ voices. A pleasant voice may also help to maintain a positive learning environment that plays an important role in a learning process.

In this paper, we focus on two voice features: speech rate and pitch-formants. In order to provide speech learning material with different voice features, the CAPT system CASTLE (Computer-Assisted Stress pattern Teaching and Learning Environment) is employed in our investigation. CASTLE (Lu et al., 2010), a system that we have recently developed, is intended to help learners of English as a Second Language (ESL) to improve their abilities to correctly use stress patterns (both sentence stress and lexical stress). The learning material in CASTLE is in the form of sentences. To reduce the influence of characteristics of teachers’ voices (e.g., voice quality and clarity), CASTLE uses a single teacher’s voice as the source to automatically resynthesize several sample voices based on a learner’s voice features (i.e., speech rate and pitch-formants) and the learner’s imitation preferences.

Our voice modification transfers the voice features of a learner’s voice to a teacher’s voice, unlike previous prosody conversions, which transfer the prosodic features of a teacher’s voice to a learner’s voice. Because our voice modification is based on a teacher’s voice, the resynthesized utterances can be free from segmental errors. Previous prosody conversions are normally based on a learner’s voice (e.g., Nagano and Ozawa, 1990, Sundström, 1998, Hirose, 2004, Bissiri and Pfitzinger, 2009), which causes the resynthesized utterances to inevitably inherit the segmental errors (e.g., mispronounced phonemes) from the learner’s utterances. Compared with a teacher’s speech, a learner’s speech is more likely to have segmental errors.

Moreover, unlike the approach in (Probst et al., 2002), which needs to record multiple teachers’ voices in order to make the teachers’ voices cover a variety of prosodic features, our approach only needs to record one teacher’s voice. Based on the teacher’s voice, our CAPT system, CASTLE, can resynthesize multiple sample voices with different prosodies by voice modification. Compared with recording multiple teachers’ voices, providing multiple sample voices based on the voice modification reduces the workload of producing speech learning material and saves storage space in a computer. Also, the voice modification can resynthesize voices with any prosodic features that language learners may prefer. By investigating learners’ imitation preferences, CAPT systems can be developed to provide learners’ favorite voices, which may please the learners and promote their learning interests.
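
To make the idea of resynthesizing one teacher’s voice toward a learner’s voice features more concrete, the following sketch derives the scaling factors that a time-scale and pitch modification step would need. It is only an illustration under stated assumptions: this excerpt does not give the actual procedure used in CASTLE, the names VoiceFeatures, ModificationPlan and plan_modification are hypothetical, the strength parameter is an invented control for partial modification, and the treatment of formants is left out because it is not described here.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceFeatures:
    """Hypothetical per-speaker measurements."""
    median_pitch_hz: float   # median fundamental frequency in Hz
    speech_rate_sps: float   # speech rate in syllables per second


@dataclass(frozen=True)
class ModificationPlan:
    duration_factor: float        # > 1 lengthens (slows) the teacher's utterance
    pitch_ratio: float            # multiplier applied to the teacher's pitch
    pitch_shift_semitones: float  # the same shift expressed in semitones


def plan_modification(teacher: VoiceFeatures,
                      learner: VoiceFeatures,
                      strength: float = 1.0) -> ModificationPlan:
    """Compute scaling factors that move the teacher's voice toward the learner's.

    strength = 0 keeps the teacher's voice unchanged; strength = 1 matches the
    learner's median pitch and speech rate. Intermediate values interpolate in
    the log domain (an assumption made for this sketch).
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    pitch_ratio = (learner.median_pitch_hz / teacher.median_pitch_hz) ** strength
    rate_ratio = (learner.speech_rate_sps / teacher.speech_rate_sps) ** strength
    return ModificationPlan(
        duration_factor=1.0 / rate_ratio,  # a slower rate means longer duration
        pitch_ratio=pitch_ratio,
        pitch_shift_semitones=12.0 * math.log2(pitch_ratio),
    )


# Illustrative values only.
teacher = VoiceFeatures(median_pitch_hz=200.0, speech_rate_sps=5.0)
learner = VoiceFeatures(median_pitch_hz=100.0, speech_rate_sps=4.0)

plan = plan_modification(teacher, learner)
print(plan)
```

The resulting factors would then be handed to a resynthesis tool; a learner-controlled interface could expose strength, or the two factors directly, so that learners adjust the sample voices according to their own preferences.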

This paper is organized as follows. In Section 2, we present the voice modification techniques which were employed in our study to resynthesize sample voices with different voice features. Section 3 describes the setup of the experiments that we conducted to explore language learners’ imitation preferences. Experimental results and discussions are provided in Section 4. Section 5 concludes our present work and discusses our future work.

Section snippets

Voice modification

Based on a teacher’s voice, our CASTLE system resynthesizes sample voices with different voice features (i.e., speech rate and pitch-formants) by voice modification. In the following, we identify the teacher’s utterances as original teacher’s utterances, and identify the resynthesized utterances as individualized teacher’s utterances. The individualized teacher’s utterances are automatically resynthesized based on the original teacher’s utterances and learners’ preferences. Our voice

Setup of the experiments

The experiments are to investigate how the voice features (i.e., speech rate and pitch-formants) of teachers’ voices influence learners’ imitation preferences. We tested the following two hypotheses: (i) whether language learners prefer to imitate voices that sound like being produced by the same genders as themselves and possess similar pitches to their own voices; (ii) whether language learners prefer to imitate voices with speech rates close to their own voices. We expected that learners

Experimental results and discussion

The distributions of the utterances that the subjects labeled as most and least wanted for imitation are given in Fig. 2(a). Because a learner could label none, one, or more than one of the three types of resynthesized individualized teacher’s utterances of each sentence as the most (or least) wanted speech, in total 146 utterances were labeled by the subjects as the most wanted to be imitated, and 141 utterances as the least wanted to be imitated. Among the utterances

Conclusions and future work

In this paper, we have investigated what voice features (i.e., speech rate and pitch-formants) make a teacher’s voice a “golden voice” that is preferable for a language learner to imitate.

Our approach to searching for the “golden voice” is different from that of the study conducted by Probst et al. (2002). Probst et al. investigated the “golden voice” from the perspective of learners’ pronunciation improvement, while we investigated the “golden voice” from the perspective of learners’ imitation preferences. Providing

References (27)

Bissiri, M.P., et al., 2009. Italian speakers learn lexical stress of German morphologically complex words. Speech Comm.
Ding, Y., 2007. Text memorization and imitation: The practices of successful Chinese learners of English. System.
Eskenazi, M., 2009. An overview of spoken language technology for education. Speech Comm.
Felps, D., et al., 2009. Foreign accent conversion in computer assisted pronunciation training. Speech Comm.
Moulines, E., et al., 1990. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Comm.
Probst, K., et al., 2002. Enhancing foreign language tutors—in search of the golden speaker. Speech Comm.
Arnett, M.K., 1952. Does the elementary teacher have time to teach speech? J. Southern States Comm. Assoc.
Black, A., 2007. Speech synthesis for educational technology. In: Proc. ISCA ITRW SLaTE Workshop on Speech and Language...
Boersma, P., Weenink, D., 2009. Praat: doing phonetics by computer (Version 5.1.05). <http://www.praat.org/> (retrieved...
Clark, J., et al., 2007. An Introduction to Phonetics and Phonology.
Derwing, T.M., 2003. What do ESL students say about their accents? Can. Mod. Lang. Rev.
Dyck, C., 2002. Review of Tsi Karhakta: At the edge of the woods. Lang. Learn. Technol.
Erro, D., Moreno, A., 2007. Weighted frequency warping for voice conversion. In: InterSpeech 2007 –...

