Investigation of golden speakers for second language learners from imitation preference perspective by voice modification
Research highlights
► Voice modification is used to investigate golden speakers for ESL learners.
► Learners of English as a second language have different imitation preferences.
► Imitation preferences may change at different learning stages.
► An automatic voice modification function is advocated for CAPT systems.
Introduction
The importance of pronunciation in second language learning has been recognized by teachers and learners (Derwing, 2003), since verbal communication between people from different countries is becoming more frequent with the development of economic globalization. Good pronunciation makes a speaker easy to understand, while bad pronunciation may become a barrier to verbal communication, or even break down conversations. Thus, language learners are encouraged to improve their pronunciation at least to an intelligible level (Hişmanoğlu, 2006). In the traditional teacher-student-based language learning model, imitation is the most commonly used method to improve pronunciation, and is also considered one of the most effective methods (Ding, 2007).
With the development of speech processing technologies and the popularity of personal computers, Computer-Assisted Pronunciation Training (CAPT) is playing an increasingly important role in pronunciation learning (Eskenazi, 2009). CAPT can provide a private and stress-free learning environment, and allows learners to learn at any time and wherever a computer is available. Moreover, CAPT can provide individualized learning material and prompt feedback. Since CAPT can provide individualized learning material and give learners more autonomy, the question arises whether different voices producing the same learning material make a difference for pronunciation learning. In other words, what voices are suitable for language learners to imitate? Some previous research has attempted to answer this question.
Some studies have suggested that language learners can benefit from listening to their own voices producing native-like utterances since it may be easier for them to perceive differences between their own utterances and their native-like utterances (Sundström, 1998, Bissiri and Pfitzinger, 2009). Also, speech synthesis technologies have been developed to synthesize native-like utterances with learners’ voice characteristics (Nagano and Ozawa, 1990, Sundström, 1998, Hirose, 2004, Bissiri and Pfitzinger, 2009, Felps et al., 2009).
In order to correct prosodic errors in a learner’s voice, prosody conversion techniques have been used to transfer the prosodic features of a teacher’s voice to the learner’s voice (Nagano and Ozawa, 1990, Sundström, 1998, Hirose, 2004). However, this prosody transfer leaves the segmental errors (e.g., mispronounced phonemes) in the learner’s voice intact. The segmental errors, which are unavoidable in a learner’s speech, are then inherited by the prosody-modified learner’s voice. Because of these segmental errors, practicing with the prosody-modified learner’s voice goes against the objective of CAPT, which is to help learners produce more native-like utterances in a second language. Thus, utterances resynthesized by mapping the prosody of a teacher’s voice onto a learner’s voice are not suitable for learners to imitate.
The foreign accent conversion proposed in (Felps et al., 2009) is claimed to be able to correct both prosodic and segmental errors. However, due to the distortion generated in the conversion process, this foreign accent conversion lowered the voice quality to 2.67 on a 5-point scale, on which a score of 1 means bad voice quality and a score of 5 means excellent voice quality. Thus, the voice quality of the foreign accent conversion needs to be improved before it can be applied in CAPT systems.
Voice conversion techniques (e.g., Erro and Moreno, 2007), which transform a source speaker’s voice into a target speaker’s voice, can potentially be used to modify a teacher’s utterance to make it sound as if it were produced by a learner. However, the aim of voice conversion is to make a voice sound as if it were produced by the target speaker. Thus, the converted speech also preserves the accent of the target speaker, such as the foreign accent of a language learner. Moreover, voice conversion requires recording a set of the teacher’s utterances, as well as a set of the learner’s utterances, which have to be fluent, free of errors, and recorded in good quality (Black, 2007), e.g., in a studio-like environment with a high quality microphone. Recording a learner’s voice in such good quality is not an easy task, since not all learners can speak accurately and fluently, and not all learners’ learning environments can meet the studio-like requirements. Thus, more research needs to be done before voice conversion techniques can make a learner’s voice more native-like.
Apart from the immaturity of the speech synthesis technologies used to make a learner’s voice more native-like, there are also some negative opinions about the idea of “hearing your own voice speaking”. For example, Black (2007) claimed that it may be the novelty of this idea that impresses language learners and makes it useful, and moreover that not everyone likes to listen to his/her own voice. Also, to some learners, hearing their own voices could be distracting, and could hinder them from improving their pronunciation.
Some language educators and teachers advocate that CAPT systems should have a number of speakers’ voices for users to select, listen to and imitate. They should also cover different genders, and a wide range of pitch and speech rate (Probst et al., 2002, Dyck, 2002, Lee, 2008). By listening to and imitating their favorite voices, learners might have a better perception of pronunciation. Moreover, hearing multiple voices might also help learners to generalize pronunciation skills that they have gained. This can result in more robust learning.
Lee’s study (2008) shows that learners found it difficult to catch each word and imitate utterances when the speech rates of the utterances were high. Thus, the learners would like to control the speed of speech material. Hearing fast speech might increase learners’ cognitive load, thereby impeding their interpretation and production of speech in a second language. It is understandable that it may be difficult for novices to imitate utterances of fast speakers, as their efforts might be concentrated on how to speed up their speech rather than how to pronounce each word correctly (Lee, 2008).
Also, in Dyck’s (2002) review of “Tsi Karhakta: At the Edge of the Woods” (a CAPT system for the Mohawk language), Dyck indicated that a slow version of the pronunciation of longer words and sentences would be helpful to novices, and that the speech learning material in a system should be produced by at least one male and one female speaker, so that learners could be exposed to more variation in speech. Although slow speech might be beneficial to novices, it is worth noting that slow speech might be detrimental over a long-term course of second language learning, since the objective of second language learning is to perceive and produce natural speech at a regular speed.
However, providing multiple teachers’ voices multiplies the workload of recording speech learning material and the storage space required. Moreover, no matter how wide a range of prosodic features the teachers’ voices cover, they cannot always meet all learners’ needs. Also, the characteristics of the multiple teachers’ voices, such as voice quality and clarity, might have an impact on the learners’ performance.
Although some CAPT systems can provide multiple speakers’ voices, the question of which voice is the “golden voice” for a language learner to imitate is still an open research issue. The pioneering study intended to answer this question was conducted by Probst et al. (2002). Their survey shows that same gender, reasonable speed and clarity are the criteria most commonly mentioned by second language learners when selecting preferred learning utterances. Thus, Probst et al. suggested that CAPT systems should provide multiple teachers’ voices producing the same learning material in order to select the “golden speaker” for different learners. Their study investigated the “golden speaker” from the pronunciation improvement perspective: the effectiveness of different teachers’ voices was measured by the reductions in phone error and duration error from pretest to posttest. The subjects were randomly divided into three groups. Given six native speakers’ voices, Group 1 subjects were allowed to choose one speaker’s voice to imitate by themselves. Group 2 subjects imitated the voices that were the most similar to their own voices in terms of pitch and speed, which were automatically chosen by the CAPT system, FLUENCY (Eskenazi and Hansma, 1998). Group 3 subjects imitated the voices that were the least similar to their own voices, which were also chosen by FLUENCY. Probst et al. (2002) found that Group 2 improved their pronunciation slightly more than Group 3, and more significantly than Group 1. In their experiment, learners could practice each sentence as many times as desired. It was noticed that, on average, Group 1 subjects practiced each sentence fewer times (3.5 times) than Group 2 subjects (4.5 times) and Group 3 subjects (4.8 times). Probst et al. (2002) argued that further testing was needed to determine whether the smaller amount of practice was one of the reasons for the poor performance of Group 1. They also claimed that it might be beneficial for CAPT systems to automatically choose the voice that is the most similar to a learner’s voice for the learner to imitate.
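The similarity-based selection described above can be illustrated with a minimal sketch. It is not the FLUENCY implementation, whose matching procedure is not detailed here; the `Voice` record, the feature values, and the distance measure (pitch and speech-rate gaps on a logarithmic scale, combined as a Euclidean distance) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from math import hypot, log2


@dataclass
class Voice:
    """Hypothetical summary of a speaker's voice (illustrative fields only)."""
    name: str
    mean_pitch_hz: float       # average fundamental frequency
    syllables_per_sec: float   # speech rate


def distance(a: Voice, b: Voice) -> float:
    """Assumed dissimilarity: log-scaled pitch and rate gaps, combined Euclidean-style."""
    pitch_gap = 12 * log2(a.mean_pitch_hz / b.mean_pitch_hz)        # in semitones
    rate_gap = 12 * log2(a.syllables_per_sec / b.syllables_per_sec)  # same log scale
    return hypot(pitch_gap, rate_gap)


def rank_by_similarity(learner: Voice, teachers: list[Voice]) -> list[Voice]:
    """Return the teachers' voices ordered from most to least similar to the learner."""
    return sorted(teachers, key=lambda t: distance(learner, t))


# Made-up feature values, for illustration only.
learner = Voice("learner", 210.0, 3.5)
teachers = [Voice("A", 120.0, 4.5), Voice("B", 200.0, 3.8), Voice("C", 230.0, 5.5)]
ranked = rank_by_similarity(learner, teachers)
print("most similar (Group 2 condition):", ranked[0].name)
print("least similar (Group 3 condition):", ranked[-1].name)
```

Under this assumed measure, the first element of the ranking corresponds to the voice a Group 2 subject would imitate and the last element to the voice a Group 3 subject would imitate.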
The study conducted by Probst et al. (2002) investigated the “golden speaker” from the pronunciation improvement perspective. There is no doubt about the importance of pronunciation improvements, since the ultimate goal of pronunciation learning is to improve pronunciation. However, pronunciation improvements can be influenced by many factors besides the acoustic features of the learning material, such as learners’ learning ability and their proficiency in the language that they are learning. These factors make it difficult to directly investigate the relationship between speech learning material and pronunciation improvements.
In this paper, we study the “golden speaker” from the learners’ imitation preference perspective. We investigate what voice features make a teacher’s voice preferable for language learners to imitate since learners’ preferred speech learning material may please them and increase their learning interests. As indicated by Arnett (1952), if a teacher speaks with a smooth, easy and pleasant voice, his/her students try to imitate his/her voice. Also, some learners may be more receptive to certain voices. For instance, as claimed by Jacob and Mythili (2008), children might be more receptive to their parents’ or teachers’ voices. A pleasant voice may also help to maintain a positive learning environment that plays an important role in a learning process.
In this paper, we focus on two voice features: speech rate and pitch-formants. In order to provide speech learning material with different voice features, the CAPT system CASTLE (Computer-Assisted Stress pattern Teaching and Learning Environment) is employed in our investigation. CASTLE (Lu et al., 2010), a system that we have recently developed, is intended to help learners of English as a Second Language (ESL) to improve their ability to use stress patterns correctly (both sentence stress and lexical stress). The learning material in CASTLE is in the form of sentences. To reduce the influence of the characteristics of teachers’ voices (e.g., voice quality and clarity), CASTLE uses a single teacher’s voice as the source to automatically resynthesize several sample voices based on a learner’s voice features (i.e., speech rate and pitch-formants) and the learner’s imitation preferences.
Our voice modification transfers the voice features of a learner’s voice to a teacher’s voice, unlike previous prosody conversions, which transfer the prosodic features of a teacher’s voice to a learner’s voice. Because our voice modification is based on a teacher’s voice, the resynthesized utterances can be free from segmental errors. Previous prosody conversions are normally based on a learner’s voice (e.g., in (Nagano and Ozawa, 1990, Sundström, 1998, Hirose, 2004, Bissiri and Pfitzinger, 2009)), which causes the resynthesized utterances to inevitably inherit the segmental errors (e.g., mispronounced phonemes) from the learner’s utterances. Compared with a teacher’s speech, a learner’s speech is more likely to have segmental errors.
Moreover, unlike the approach in (Probst et al., 2002), which needs to record multiple teachers’ voices in order to make the teachers’ voices cover a variety of prosodic features, our approach only needs to record one teacher’s voice. Based on the teacher’s voice, our CAPT system, CASTLE, can resynthesize multiple sample voices with different prosodies by voice modification. Compared with recording multiple teachers’ voices, providing multiple sample voices based on the voice modification reduces the workload of producing speech learning material and saves storage space in a computer. Also, the voice modification can resynthesize voices with any prosodic features that language learners may prefer. By investigating learners’ imitation preferences, CAPT systems can be developed to provide learners’ favorite voices, which may please the learners and promote their learning interests.
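The direction of this transfer (learner features applied to the teacher’s voice) can be made concrete with a minimal sketch of how modification factors might be derived. This is not the CASTLE algorithm, which is not specified in this section; the `VoiceFeatures` record, the `strength` parameter, and the use of a semitone shift and a duration-scaling factor are assumptions introduced for illustration, and formant modification is not modeled.

```python
from dataclasses import dataclass
from math import log2


@dataclass
class VoiceFeatures:
    """Hypothetical voice summary (illustrative fields only)."""
    mean_pitch_hz: float       # average fundamental frequency
    syllables_per_sec: float   # speech rate


def modification_factors(teacher: VoiceFeatures,
                         learner: VoiceFeatures,
                         strength: float = 1.0) -> tuple[float, float]:
    """Return (pitch shift in semitones, duration scale) to apply to the teacher's voice.

    strength = 0.0 leaves the teacher's voice unchanged; strength = 1.0 moves its
    mean pitch and speech rate all the way to the learner's values (assumed scheme).
    """
    # Positive shift raises the teacher's pitch towards the learner's.
    pitch_shift = strength * 12 * log2(learner.mean_pitch_hz / teacher.mean_pitch_hz)
    # Scale > 1 lengthens the utterance, i.e. slows the teacher down.
    duration_scale = (teacher.syllables_per_sec / learner.syllables_per_sec) ** strength
    return pitch_shift, duration_scale


# Made-up values: a low-pitched, fast teacher and a higher-pitched, slower learner.
teacher = VoiceFeatures(mean_pitch_hz=120.0, syllables_per_sec=5.0)
learner = VoiceFeatures(mean_pitch_hz=240.0, syllables_per_sec=4.0)
shift, scale = modification_factors(teacher, learner)
print(f"pitch shift: {shift:+.2f} semitones, duration scale: {scale:.2f}")
```

Intermediate values of the assumed `strength` parameter would give sample voices lying between the original teacher’s voice and a fully learner-matched voice, which is one way several candidate voices could be offered from a single recording.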
This paper is organized as follows. In Section 2, we present the voice modification techniques which were employed in our study to resynthesize sample voices with different voice features. Section 3 describes the setup of the experiments that we conducted to explore language learners’ imitation preferences. Experimental results and discussions are provided in Section 4. Section 5 concludes our present work and discusses our future work.
Section snippets
Voice modification
Based on a teacher’s voice, our CASTLE system resynthesizes sample voices with different voice features (i.e., speech rate and pitch-formants) by voice modification. In the following, we identify the teacher’s utterances as original teacher’s utterances, and identify the resynthesized utterances as individualized teacher’s utterances. The individualized teacher’s utterances are automatically resynthesized based on the original teacher’s utterances and learners’ preferences. Our voice
Setup of the experiments
The experiments investigate how the voice features (i.e., speech rate and pitch-formants) of teachers’ voices influence learners’ imitation preferences. We tested the following two hypotheses: (i) language learners prefer to imitate voices that sound as if they were produced by speakers of the same gender as themselves and that have pitches similar to their own voices; (ii) language learners prefer to imitate voices with speech rates close to those of their own voices. We expected that learners
Experimental results and discussion
The distributions of the utterances labeled by the subjects as the most and least wanted to be imitated are given in Fig. 2(a). Since, for the three types of resynthesized individualized teacher’s utterances of each sentence, a learner could label none, one or more than one utterance as the most (or least) wanted speech, a total of 146 utterances were labeled by the subjects as the most wanted to be imitated, and 141 utterances as the least wanted to be imitated. Among the utterances
Conclusions and future work
In this paper, we have investigated what voice features (i.e., speech rate and pitch-formants) make a teacher’s voice a “golden voice” that is preferable for a language learner to imitate.
Our approach to searching for the “golden voice” is different from that of the study conducted by Probst et al. (2002). Probst et al. investigated the “golden voice” from the perspective of learners’ pronunciation improvement, while we investigated the “golden voice” from the perspective of learners’ imitation preferences. Providing
References (27)
- Bissiri and Pfitzinger, 2009. Italian speakers learn lexical stress of German morphologically complex words. Speech Comm.
- Ding, 2007. Text memorization and imitation: The practices of successful Chinese learners of English. System.
- Eskenazi, 2009. An overview of spoken language technology for education. Speech Comm.
- Felps et al., 2009. Foreign accent conversion in computer assisted pronunciation training. Speech Comm.
- et al., 1990. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Comm.
- Probst et al., 2002. Enhancing foreign language tutors—in search of the golden speaker. Speech Comm.
- Arnett, 1952. Does the elementary teacher have time to teach speech? J. Southern States Comm. Assoc.
- Black, A., 2007. Speech synthesis for educational technology. In: Proc. ISCA ITRW SLaTE Workshop on Speech and Language...
- Boersma, P., Weenink, D., 2009. Praat: doing phonetics by computer (Version 5.1.05). <http://www.praat.org/> (retrieved...
- et al., 2007. An Introduction to Phonetics and Phonology.
- Derwing, 2003. What do ESL students say about their accents? Can. Mod. Lang. Rev.
- Dyck, 2002. Review of Tsi Karhakta: At the edge of the woods. Lang. Learn. Technol.