JustSpeak: Automated, User-Configurable, Interactive Agents for Speech Tutoring

Abstract
Conversational agents are widely used in many settings, including speech tutoring. However, their content and functions are typically pre-defined and cannot be customized by people without technical backgrounds, which significantly limits their flexibility and usability. In addition, conventional agents often cannot provide feedback in the middle of a training session because they lack technical means to evaluate users' speech dynamically. We propose JustSpeak: automated, interactive speech tutoring agents with configurable feedback mechanisms that use any speech recording, together with its transcript, as the template for speech training. In JustSpeak, we developed an automated procedure to generate customized tutoring agents from user-supplied templates. Moreover, we created a set of methods to dynamically synchronize a speech recognizer's behavior with the agent's tutoring progress, making it possible to detect various speech mistakes on the fly, such as getting stuck, mispronunciation, and rhythm deviations. Furthermore, we identified the design primitives in JustSpeak for creating novel feedback mechanisms, such as adaptive playback, follow-on training, and passive adaptation. These can be combined into customized tutoring agents, which we demonstrate with an example for language learning. We believe JustSpeak can create more personalized speech learning opportunities by enabling tutoring agents that are customizable, always available, and easy to use.
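The abstract's core technical idea is synchronizing a speech recognizer's behavior with the agent's tutoring progress so mistakes can be flagged mid-session. The minimal sketch below illustrates that idea only in outline; the names (`TutoringState`, `update`), the word-level matching, and the timeout-based "stuck" detection are our illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class TutoringState:
    """Tracks the learner's progress through the template transcript."""
    expected_words: list               # words of the template transcript
    position: int = 0                  # index of the next word to be spoken
    last_progress_time: float = 0.0    # time of the last successful match

def update(state, recognized_word, now, stuck_timeout=3.0):
    """Advance the tutoring state with one recognizer result.

    Returns a feedback label: 'advance' when the learner matched the next
    expected word, 'mispronunciation' on a mismatch, 'stuck' when no
    progress occurred within `stuck_timeout` seconds, or 'waiting'
    otherwise. (A hypothetical simplification of dynamic feedback.)
    """
    if recognized_word is None:  # recognizer produced nothing yet
        if now - state.last_progress_time > stuck_timeout:
            return "stuck"
        return "waiting"
    if (state.position < len(state.expected_words)
            and recognized_word == state.expected_words[state.position]):
        state.position += 1
        state.last_progress_time = now
        return "advance"
    return "mispronunciation"
```

In this sketch the recognizer's "expected" vocabulary narrows to the word at `state.position`, which is the sense in which recognition is synchronized with tutoring progress; rhythm deviation could be detected analogously by comparing inter-word timestamps against the template recording.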