
JustSpeak: Automated, User-Configurable, Interactive Agents for Speech Tutoring

Published: 29 May 2021

Abstract

Conversational agents are widely used in many settings, including speech tutoring. However, their content and functions are often pre-defined and not customizable by people without technical backgrounds, which significantly limits their flexibility and usability. In addition, conventional agents often cannot provide feedback in the middle of a training session because they lack the technical means to evaluate users' speech dynamically. We propose JustSpeak: automated, interactive speech tutoring agents with configurable feedback mechanisms that use any speech recording, together with its transcription, as the template for speech training. In JustSpeak, we developed an automated procedure that generates customized tutoring agents from user-provided templates. Moreover, we created a set of methods to dynamically synchronize a speech recognizer's behavior with the agent's tutoring progress, making it possible to detect speech mistakes such as being stuck, mispronunciations, and rhythm deviations as they occur. Furthermore, we identified design primitives in JustSpeak for creating novel feedback mechanisms, such as adaptive playback, follow-on training, and passive adaptation. These primitives can be combined into customized tutoring agents, which we demonstrate with an example for language learning. We believe JustSpeak can create more personalized speech learning opportunities by enabling tutoring agents that are customizable, always available, and easy to use.
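The abstract's central technical idea, keeping the recognizer in step with the agent's tutoring progress so that mistakes can be flagged mid-session, can be illustrated with a minimal sketch. Everything below (the Segment/Attempt types, the thresholds, and the listen callback) is a hypothetical interface invented for illustration, not the paper's actual implementation:

```python
"""Minimal sketch of JustSpeak-style tutoring-progress synchronization.

Assumptions (not from the paper): a template is a list of Segments
derived from the recording and its transcription; `listen(segment)`
stands in for a recognizer call whose expected vocabulary has been
restricted to the current segment's words.
"""

from dataclasses import dataclass
from typing import Optional


@dataclass
class Segment:
    text: str        # expected words from the template transcription
    duration: float  # seconds this span occupies in the template recording


@dataclass
class Attempt:
    words: Optional[str]  # recognized words, or None if nothing was heard
    elapsed: float        # seconds the learner took on this segment


STUCK_TIMEOUT = 4.0      # assumed: silence longer than this counts as "stuck"
RHYTHM_TOLERANCE = 0.35  # assumed: +/-35% of the template duration is acceptable


def classify(segment: Segment, attempt: Attempt) -> str:
    """Map one learner attempt onto a feedback category."""
    if attempt.words is None and attempt.elapsed >= STUCK_TIMEOUT:
        return "stuck"                    # no speech before the timeout
    if attempt.words is None or attempt.words.lower() != segment.text.lower():
        return "mispronunciation"         # recognizer heard something else
    ratio = attempt.elapsed / segment.duration
    if abs(ratio - 1.0) > RHYTHM_TOLERANCE:
        return "rhythm_deviation"         # right words, wrong pacing
    return "ok"


def tutor(segments, listen):
    """Advance through the template, re-listening until a segment passes."""
    for i, seg in enumerate(segments):
        while True:
            result = classify(seg, listen(seg))
            print(f"segment {i} ({seg.text!r}): {result}")
            if result == "ok":
                break  # a feedback primitive (e.g. adaptive playback) would act here


if __name__ == "__main__":
    # Canned attempts in place of a live recognizer, for demonstration.
    demo = [Segment("hello there", 1.2), Segment("how are you", 1.5)]
    canned = iter([Attempt(None, 5.0), Attempt("hello there", 1.1),
                   Attempt("how are you", 3.0), Attempt("how are you", 1.6)])
    tutor(demo, lambda seg: next(canned))
```

Narrowing what the recognizer listens for to the current template segment is what makes mid-session feedback possible at all; in a full agent, the feedback primitives named above (adaptive playback, follow-on training, passive adaptation) would decide what happens whenever `classify` returns something other than "ok".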


Supplemental Material

v5eics202vf.mp4 (MP4, 42.8 MB)



Published in

Proceedings of the ACM on Human-Computer Interaction, Volume 5, Issue EICS (June 2021), 546 pages
EISSN: 2573-0142
DOI: 10.1145/3468527
Copyright © 2021 ACM


      Publisher

      Association for Computing Machinery

      New York, NY, United States

