JustSpeak: Automated, User-Configurable, Interactive Agents for Speech Tutoring

Abstract
Conversational agents are widely used in many settings, including speech tutoring. However, their content and functions are typically pre-defined and cannot be customized by people without technical backgrounds, which significantly limits their flexibility and usability. In addition, conventional agents often cannot provide feedback in the middle of a training session because they lack technical means to evaluate users' speech dynamically. We propose JustSpeak: automated, interactive speech tutoring agents with configurable feedback mechanisms that use any speech recording, together with its transcript, as the template for speech training. In JustSpeak, we developed an automated procedure to generate customized tutoring agents from user-supplied templates. Moreover, we created a set of methods to dynamically synchronize a speech recognizer's behavior with the agent's tutoring progress, making it possible to detect various speech mistakes on the fly, such as getting stuck, mispronunciation, and rhythm deviations. Furthermore, we identified the design primitives in JustSpeak for creating novel feedback mechanisms, such as adaptive playback, follow-on training, and passive adaptation. These can be combined into customized tutoring agents, which we demonstrate with an example for language learning. We believe JustSpeak can create more personalized speech learning opportunities by enabling tutoring agents that are customizable, always available, and easy to use.
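The abstract's core technical idea is synchronizing a speech recognizer's behavior with the agent's tutoring progress so mistakes can be flagged mid-session. The minimal sketch below illustrates that idea only in outline; the names (`TutoringState`, `update`), the word-level matching, and the timeout-based "stuck" detection are our illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class TutoringState:
    """Tracks the learner's progress through the template transcript."""
    expected_words: list               # words of the template transcript
    position: int = 0                  # index of the next word to be spoken
    last_progress_time: float = 0.0    # time of the last successful match

def update(state, recognized_word, now, stuck_timeout=3.0):
    """Advance the tutoring state with one recognizer result.

    Returns a feedback label: 'advance' when the learner matched the next
    expected word, 'mispronunciation' on a mismatch, 'stuck' when no
    progress occurred within `stuck_timeout` seconds, or 'waiting'
    otherwise. (A hypothetical simplification of dynamic feedback.)
    """
    if recognized_word is None:  # recognizer produced nothing yet
        if now - state.last_progress_time > stuck_timeout:
            return "stuck"
        return "waiting"
    if (state.position < len(state.expected_words)
            and recognized_word == state.expected_words[state.position]):
        state.position += 1
        state.last_progress_time = now
        return "advance"
    return "mispronunciation"
```

In this sketch the recognizer's "expected" vocabulary narrows to the word at `state.position`, which is the sense in which recognition is synchronized with tutoring progress; rhythm deviation could be detected analogously by comparing inter-word timestamps against the template recording.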