Abstract
Appropriate turn-taking is an important issue in spoken dialogue systems. In systems designed for quick responses in particular, short pauses within a user utterance often cause voice activity detection (VAD) to segment it incorrectly. Incorrectly segmented utterances cause problems both in automatic speech recognition (ASR) and in turn-taking: an incorrect VAD result leads to ASR errors, and the system may start responding while the user is still speaking. The problems worsen when the segmentation falls in the middle of a keyword such as a POI name, because ASR results for such fragments are unreliable. We have developed a method that alleviates these problems and have implemented it as a plug-in for the open-source MMDAgent software. The segmented utterances are integrated and interpreted as a single unit, and an erroneously started system utterance is terminated by adding new states to the finite state transducer that controls the system's dialogue management. Evaluation results showed that this method improved utterance interpretation accuracy.
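The integration step described in the abstract can be sketched as follows. This is a minimal illustration, not the chapter's actual plug-in: the class name, the pause-length threshold, and the timestamp-based interface are all assumptions introduced here for clarity. Fragments separated by a pause shorter than the threshold are buffered and merged so they can be interpreted as one utterance; a longer pause is treated as a real utterance boundary.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FragmentMerger:
    """Buffers VAD-segmented fragments and merges those separated by
    short pauses, so they can be interpreted as a single utterance.
    (Hypothetical sketch; names and threshold are illustrative.)"""
    max_pause: float = 0.8                    # seconds; shorter pauses merge
    _fragments: List[str] = field(default_factory=list)
    _last_end: Optional[float] = None

    def feed(self, text: str, start: float, end: float) -> Optional[str]:
        """Feed one recognized fragment with its start/end times.
        Returns a merged utterance when a real boundary is detected,
        otherwise None while fragments are still accumulating."""
        merged = None
        if self._last_end is not None and start - self._last_end >= self.max_pause:
            # Pause was long enough: close off the buffered utterance.
            merged = self.flush()
        self._fragments.append(text)
        self._last_end = end
        return merged

    def flush(self) -> Optional[str]:
        """Force out whatever is buffered (e.g. at end of dialogue turn)."""
        merged = " ".join(self._fragments)
        self._fragments = []
        return merged or None
```

A fragment pair such as "go to" / "Kyoto station" separated by a 0.2-second pause would thus be restored to one unit before interpretation, instead of being decoded as two unreliable keyword fragments.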
Notes
1.
2. Delay occurs for the duration of the combined wav file itself (a few seconds). This is because, in the current implementation of our plug-in, the combination and ASR processes start only after the second fragment is obtained.
3. The -rejectshort option of the Julius recognizer was used for this purpose.
4. An example of this case is shown in Fig. 2.
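Note 2 above refers to combining the recorded wav fragments before re-running ASR on the restored utterance. A minimal sketch of such a combination step is shown below, assuming the fragments are plain PCM wav files with identical formats; the function name and interface are hypothetical, not the plug-in's actual API.

```python
import wave


def concat_wavs(fragment_paths, out_path):
    """Concatenate VAD-segmented wav fragments into a single file so the
    recognizer can decode the restored utterance as a whole.
    Assumes all fragments share the same sample rate, width, and channels."""
    params = None
    frames = []
    for path in fragment_paths:
        with wave.open(path, "rb") as w:
            if params is None:
                params = w.getparams()  # reuse the first fragment's format
            frames.append(w.readframes(w.getnframes()))
    with wave.open(out_path, "wb") as out:
        out.setparams(params)
        for chunk in frames:
            out.writeframes(chunk)
```

Because the combined file must exist before recognition can start, decoding it from the beginning introduces the delay noted above, roughly the duration of the combined audio itself.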
Acknowledgments
This research was partly supported by the JST PRESTO Program and the Naito Science & Engineering Foundation.
Copyright information
© 2016 Springer International Publishing Switzerland
Cite this chapter
Komatani, K., Hotta, N., Sato, S. (2016). Restoring Incorrectly Segmented Keywords and Turn-Taking Caused by Short Pauses. In: Rudnicky, A., Raux, A., Lane, I., Misu, T. (eds) Situated Dialog in Speech-Based Human-Computer Interaction. Signals and Communication Technology. Springer, Cham. https://doi.org/10.1007/978-3-319-21834-2_18
Print ISBN: 978-3-319-21833-5
Online ISBN: 978-3-319-21834-2