Restoring Incorrectly Segmented Keywords and Turn-Taking Caused by Short Pauses

Komatani, Kazunori; Hotta, Naoki; Sato, Satoshi

doi:10.1007/978-3-319-21834-2_18

Chapter
First Online: 01 January 2016

Part of the book series: Signals and Communication Technology (SCT)

Abstract

Appropriate turn-taking is an important issue in spoken dialogue systems. In systems designed to respond quickly, voice activity detection (VAD) often segments a user utterance incorrectly because of short pauses within it. Incorrectly segmented utterances cause problems for both automatic speech recognition (ASR) and turn-taking: the incorrect VAD result leads to ASR errors and causes the system to start responding while the user is still speaking. The problems become worse when the interruption falls in the middle of a keyword such as a point-of-interest (POI) name, because ASR results are unreliable for such fragments. We have developed a method that alleviates these problems and implemented it as a plug-in for the open-source MMDAgent software. The segmented utterances are integrated and interpreted as a unit. An erroneously started system utterance is terminated by adding new states to the finite-state transducer that controls the system’s dialogue management. Evaluation results showed that the method improved utterance interpretation accuracy.
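
The two ideas in the abstract (integrating fragments split by a short pause, and terminating a system utterance that started too early) can be illustrated with a minimal sketch. This is not the authors' plug-in or MMDAgent's API: the class and field names, the event strings, and the 0.8-second threshold are all hypothetical, and fragment texts are simply concatenated as a stand-in for re-running ASR on the combined audio.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical threshold: a gap shorter than this is treated as a pause
# inside one utterance rather than as the end of the user's turn.
SHORT_PAUSE_SEC = 0.8


@dataclass
class Fragment:
    start: float  # start time in seconds
    end: float    # end time in seconds
    text: str     # recognition result for this fragment alone


@dataclass
class TurnManager:
    pending: List[Fragment] = field(default_factory=list)
    system_speaking: bool = False
    log: List[str] = field(default_factory=list)

    def on_fragment(self, frag: Fragment) -> None:
        """Handle one VAD segment and decide how the system should react."""
        follows_short_pause = (
            bool(self.pending)
            and frag.start - self.pending[-1].end < SHORT_PAUSE_SEC
        )
        if follows_short_pause:
            # The previous segment was cut off too early: stop the response
            # that was started for it and extend the pending utterance.
            if self.system_speaking:
                self.system_speaking = False
                self.log.append("STOP_SYSTEM_UTTERANCE")
            self.pending.append(frag)
        else:
            # A long gap means a new utterance begins here.
            self.pending = [frag]
        self.system_speaking = True
        self.log.append("START_SYSTEM_UTTERANCE:" + self.interpret())

    def interpret(self) -> str:
        """Interpret all pending fragments as a single unit (simplified)."""
        return "".join(f.text for f in self.pending)


if __name__ == "__main__":
    tm = TurnManager()
    tm.on_fragment(Fragment(0.0, 1.0, "Nagoya "))
    tm.on_fragment(Fragment(1.3, 2.0, "Castle"))
    for event in tm.log:
        print(event)
```

In this sketch the second fragment arrives 0.3 seconds after the first, so the response started for the first fragment is stopped and a new one is started for the integrated utterance.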

Notes

1. http://www.mmdagent.jp/
2. Delay occurs for the duration of the combined wav file itself (a few seconds). This is because the combination and ASR processes start only after the second fragment is obtained in the current implementation of our plug-in.
3. The option -rejectshort of Julius was used for this purpose.
4. An example of this case is shown in Fig. 2.
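
Note 2 mentions that fragments are combined into a single wav file before ASR is run again. A minimal sketch of what such a combination step might look like, using Python's standard `wave` module, is given here. It is an illustration only, not the plug-in's code: the function names `combine_wav`, `duration_sec`, and `make_silence` are hypothetical, and the sketch assumes both fragments share the same channel count, sample width, and sampling rate.

```python
import io
import wave


def make_silence(n_frames: int, framerate: int = 16000) -> bytes:
    """Build a mono 16-bit WAV of silence (helper for the example only)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(framerate)
        w.writeframes(b"\x00\x00" * n_frames)
    return buf.getvalue()


def combine_wav(first: bytes, second: bytes) -> bytes:
    """Concatenate two WAV byte strings that have the same audio format."""
    with wave.open(io.BytesIO(first), "rb") as a, \
            wave.open(io.BytesIO(second), "rb") as b:
        pa, pb = a.getparams(), b.getparams()
        # Compare channel count, sample width, and sampling rate.
        if pa[:3] != pb[:3]:
            raise ValueError("fragments have different audio formats")
        frames = a.readframes(a.getnframes()) + b.readframes(b.getnframes())
    out = io.BytesIO()
    with wave.open(out, "wb") as w:
        w.setnchannels(pa.nchannels)
        w.setsampwidth(pa.sampwidth)
        w.setframerate(pa.framerate)
        w.writeframes(frames)
    return out.getvalue()


def duration_sec(data: bytes) -> float:
    """Return the playback duration of a WAV byte string in seconds."""
    with wave.open(io.BytesIO(data), "rb") as w:
        return w.getnframes() / w.getframerate()


if __name__ == "__main__":
    combined = combine_wav(make_silence(16000), make_silence(8000))
    print(duration_sec(combined))
```

Because recognition of the combined audio can begin only once the second fragment is available, the length returned by `duration_sec` gives a rough sense of the delay that Note 2 describes.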

Acknowledgments

This research was partly supported by the JST PRESTO Program and the Naito Science & Engineering Foundation.

Author information

Authors and Affiliations

The Institute of Scientific and Industrial Research, Osaka University, 8-1 Mihogaoka, Ibaraki, Osaka, 567-0047, Japan
Kazunori Komatani
Graduate School of Engineering, Nagoya University, Furo-cho C3-1(631), Chikusa-ku, Nagoya, Aichi, 464-8603, Japan
Naoki Hotta & Satoshi Sato

Corresponding author

Correspondence to Kazunori Komatani.

Editor information

Editors and Affiliations

School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
Alexander Rudnicky
Cupertino, California, USA
Antoine Raux
Silicon Valley, Carnegie Mellon University, Moffett Field, California, USA
Ian Lane
Mountain View, California, USA
Teruhisa Misu

About this chapter

Cite this chapter

Komatani, K., Hotta, N., Sato, S. (2016). Restoring Incorrectly Segmented Keywords and Turn-Taking Caused by Short Pauses. In: Rudnicky, A., Raux, A., Lane, I., Misu, T. (eds) Situated Dialog in Speech-Based Human-Computer Interaction. Signals and Communication Technology. Springer, Cham. https://doi.org/10.1007/978-3-319-21834-2_18

DOI: https://doi.org/10.1007/978-3-319-21834-2_18
Published: 21 April 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-21833-5
Online ISBN: 978-3-319-21834-2
eBook Packages: Engineering, Engineering (R0)
