Restoring Incorrectly Segmented Keywords and Turn-Taking Caused by Short Pauses

Part of the book series: Signals and Communication Technology (SCT)

Abstract

Appropriate turn-taking is an important issue in spoken dialogue systems. Especially in systems designed to respond quickly, a user utterance is often incorrectly segmented by voice activity detection (VAD) because of short pauses within it. Incorrectly segmented utterances cause problems in both the automatic speech recognition (ASR) results and turn-taking: an incorrect VAD result leads to ASR errors and causes the system to start responding while the user is still speaking. The problems get worse when the interruption falls in the middle of a keyword such as a point-of-interest (POI) name, because ASR results are unreliable for such fragments. We have developed a method that alleviates these problems and have implemented it as a plug-in for the MMDAgent open-source software. The segmented utterances are integrated and interpreted as a unit. An erroneously started system utterance is terminated by adding new states to the finite state transducer that controls the system’s dialogue management. Evaluation results showed that this method improved utterance interpretation accuracy.
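The two ideas in the abstract, integrating fragments split by a short pause and cancelling a system utterance that started too early, can be illustrated with a minimal sketch. This is not the authors' plug-in and does not use the MMDAgent or Julius APIs: the `Segment` type, the `MAX_PAUSE_SEC` threshold, the event names, and the toy state table are all assumptions made for illustration, and a plain finite state machine stands in for the finite state transducer.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical threshold: a gap at or below this is treated as a pause
# inside one utterance. The value is an assumption, not from the chapter.
MAX_PAUSE_SEC = 0.8


@dataclass
class Segment:
    """One VAD segment (hypothetical type): times in seconds plus raw samples."""
    start: float
    end: float
    samples: List[int]


def merge_segments(segments: List[Segment],
                   max_pause: float = MAX_PAUSE_SEC) -> List[Segment]:
    """Join consecutive segments whose gap is at most max_pause."""
    merged: List[Segment] = []
    for seg in sorted(segments, key=lambda s: s.start):
        if merged and seg.start - merged[-1].end <= max_pause:
            prev = merged[-1]
            # Concatenate audio so the fragments can be recognized as a unit.
            merged[-1] = Segment(prev.start, seg.end, prev.samples + seg.samples)
        else:
            merged.append(Segment(seg.start, seg.end, list(seg.samples)))
    return merged


# Toy state table (assumed event names). The second entry plays the role of
# an added state transition: it aborts a response when the user resumes.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("LISTENING", "segment_end"): "RESPONDING",
    ("RESPONDING", "segment_start"): "LISTENING",
    ("RESPONDING", "response_done"): "LISTENING",
}


def next_state(state: str, event: str) -> str:
    """Return the next dialogue state; unknown events leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


fragments = [
    Segment(0.0, 0.6, [1, 2]),
    Segment(1.0, 1.5, [3]),   # 0.4 s after the first fragment: same utterance
    Segment(4.0, 5.0, [4]),   # 2.5 s later: a separate utterance
]
utterances = merge_segments(fragments)
```

In this sketch the first two fragments are merged into one utterance before interpretation, and a `segment_start` event arriving while the system is in `RESPONDING` returns it to `LISTENING`, which models terminating the erroneously started system utterance.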

Notes

  1. http://www.mmdagent.jp/.

  2. Delay occurs for the duration of the combined wav file itself (a few seconds). This is because the combination and ASR processes start only after the second fragment is obtained in the current implementation of our plug-in.

  3. The option -rejectshort of Julius was used for this purpose.

  4. An example of this case is shown in Fig. 2.

Acknowledgments

This research was partly supported by the JST PRESTO Program and the Naito Science & Engineering Foundation.

Author information

Corresponding author

Correspondence to Kazunori Komatani.

Copyright information

© 2016 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Komatani, K., Hotta, N., Sato, S. (2016). Restoring Incorrectly Segmented Keywords and Turn-Taking Caused by Short Pauses. In: Rudnicky, A., Raux, A., Lane, I., Misu, T. (eds) Situated Dialog in Speech-Based Human-Computer Interaction. Signals and Communication Technology. Springer, Cham. https://doi.org/10.1007/978-3-319-21834-2_18

  • DOI: https://doi.org/10.1007/978-3-319-21834-2_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-21833-5

  • Online ISBN: 978-3-319-21834-2

  • eBook Packages: Engineering, Engineering (R0)
