
Recognising Conversational Speech: What an Incremental ASR Should Do for a Dialogue System and How to Get There

Chapter in Dialogues with Social Robots

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 427)

Abstract

Automatic speech recognition (ASR) is not only becoming increasingly accurate, but also increasingly adapted to producing timely, incremental output. However, overall accuracy and timeliness alone are insufficient for interactive dialogue systems, which require stability in the output and responsivity to the utterance as it unfolds. Furthermore, for a dialogue system to deal with phenomena such as disfluencies and to achieve deep understanding of user utterances, these phenomena should be preserved or marked up for use by downstream components, such as language understanding, rather than filtered out. Similarly, word timing can be informative for analysing deictic expressions in a situated environment and should be available for analysis. Here we investigate the overall accuracy and incremental performance of three widely used systems and discuss their suitability from these perspectives. From their differing performance along these measures we derive a picture of the requirements for incremental ASR in dialogue systems and describe freely available tools for using and evaluating incremental ASR.
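The stability requirement mentioned above is commonly quantified in the incremental-ASR literature as "edit overhead": the fraction of hypothesis edits (word additions and revocations) that turn out to be unnecessary, i.e. beyond the edits needed to build the final hypothesis once. The sketch below is a simplified illustration of that idea, not code from the chapter or its tools; the function names and the toy hypothesis sequence are invented for illustration.

```python
def edit_ops(prev, curr):
    """Count add/revoke operations turning one word hypothesis into the next.

    Words after the longest common prefix are revoked and re-added.
    """
    i = 0
    while i < len(prev) and i < len(curr) and prev[i] == curr[i]:
        i += 1
    revokes = len(prev) - i
    adds = len(curr) - i
    return revokes + adds


def edit_overhead(hypotheses):
    """Fraction of unnecessary edits over a sequence of incremental hypotheses.

    An ideally stable recogniser adds each word of the final hypothesis
    exactly once, so the necessary number of edits is the final length.
    """
    total = 0
    prev = []
    for hyp in hypotheses:
        total += edit_ops(prev, hyp)
        prev = hyp
    necessary = len(hypotheses[-1]) if hypotheses else 0
    return (total - necessary) / total if total else 0.0


# Toy incremental output: "the" is later revised to "that".
hyps = [["take"], ["take", "the"], ["take", "that"], ["take", "that", "ball"]]
print(edit_overhead(hyps))  # → 0.4 (5 edits in total, only 3 were necessary)
```

A recogniser that never revises its output has an edit overhead of 0; frequent revisions push the value towards 1, which is exactly the instability that burdens downstream dialogue components.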


Notes

  1. http://bitbucket.org/inpro/inprotk.

  2. http://bitbucket.org/inpro/intelida.

  3. In building our models we tried reasonably hard to achieve good performance, but we did not strive for the best possible performance by using as much material (whether in-domain or not) as we could get, e.g., by blending our LMs with Wikipedia, or the like.


Acknowledgements

This work is supported by a Daimler and Benz Foundation PostDoc Grant to the first author, by the BMBF KogniHome project, DFG DUEL project (grant SCHL 845/5-1) and the Cluster of Excellence Cognitive Interaction Technology ‘CITEC’ (EXC 277) at Bielefeld University.

Author information

Correspondence to Timo Baumann.


Copyright information

© 2017 Springer Science+Business Media Singapore

Cite this chapter

Baumann, T., Kennington, C., Hough, J., Schlangen, D. (2017). Recognising Conversational Speech: What an Incremental ASR Should Do for a Dialogue System and How to Get There. In: Jokinen, K., Wilcock, G. (eds) Dialogues with Social Robots. Lecture Notes in Electrical Engineering, vol 427. Springer, Singapore. https://doi.org/10.1007/978-981-10-2585-3_35


  • DOI: https://doi.org/10.1007/978-981-10-2585-3_35

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-10-2584-6

  • Online ISBN: 978-981-10-2585-3

  • eBook Packages: Engineering (R0)
