Abstract
Automatic speech recognition (ASR) is not only becoming increasingly accurate, but also increasingly adapted to producing timely, incremental output. However, overall accuracy and timeliness alone are insufficient for interactive dialogue systems, which require stability in the output and responsiveness to the utterance as it unfolds. Furthermore, for a dialogue system to achieve deep understanding of user utterances and to deal with phenomena such as disfluencies, these should be preserved or marked up for use by downstream components, such as language understanding, rather than filtered out. Similarly, word timing can be informative for analysing deictic expressions in a situated environment and should be available for analysis. Here we investigate the overall accuracy and incremental performance of three widely used systems and discuss their suitability from the aforementioned perspectives. From their differing performance along these measures we draw a picture of the requirements for incremental ASR in dialogue systems and describe freely available tools for using and evaluating incremental ASR.
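To make the notion of output stability concrete, the following is a minimal sketch (not the chapter's actual evaluation code) of one common stability metric for incremental ASR, edit overhead: the fraction of edits to the growing hypothesis that turn out to be unnecessary, i.e. additions and revocations beyond the one addition per word that the final result requires. The function names and the simple prefix-diff model of edits are illustrative assumptions.

```python
def edit_ops(prev, curr):
    """Count add/revoke operations between two successive word hypotheses.

    The shared prefix is kept; every remaining word of `prev` is revoked
    and every remaining word of `curr` is added.
    """
    i = 0
    while i < min(len(prev), len(curr)) and prev[i] == curr[i]:
        i += 1
    return (len(prev) - i) + (len(curr) - i)


def edit_overhead(hypotheses):
    """Fraction of unnecessary edits over a sequence of incremental hypotheses.

    A perfectly stable recogniser only ever appends words, so it needs
    exactly one edit per word of the final hypothesis; everything beyond
    that is overhead.
    """
    total = 0
    prev = []
    for hyp in hypotheses:
        total += edit_ops(prev, hyp)
        prev = hyp
    needed = len(hypotheses[-1])  # each word of the final result added once
    return 0.0 if total == 0 else (total - needed) / total


# A recogniser that briefly mis-hears "two" as "too" pays for the flip-flop:
hyps = [["one"],
        ["one", "two"],
        ["one", "too"],
        ["one", "two"],
        ["one", "two", "three"]]
print(edit_overhead(hyps))  # 4 of 7 edits were unnecessary: ~0.571
```

Under this formulation, a recogniser that never revises its partial output scores 0, while frequent flip-flopping drives the score towards 1; timing-based measures (how soon a word first appears, and when it becomes final) complement it.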
Notes
We tried reasonably hard to build well-performing models, but we did not strive for the best possible performance by using all the material (whether in-domain or not) we could get, e.g., by blending our LMs with Wikipedia, or the like.
Acknowledgements
This work is supported by a Daimler and Benz Foundation PostDoc Grant to the first author, by the BMBF KogniHome project, the DFG DUEL project (grant SCHL 845/5-1), and the Cluster of Excellence Cognitive Interaction Technology ‘CITEC’ (EXC 277) at Bielefeld University.
Copyright information
© 2017 Springer Science+Business Media Singapore
Cite this chapter
Baumann, T., Kennington, C., Hough, J., Schlangen, D. (2017). Recognising Conversational Speech: What an Incremental ASR Should Do for a Dialogue System and How to Get There. In: Jokinen, K., Wilcock, G. (eds) Dialogues with Social Robots. Lecture Notes in Electrical Engineering, vol 427. Springer, Singapore. https://doi.org/10.1007/978-981-10-2585-3_35
DOI: https://doi.org/10.1007/978-981-10-2585-3_35
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-2584-6
Online ISBN: 978-981-10-2585-3