Abstract
Automatic speech recognition (ASR) is not only becoming increasingly accurate, but also increasingly adapted to producing timely, incremental output. However, overall accuracy and timeliness alone are insufficient for interactive dialogue systems, which require stability in the output and responsiveness to the utterance as it unfolds. Furthermore, for a dialogue system to achieve deep understanding of user utterances and to deal with phenomena such as disfluencies, these should be preserved or marked up for use by downstream components, such as language understanding, rather than filtered out. Similarly, word timing can be informative for analysing deictic expressions in a situated environment and should be available for analysis. Here we investigate the overall accuracy and incremental performance of three widely used systems and discuss their suitability from the aforementioned perspectives. From their differing performance along these measures we draw a picture of the requirements for incremental ASR in dialogue systems and describe freely available tools for using and evaluating incremental ASR.
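To make the notion of output stability concrete, the following is a minimal sketch (not the chapter's actual evaluation code) of one common stability metric for incremental ASR, edit overhead: the fraction of edits to the growing hypothesis that turn out to be unnecessary, i.e. additions and revocations beyond the one addition per word that the final result requires. The function names and the simple prefix-diff model of edits are illustrative assumptions.

```python
def edit_ops(prev, curr):
    """Count add/revoke operations between two successive word hypotheses.

    The shared prefix is kept; every remaining word of `prev` is revoked
    and every remaining word of `curr` is added.
    """
    i = 0
    while i < min(len(prev), len(curr)) and prev[i] == curr[i]:
        i += 1
    return (len(prev) - i) + (len(curr) - i)


def edit_overhead(hypotheses):
    """Fraction of unnecessary edits over a sequence of incremental hypotheses.

    A perfectly stable recogniser only ever appends words, so it needs
    exactly one edit per word of the final hypothesis; everything beyond
    that is overhead.
    """
    total = 0
    prev = []
    for hyp in hypotheses:
        total += edit_ops(prev, hyp)
        prev = hyp
    needed = len(hypotheses[-1])  # each word of the final result added once
    return 0.0 if total == 0 else (total - needed) / total


# A recogniser that briefly mis-hears "two" as "too" pays for the flip-flop:
hyps = [["one"],
        ["one", "two"],
        ["one", "too"],
        ["one", "two"],
        ["one", "two", "three"]]
print(edit_overhead(hyps))  # 4 of 7 edits were unnecessary: ~0.571
```

Under this formulation, a recogniser that never revises its partial output scores 0, while frequent flip-flopping drives the score towards 1; timing-based measures (how soon a word first appears, and when it becomes final) complement it.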
Notes
We tried reasonably hard to build well-performing models, but we did not strive for the best possible performance by using all the material (whether in-domain or not) we could get, e.g., by blending our LMs with Wikipedia, or the like.
Acknowledgements
This work is supported by a Daimler and Benz Foundation PostDoc Grant to the first author, by the BMBF KogniHome project, the DFG DUEL project (grant SCHL 845/5-1), and the Cluster of Excellence Cognitive Interaction Technology ‘CITEC’ (EXC 277) at Bielefeld University.
Copyright information
© 2017 Springer Science+Business Media Singapore
Cite this chapter
Baumann, T., Kennington, C., Hough, J., Schlangen, D. (2017). Recognising Conversational Speech: What an Incremental ASR Should Do for a Dialogue System and How to Get There. In: Jokinen, K., Wilcock, G. (eds) Dialogues with Social Robots. Lecture Notes in Electrical Engineering, vol 427. Springer, Singapore. https://doi.org/10.1007/978-981-10-2585-3_35
DOI: https://doi.org/10.1007/978-981-10-2585-3_35
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-2584-6
Online ISBN: 978-981-10-2585-3