A Salience-Driven Approach to Speech Recognition for Human-Robot Interaction

  • Conference paper

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 6211))

Abstract

We present an implemented model for speech recognition in natural environments which relies on contextual information about salient entities to prime utterance recognition. The hypothesis underlying our approach is that, in situated human-robot interaction, speech recognition performance can be significantly enhanced by exploiting knowledge about the immediate physical environment and the dialogue history. To this end, visual salience (objects perceived in the physical scene) and linguistic salience (previously referred-to objects within the current dialogue) are integrated into a single cross-modal salience model. The model is dynamically updated as the environment evolves, and is used to establish expectations about which uttered words are most likely to be heard given the context. The update is realised by continuously adapting the word-class probabilities specified in the statistical language model. The present article discusses the motivations behind our approach, describes our implementation as part of a distributed, cognitive architecture for mobile robots, and reports the evaluation results on a test suite.
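The core mechanism described above — merging visual and linguistic salience into one cross-modal model, then interpolating the resulting salience-derived distribution with the baseline word-class probabilities of the language model — can be illustrated with a minimal sketch. This is not the paper's implementation; all function names, weights, and the simple linear interpolation scheme are illustrative assumptions.

```python
from collections import defaultdict

def cross_modal_salience(visual, linguistic, w_vis=0.5, w_ling=0.5):
    """Merge visual and linguistic salience scores (hypothetical weights)
    into a single normalised cross-modal salience distribution."""
    entities = set(visual) | set(linguistic)
    merged = {e: w_vis * visual.get(e, 0.0) + w_ling * linguistic.get(e, 0.0)
              for e in entities}
    total = sum(merged.values()) or 1.0
    return {e: s / total for e, s in merged.items()}

def adapt_class_probs(base_probs, entity_words, salience, alpha=0.7):
    """Bias the language model's word-class probabilities towards words
    associated with salient entities, via linear interpolation between the
    baseline distribution and a salience-derived one, then renormalise."""
    boost = defaultdict(float)
    for entity, score in salience.items():
        for word in entity_words.get(entity, []):
            boost[word] += score
    total_boost = sum(boost.values()) or 1.0
    adapted = {w: alpha * p + (1 - alpha) * (boost.get(w, 0.0) / total_boost)
               for w, p in base_probs.items()}
    norm = sum(adapted.values())
    return {w: p / norm for w, p in adapted.items()}

# Example: a mug is visible in the scene, a ball was mentioned earlier.
visual = {"mug": 1.0}
linguistic = {"ball": 0.5}
salience = cross_modal_salience(visual, linguistic)
base = {"mug": 0.25, "ball": 0.25, "box": 0.25, "pen": 0.25}
entity_words = {"mug": ["mug"], "ball": ["ball"]}
adapted = adapt_class_probs(base, entity_words, salience)
```

Under this scheme, words denoting contextually salient objects (here "mug" and "ball") receive higher probability than words for absent objects, which is the priming effect the abstract describes; in practice the adapted probabilities would feed back into the recogniser's statistical language model.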




Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

Cite this paper

Lison, P. (2010). A Salience-Driven Approach to Speech Recognition for Human-Robot Interaction. In: Icard, T., Muskens, R. (eds) Interfaces: Explorations in Logic, Language and Computation. ESSLLI 2008 and ESSLLI 2009 Student Sessions. Lecture Notes in Computer Science (LNAI), vol 6211. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14729-6_8

  • DOI: https://doi.org/10.1007/978-3-642-14729-6_8

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-14728-9

  • Online ISBN: 978-3-642-14729-6

  • eBook Packages: Computer Science (R0)
