Abstract
This paper presents the results of an experiment comparing two designs of an automated dialog interface: a multimodal design that coordinates text displays with spoken prompts, and a voice-only version of the same application. Our results show that the text-coordinated version is more efficient in terms of word recognition and number of out-of-grammar responses, and equals the voice-only version in user satisfaction. We argue that this type of multimodal dialog interface effectively constrains user responses, allowing better speech recognition without increasing cognitive load or compromising the naturalness of the interaction.
Cite this article
Baker, K., Mckenzie, A., Biermann, A. et al. Constraining User Response via Multimodal Dialog Interface. International Journal of Speech Technology 7, 251–258 (2004). https://doi.org/10.1023/B:IJST.0000037069.82313.57
DOI: https://doi.org/10.1023/B:IJST.0000037069.82313.57