Abstract
Human-computer conversation has attracted a great deal of interest, especially in virtual worlds. Advances in speech recognition, language understanding, and speech synthesis have given rise to spoken dialogue systems. This work surveys the state of the art of spoken dialogue systems. Current dialogue system technologies and approaches are first introduced, emphasizing the differences between them; speech recognition, speech synthesis, and language understanding are then presented as complementary and necessary modules. Moreover, as the development of spoken dialogue systems becomes more complex, processes must be defined to evaluate their performance. Wizard-of-Oz techniques play an important role in this task: they provide the dialogue corpora needed to achieve good performance. A description of this technique is given in this work, together with perspectives on multimodal dialogue systems in virtual worlds.
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Chollet, G., Amehraye, A., Razik, J., Zouari, L., Khemiri, H., Mokbel, C. (2010). Spoken Dialogue in Virtual Worlds. In: Esposito, A., Campbell, N., Vogel, C., Hussain, A., Nijholt, A. (eds) Development of Multimodal Interfaces: Active Listening and Synchrony. Lecture Notes in Computer Science, vol 5967. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12397-9_36
DOI: https://doi.org/10.1007/978-3-642-12397-9_36
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-12396-2
Online ISBN: 978-3-642-12397-9