DOI: 10.1145/1040830.1040851

Multimodal new vocabulary recognition through speech and handwriting in a whiteboard scheduling application

Published: 10 January 2005

ABSTRACT

Our goal is to automatically recognize and enroll new vocabulary in a multimodal interface. To accomplish this, our technique aims to leverage the mutually disambiguating aspects of co-referenced, co-temporal handwriting and speech. The co-referenced semantics are spatially and temporally determined by our multimodal interface for schedule chart creation. This paper motivates and describes our technique for recognizing out-of-vocabulary (OOV) terms and enrolling them dynamically in the system. We report results for the detection and segmentation of OOV words within a small multimodal test set. On the same test set we also report utterance-, word- and pronunciation-level error rates, both over individual input modes and multimodally. We show that combining information from handwriting and speech yields significantly better results than are achievable by either mode alone.
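
The abstract's key mechanism, mutual disambiguation of co-referenced handwriting and speech, can be pictured with a toy fusion rule. The sketch below is a hypothetical simplification, not the system described in the paper: it assumes each recognizer emits an n-best list of candidate spellings with posterior scores and ranks candidates by the product of the two modes' scores. All names and values (fuse_nbest, the smoothing floor, the example word lists) are illustrative assumptions.

```python
# Toy illustration of mutual disambiguation between two recognizers.
# Hypothetical simplification: a real system aligns letter- and
# phone-level hypotheses rather than whole-word n-best lists.

def fuse_nbest(speech_nbest, handwriting_nbest, floor=1e-4):
    """Rank candidate spellings by the product of per-mode posteriors.

    Each argument maps a candidate word to that mode's posterior
    probability. A candidate missing from one mode's list is smoothed
    with `floor` (an assumed constant) rather than eliminated outright.
    """
    candidates = set(speech_nbest) | set(handwriting_nbest)
    scored = {w: speech_nbest.get(w, floor) * handwriting_nbest.get(w, floor)
              for w in candidates}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

# Speech confuses "fred" with the acoustically similar "thread";
# handwriting confuses it with the visually similar "frad". Only
# "fred" scores well in both modes, so the joint ranking recovers it.
speech = {"thread": 0.55, "fred": 0.40, "bread": 0.05}
handwriting = {"fred": 0.50, "frad": 0.45, "tred": 0.05}
best_word, best_score = fuse_nbest(speech, handwriting)[0]
print(best_word, best_score)  # fred 0.2
```

The paper's reported gains come from a far richer combination (OOV detection, segmentation, and pronunciation enrollment), but the same intuition applies: the two modes' errors are unlikely to coincide on the same wrong hypothesis.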

Published in
IUI '05: Proceedings of the 10th international conference on Intelligent user interfaces
January 2005, 344 pages
ISBN: 1581138946
DOI: 10.1145/1040830
Copyright © 2005 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates
Overall Acceptance Rate: 746 of 2,811 submissions, 27%
