ABSTRACT
Our goal is to automatically recognize and enroll new vocabulary in a multimodal interface. To accomplish this, our technique leverages the mutually disambiguating aspects of co-referenced, co-temporal handwriting and speech. The co-referenced semantics are determined spatially and temporally by our multimodal interface for schedule chart creation. This paper motivates and describes our technique for recognizing out-of-vocabulary (OOV) terms and enrolling them dynamically in the system. We report results for the detection and segmentation of OOV words on a small multimodal test set. On the same test set we also report utterance-, word- and pronunciation-level error rates, both for the individual input modes and for their multimodal combination. We show that combining information from handwriting and speech yields significantly better results than either mode achieves alone.
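The core idea, fusing redundant handwriting and speech hypotheses to settle jointly on a spelling and a pronunciation for a new term, can be illustrated with a toy late-fusion sketch. Everything below is a hypothetical stand-in: the n-best lists, the letter-to-sound table, and the fusion weights are invented for illustration, and this is not the paper's actual implementation, which uses full recognizers rather than hand-built tables.

```python
# Minimal sketch of late-fusion mutual disambiguation for OOV enrollment.
# All names, hypotheses, and the letter-to-phone table are illustrative
# assumptions, not components of the system described in the paper.

from difflib import SequenceMatcher

# Hypothetical n-best lists from the two recognizers, with confidences.
handwriting_nbest = [("AESOP", 0.61), ("AES0P", 0.22), ("ALSOP", 0.17)]
speech_nbest = [(("iy", "s", "aa", "p"), 0.55), (("eh", "s", "ow", "p"), 0.45)]

# Naive letter-to-sound rules; a real system would use a trained
# letter-to-sound model instead of a fixed table.
L2P = {"A": ["ey"], "E": ["iy"], "S": ["s"], "O": ["ow"], "P": ["p"],
       "L": ["l"], "0": ["ow"]}

def spell_to_phones(word):
    """Predict a phone sequence from a spelling via the toy rules."""
    phones = []
    for ch in word:
        phones.extend(L2P.get(ch, []))
    return phones

def similarity(p1, p2):
    """Sequence similarity in [0, 1] between two phone sequences."""
    return SequenceMatcher(None, p1, p2).ratio()

def fuse(hw_nbest, sp_nbest, w_hw=0.5, w_sp=0.5):
    """Score every (spelling, pronunciation) pair; keep the best.

    A spelling is rewarded when its predicted phones agree with a
    spoken hypothesis, so each mode disambiguates the other.
    """
    best = None
    for spelling, hw_conf in hw_nbest:
        predicted = spell_to_phones(spelling)
        for phones, sp_conf in sp_nbest:
            score = w_hw * hw_conf + w_sp * sp_conf * similarity(predicted, list(phones))
            if best is None or score > best[0]:
                best = (score, spelling, phones)
    return best

score, spelling, pron = fuse(handwriting_nbest, speech_nbest)
print(f"Enroll '{spelling}' with pronunciation {pron} (score {score:.2f})")
```

In this sketch the misrecognized spelling "AES0P" loses because its predicted phones agree less well with either spoken hypothesis, which is the sense in which the two modes mutually disambiguate: evidence that is ambiguous in one channel is resolved by redundancy in the other.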