ABSTRACT
Our goal is to automatically recognize and enroll new vocabulary in a multimodal interface. To accomplish this, our technique leverages the mutually disambiguating aspects of co-referenced, co-temporal handwriting and speech. The co-referenced semantics are determined spatially and temporally by our multimodal interface for schedule chart creation. This paper motivates and describes our technique for recognizing out-of-vocabulary (OOV) terms and enrolling them dynamically in the system. We report results for the detection and segmentation of OOV words on a small multimodal test set. On the same test set we also report utterance-, word- and pronunciation-level error rates, both for the individual input modes and for their multimodal combination. We show that combining information from handwriting and speech yields significantly better results than either mode achieves alone.
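The core idea, fusing redundant handwriting and speech hypotheses to settle jointly on a spelling and a pronunciation for a new term, can be illustrated with a toy late-fusion sketch. Everything below is a hypothetical stand-in: the n-best lists, the letter-to-sound table, and the fusion weights are invented for illustration, and this is not the paper's actual implementation, which uses full recognizers rather than hand-built tables.

```python
# Minimal sketch of late-fusion mutual disambiguation for OOV enrollment.
# All names, hypotheses, and the letter-to-phone table are illustrative
# assumptions, not components of the system described in the paper.

from difflib import SequenceMatcher

# Hypothetical n-best lists from the two recognizers, with confidences.
handwriting_nbest = [("AESOP", 0.61), ("AES0P", 0.22), ("ALSOP", 0.17)]
speech_nbest = [(("iy", "s", "aa", "p"), 0.55), (("eh", "s", "ow", "p"), 0.45)]

# Naive letter-to-sound rules; a real system would use a trained
# letter-to-sound model instead of a fixed table.
L2P = {"A": ["ey"], "E": ["iy"], "S": ["s"], "O": ["ow"], "P": ["p"],
       "L": ["l"], "0": ["ow"]}

def spell_to_phones(word):
    """Predict a phone sequence from a spelling via the toy rules."""
    phones = []
    for ch in word:
        phones.extend(L2P.get(ch, []))
    return phones

def similarity(p1, p2):
    """Sequence similarity in [0, 1] between two phone sequences."""
    return SequenceMatcher(None, p1, p2).ratio()

def fuse(hw_nbest, sp_nbest, w_hw=0.5, w_sp=0.5):
    """Score every (spelling, pronunciation) pair; keep the best.

    A spelling is rewarded when its predicted phones agree with a
    spoken hypothesis, so each mode disambiguates the other.
    """
    best = None
    for spelling, hw_conf in hw_nbest:
        predicted = spell_to_phones(spelling)
        for phones, sp_conf in sp_nbest:
            score = w_hw * hw_conf + w_sp * sp_conf * similarity(predicted, list(phones))
            if best is None or score > best[0]:
                best = (score, spelling, phones)
    return best

score, spelling, pron = fuse(handwriting_nbest, speech_nbest)
print(f"Enroll '{spelling}' with pronunciation {pron} (score {score:.2f})")
```

In this sketch the misrecognized spelling "AES0P" loses because its predicted phones agree less well with either spoken hypothesis, which is the sense in which the two modes mutually disambiguate: evidence that is ambiguous in one channel is resolved by redundancy in the other.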