
Using speech to identify gesture pen strokes in collaborative, multimodal device descriptions

Published online by Cambridge University Press:  11 July 2011

James Herold
Affiliation:
Department of Computer Science and Engineering, University of California, Riverside, California, USA
Thomas F. Stahovich
Affiliation:
Department of Mechanical Engineering, University of California, Riverside, California, USA

Abstract

One challenge in building collaborative design tools that use speech and sketch input is distinguishing gesture pen strokes from those representing device structure, that is, object strokes. In previous work, we developed a gesture/object classifier that uses features computed from the pen strokes and the speech aligned with them. Experiments indicated that the speech features were the most important for distinguishing gestures, underscoring the critical importance of the speech–sketch alignment. Consequently, we have developed a new alignment technique that employs a two-step process: the speech is first explicitly segmented (primarily into clauses), and the segments are then aligned with the pen strokes. Our speech segmentation step is unique in that it uses sketch features to locate segment boundaries in multimodal dialog. In addition, it uses a single classifier to directly combine word-based, prosodic (pause), and sketch-based features. In the second step, segments are initially aligned with strokes based on temporal correlation, and classifiers are then used to detect and correct two common alignment errors. Our two-step technique has proven to be substantially more accurate at alignment than the existing technique, which lacked explicit segmentation. More importantly, in nearly all cases our new technique yields greater gesture classification accuracy than the existing technique and performs nearly as well as the benchmark manual speech–sketch alignment.
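The two-step process described above can be sketched in code. The following Python fragment is a minimal, illustrative sketch only: the data classes, the feature choices, the pause threshold, and the rule standing in for the trained segment-boundary classifier are assumptions made for illustration, not the authors' implementation, and the classifiers the paper uses to detect and correct the two common alignment errors are omitted.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float

@dataclass
class Stroke:
    start: float  # seconds
    end: float

def boundary_features(prev: Word, cur: Word, strokes: List[Stroke]) -> List[float]:
    """Hypothetical features for the gap between two consecutive words:
    pause length (prosodic), a crude lexical clause cue (word-based), and
    whether a pen stroke starts during the pause (sketch-based)."""
    pause = cur.start - prev.end
    clause_cue = 1.0 if cur.text.lower() in {"and", "then", "so", "when"} else 0.0
    stroke_onset = 1.0 if any(prev.end <= s.start <= cur.start for s in strokes) else 0.0
    return [pause, clause_cue, stroke_onset]

def is_boundary(features: List[float]) -> bool:
    """Stand-in for the trained boundary classifier: flag a boundary when a
    long pause coincides with a clause cue or a stroke onset."""
    pause, clause_cue, stroke_onset = features
    return pause > 0.5 and (clause_cue > 0 or stroke_onset > 0)

def segment_speech(words: List[Word], strokes: List[Stroke]) -> List[List[Word]]:
    """Step 1: split the word stream into segments (roughly clauses)."""
    if not words:
        return []
    segments, current = [], [words[0]]
    for prev, cur in zip(words, words[1:]):
        if is_boundary(boundary_features(prev, cur, strokes)):
            segments.append(current)
            current = []
        current.append(cur)
    segments.append(current)
    return segments

def align(segments: List[List[Word]],
          strokes: List[Stroke]) -> List[Tuple[List[Word], List[Stroke]]]:
    """Step 2: attach to each segment the strokes that overlap it in time.
    The error-detection/correction classifiers would refine this result."""
    out = []
    for seg in segments:
        s0, s1 = seg[0].start, seg[-1].end
        out.append((seg, [st for st in strokes
                          if min(s1, st.end) - max(s0, st.start) > 0]))
    return out

if __name__ == "__main__":
    words = [Word("the", 0.0, 0.2), Word("piston", 0.3, 0.7),
             Word("then", 1.5, 1.7), Word("it", 1.8, 1.9), Word("moves", 2.0, 2.4)]
    strokes = [Stroke(0.1, 0.9), Stroke(1.4, 2.3)]
    for seg, strks in align(segment_speech(words, strokes), strokes):
        print([w.text for w in seg], len(strks), "stroke(s)")
```

In this toy example the long pause before "then", together with the onset of the second stroke, triggers a segment boundary, and each resulting segment is paired with the stroke that overlaps it in time.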

Type
Special Issue Articles
Copyright
Copyright © Cambridge University Press 2011

