Abstract
Vision-based sign language recognition has attracted increasing interest from researchers in computer vision. In this article, we propose a novel algorithm to model and recognize sign language performed in front of a Microsoft Kinect sensor. Under the assumption that some frames in a sign language video are both discriminative and representative, we first assign a binary latent variable to each frame in the training videos to indicate its discriminative capability, and then develop a latent support vector machine model that classifies the signs while localizing the discriminative and representative frames in each video. In addition, we combine the depth maps with the color images captured by the Kinect sensor to obtain more effective and accurate features that enhance recognition accuracy. To evaluate our approach, we conducted experiments on both word-level and sentence-level sign language. An American Sign Language dataset containing approximately 2,000 word-level phrases and 2,000 sentence-level phrases was collected with the Kinect sensor; each phrase contains color, depth, and skeleton information. Experiments on this dataset demonstrate the effectiveness of the proposed method for sign language recognition.
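The idea above, per-frame binary latent variables that mark discriminative frames, optimized jointly with a max-margin classifier, can be sketched as a toy alternating procedure: with the model fixed, select the frames each video's latent variables would mark as discriminative; with the selection fixed, update a linear SVM. The frame-selection rule, the mean-pooling, and the subgradient hinge-loss solver below are illustrative assumptions for a minimal sketch, not the authors' implementation (which also handles representative-frame localization and depth-feature fusion).

```python
import numpy as np

def select_frames(w, frames, k):
    """Latent step: with the model fixed, mark the k frames that score
    highest under the current weights as the discriminative ones."""
    return np.argsort(frames @ w)[-k:]

def video_feature(frames, idx):
    """Pool the selected (putatively discriminative) frames into one
    video-level descriptor."""
    return frames[idx].mean(axis=0)

def train_latent_svm(videos, labels, k=2, outer_iters=5, inner_iters=20,
                     lr=0.1, lam=0.01):
    """Alternate (1) latent frame selection and (2) subgradient updates
    of a linear SVM on the hinge loss with L2 regularization."""
    w = np.zeros(videos[0].shape[1])
    for _ in range(outer_iters):
        # 1) Latent step: re-estimate each video's discriminative frames.
        feats = np.array([video_feature(v, select_frames(w, v, k))
                          for v in videos])
        # 2) SVM step: a few hinge-loss subgradient passes.
        for _ in range(inner_iters):
            margins = labels * (feats @ w)
            viol = margins < 1  # margin violators drive the update
            grad = lam * w - (labels[viol, None] * feats[viol]).sum(0) / len(videos)
            w -= lr * grad
    return w

def predict(w, video, k=2):
    """Classify a video by pooling its top-k frames under the model."""
    return np.sign(video_feature(video, select_frames(w, video, k)) @ w)
```

In the toy data below, each video has two frames carrying the class signal plus one shared "noise" frame; alternation lets the model concentrate its score on the signal frames rather than averaging over every frame.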
Latent Support Vector Machine Modeling for Sign Language Recognition with Kinect