Abstract
From wearable devices to depth cameras, researchers have exploited various multimodal data to recognize human actions for applications such as video gaming, education, and healthcare. Although many successful techniques have been presented in the literature, most current approaches focus on statistical or local spatiotemporal features and do not explicitly explore the temporal dynamics of the sensor data. However, human action data contain rich temporal structure that can characterize the unique underlying patterns of different action categories. From this perspective, we propose a novel temporal order modeling approach to human action recognition. Specifically, we explore subspace projections to extract latent temporal patterns from different human action sequences. The temporal order of these patterns is compared, and the index of the pattern that appears first is used to encode the entire sequence. Repeating this process multiple times produces a compact feature vector representing the temporal dynamics of the sequence. Human action recognition can then be solved efficiently by nearest neighbor search based on the Hamming distance between these compact feature vectors. We further introduce a sequential optimization algorithm to learn optimized projections that preserve the pairwise label similarity of the action sequences. Experimental results on two public human action datasets demonstrate the superior accuracy and efficiency of the proposed technique.
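The first-appearance encoding and Hamming-distance matching described above can be sketched as follows. This is a minimal illustration, not the paper's method: it uses random projections rather than the learned, label-similarity-preserving projections the abstract describes, and all names and dimensions (`fta_encode`, `K`, `M`) are assumptions for the sketch.

```python
import numpy as np

def fta_encode(seq, projection_groups):
    """Encode a sequence (T x D) by temporal order of projected patterns.

    For each group of K projections, project every frame, locate the time
    at which each projected pattern reaches its peak response, and record
    the index of the pattern that appears (peaks) first. Repeating this
    over M groups yields a compact M-dimensional code.
    """
    codes = []
    for P in projection_groups:           # P has shape (K, D)
        resp = seq @ P.T                  # (T, K) projected responses
        peak_time = resp.argmax(axis=0)   # peak time of each pattern
        codes.append(int(peak_time.argmin()))  # pattern appearing first
    return np.array(codes)

def hamming(a, b):
    """Hamming distance between two codes of equal length."""
    return int(np.sum(a != b))

rng = np.random.default_rng(0)
D, K, M = 6, 4, 32                        # feature dim, patterns per group, code length
projection_groups = [rng.standard_normal((K, D)) for _ in range(M)]

seq1 = rng.standard_normal((50, D))       # two toy action sequences
seq2 = rng.standard_normal((50, D))
c1 = fta_encode(seq1, projection_groups)
c2 = fta_encode(seq2, projection_groups)

print(hamming(c1, c1))                    # 0: identical sequences match exactly
print(hamming(c1, c2))                    # distance used for nearest neighbor search
```

Because each code entry is just a small integer index, a labeled database can be classified with a plain nearest-neighbor scan over Hamming distances, which is what makes the recognition step efficient.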
A Temporal Order Modeling Approach to Human Action Recognition from Multimodal Sensor Data