Abstract
From wearable devices to depth cameras, researchers have exploited various multimodal data to recognize human actions for applications such as video gaming, education, and healthcare. Although many successful techniques have been presented in the literature, most current approaches focus on statistical or local spatiotemporal features and do not explicitly explore the temporal dynamics of the sensor data. However, human action data contain rich temporal structure that can characterize the unique underlying patterns of different action categories. From this perspective, we propose a novel temporal order modeling approach to human action recognition. Specifically, we explore subspace projections to extract latent temporal patterns from different human action sequences. The temporal order of these patterns is compared, and the index of the pattern that appears first is used to encode the entire sequence. Repeating this process multiple times produces a compact feature vector representing the temporal dynamics of the sequence. Human action recognition can then be solved efficiently by nearest neighbor search based on the Hamming distance between these compact feature vectors. We further introduce a sequential optimization algorithm to learn optimized projections that preserve the pairwise label similarity of the action sequences. Experimental results on two public human action datasets demonstrate the superior accuracy and efficiency of the proposed technique.
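The first-appearance encoding and Hamming-distance matching described above can be sketched as follows. This is a minimal illustration, not the paper's method: it uses random projections rather than the learned, label-similarity-preserving projections the abstract describes, and all names and dimensions (`fta_encode`, `K`, `M`) are assumptions for the sketch.

```python
import numpy as np

def fta_encode(seq, projection_groups):
    """Encode a sequence (T x D) by temporal order of projected patterns.

    For each group of K projections, project every frame, locate the time
    at which each projected pattern reaches its peak response, and record
    the index of the pattern that appears (peaks) first. Repeating this
    over M groups yields a compact M-dimensional code.
    """
    codes = []
    for P in projection_groups:           # P has shape (K, D)
        resp = seq @ P.T                  # (T, K) projected responses
        peak_time = resp.argmax(axis=0)   # peak time of each pattern
        codes.append(int(peak_time.argmin()))  # pattern appearing first
    return np.array(codes)

def hamming(a, b):
    """Hamming distance between two codes of equal length."""
    return int(np.sum(a != b))

rng = np.random.default_rng(0)
D, K, M = 6, 4, 32                        # feature dim, patterns per group, code length
projection_groups = [rng.standard_normal((K, D)) for _ in range(M)]

seq1 = rng.standard_normal((50, D))       # two toy action sequences
seq2 = rng.standard_normal((50, D))
c1 = fta_encode(seq1, projection_groups)
c2 = fta_encode(seq2, projection_groups)

print(hamming(c1, c1))                    # 0: identical sequences match exactly
print(hamming(c1, c2))                    # distance used for nearest neighbor search
```

Because each code entry is just a small integer index, a labeled database can be classified with a plain nearest-neighbor scan over Hamming distances, which is what makes the recognition step efficient.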
A Temporal Order Modeling Approach to Human Action Recognition from Multimodal Sensor Data