
Recognizing Human Actions with Outlier Frames by Observation Filtering and Completion

Published: 28 June 2017

Abstract

This article addresses the problem of recognizing partially observed human actions. Videos of actions acquired in the real world often contain corrupt frames caused by various factors. These frames may appear irregularly and make the actions only partially observed. They change the appearance of actions and degrade the performance of pretrained recognition systems. In this article, we propose an approach that addresses the corrupt-frame problem without knowing the frames' locations or durations in advance. The proposed approach includes two key components: outlier filtering and observation completion. The former identifies and filters out unobserved frames, and the latter fills in the filtered parts by retrieving coherent alternatives from training data. Hidden Conditional Random Fields (HCRFs) are then used to recognize the filtered and completed actions. Our approach has been evaluated on three datasets, which contain both fully observed actions and partially observed actions with either real or synthetic corrupt frames. The experimental results show that our approach performs favorably against other state-of-the-art methods, especially when corrupt frames are present.
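The two-stage pipeline described above can be sketched in a minimal form. Note this is only an illustrative toy, not the paper's actual models: it assumes frames are fixed-length feature vectors, flags a frame as corrupt when it lies far (relative to the median) from every training frame, and "completes" it by borrowing the training frame closest to the nearest clean frame in time. The function name `filter_and_complete` and the median-ratio test are my own simplifications.

```python
import numpy as np

def filter_and_complete(frames, train_frames, ratio=3.0):
    """Toy sketch of outlier filtering + observation completion.

    frames:       (T, d) array of per-frame feature vectors.
    train_frames: (N, d) array of coherent training-frame features.
    """
    # Distance from every frame to its nearest training frame.
    dists = np.array([np.min(np.linalg.norm(train_frames - f, axis=1))
                      for f in frames])
    # Outlier filtering: a frame far from all training frames,
    # relative to the median distance, is treated as corrupt.
    thresh = ratio * (np.median(dists) + 1e-8)
    outliers = dists > thresh
    # Observation completion: replace each corrupt frame with the
    # training frame closest to its nearest clean neighbor in time.
    completed = frames.copy()
    clean_idx = np.where(~outliers)[0]
    for i in np.where(outliers)[0]:
        anchor = frames[clean_idx[np.argmin(np.abs(clean_idx - i))]]
        j = np.argmin(np.linalg.norm(train_frames - anchor, axis=1))
        completed[i] = train_frames[j]
    return completed, outliers
```

The filtered-and-completed sequence would then be passed to a sequence classifier such as an HCRF, which is what the paper uses for the final recognition step.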


Cited By

  • (2022) "Cross-Domain Knowledge Transfer for Skeleton-based Action Recognition based on Graph Convolutional Gradient Reversal Layer." 2022 IEEE 5th International Conference on Multimedia Information Processing and Retrieval (MIPR), 387--390. DOI: 10.1109/MIPR54900.2022.00076. Online publication date: Aug 2022.
  • (2020) "You2Me: Inferring Body Pose in Egocentric Video via First and Second Person Interactions." 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 9887--9897. DOI: 10.1109/CVPR42600.2020.00991. Online publication date: Jun 2020.
  • (2017) "Online Early-Late Fusion Based on Adaptive HMM for Sign Language Recognition." ACM Transactions on Multimedia Computing, Communications, and Applications 14, 1 (2017), 1--18. DOI: 10.1145/3152121. Online publication date: 20 Dec 2017.


    Published In

    ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 13, Issue 3
    August 2017, 233 pages
    ISSN: 1551-6857
    EISSN: 1551-6865
    DOI: 10.1145/3104033

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 28 June 2017
    Accepted: 01 April 2017
    Revised: 01 April 2017
    Received: 01 September 2016
    Published in TOMM Volume 13, Issue 3


    Author Tags

    1. Human action recognition
    2. conditional random fields
    3. early prediction
    4. gap filling
    5. observation completion
    6. outlier filtering

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • Ministry of Science and Technology of the Republic of China


