
Recognizing Human Actions with Outlier Frames by Observation Filtering and Completion

Published: 28 June 2017

Abstract

This article addresses the problem of recognizing partially observed human actions. Videos of actions acquired in the real world often contain corrupt frames caused by various factors. These frames may appear irregularly and make the actions only partially observed. They change the appearance of actions and degrade the performance of pretrained recognition systems. In this article, we propose an approach that addresses the corrupt-frame problem without knowing the frames' locations or durations in advance. The proposed approach includes two key components: outlier filtering and observation completion. The former identifies and filters out unobserved frames, and the latter fills in the filtered parts by retrieving coherent alternatives from training data. Hidden Conditional Random Fields (HCRFs) are then used to recognize the filtered and completed actions. Our approach has been evaluated on three datasets, which contain both fully observed actions and partially observed actions with either real or synthetic corrupt frames. The experimental results show that our approach performs favorably against other state-of-the-art methods, especially when corrupt frames are present.
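The two-stage pipeline described above can be sketched in a minimal form. Note this is only an illustrative toy, not the paper's actual models: it assumes frames are fixed-length feature vectors, flags a frame as corrupt when it lies far (relative to the median) from every training frame, and "completes" it by borrowing the training frame closest to the nearest clean frame in time. The function name `filter_and_complete` and the median-ratio test are my own simplifications.

```python
import numpy as np

def filter_and_complete(frames, train_frames, ratio=3.0):
    """Toy sketch of outlier filtering + observation completion.

    frames:       (T, d) array of per-frame feature vectors.
    train_frames: (N, d) array of coherent training-frame features.
    """
    # Distance from every frame to its nearest training frame.
    dists = np.array([np.min(np.linalg.norm(train_frames - f, axis=1))
                      for f in frames])
    # Outlier filtering: a frame far from all training frames,
    # relative to the median distance, is treated as corrupt.
    thresh = ratio * (np.median(dists) + 1e-8)
    outliers = dists > thresh
    # Observation completion: replace each corrupt frame with the
    # training frame closest to its nearest clean neighbor in time.
    completed = frames.copy()
    clean_idx = np.where(~outliers)[0]
    for i in np.where(outliers)[0]:
        anchor = frames[clean_idx[np.argmin(np.abs(clean_idx - i))]]
        j = np.argmin(np.linalg.norm(train_frames - anchor, axis=1))
        completed[i] = train_frames[j]
    return completed, outliers
```

The filtered-and-completed sequence would then be passed to a sequence classifier such as an HCRF, which is what the paper uses for the final recognition step.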


Cited By

  • (2022) "Cross-Domain Knowledge Transfer for Skeleton-based Action Recognition based on Graph Convolutional Gradient Reversal Layer." 2022 IEEE 5th International Conference on Multimedia Information Processing and Retrieval (MIPR), 387--390. DOI: 10.1109/MIPR54900.2022.00076. Online publication date: Aug 2022.
  • (2020) "You2Me: Inferring Body Pose in Egocentric Video via First and Second Person Interactions." 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 9887--9897. DOI: 10.1109/CVPR42600.2020.00991. Online publication date: Jun 2020.
  • (2017) "Online Early-Late Fusion Based on Adaptive HMM for Sign Language Recognition." ACM Transactions on Multimedia Computing, Communications, and Applications 14, 1 (2017), 1--18. DOI: 10.1145/3152121. Online publication date: 20 Dec 2017.


    Published In

    ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 13, Issue 3
    August 2017, 233 pages
    ISSN: 1551-6857
    EISSN: 1551-6865
    DOI: 10.1145/3104033

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 28 June 2017
    Accepted: 01 April 2017
    Revised: 01 April 2017
    Received: 01 September 2016
    Published in TOMM Volume 13, Issue 3


    Author Tags

    1. Human action recognition
    2. conditional random fields
    3. early prediction
    4. gap filling
    5. observation completion
    6. outlier filtering

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • Ministry of Science and Technology of the Republic of China


