Abstract
We present a simple yet effective approach to human action recognition. Most existing solutions based on multi-class action classification aim to assign a single class label to the input video. However, the variety and complexity of real-life videos make it very challenging to achieve high classification accuracy. To address this problem, we propose to partition the input video into small clips and formulate action recognition as a joint decision-making task. First, we partition each video into two equal segments that are processed in the same manner. We repeat this procedure to obtain three layers of video subsegments, which are organized in a binary tree structure. We train separate classifiers for each layer. By applying the corresponding classifiers to the video subsegments, we obtain a decision value matrix (DVM). We then construct an aggregated representation of the original full-length video by integrating the elements of the DVM. Finally, we train a new action recognition classifier on this DVM representation. Extensive experimental evaluations demonstrate that the proposed method achieves significant performance improvements over several competing methods on two benchmark datasets.
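The pipeline described above can be sketched in a few lines. The snippet below is an illustrative reading of the abstract, not the authors' implementation: `binary_tree_segments` recursively halves a frame range into layered subsegments, and `decision_value_matrix` stacks per-segment classifier outputs into one aggregated vector. The `classifiers` and `features` arguments are hypothetical placeholders (e.g. per-layer SVMs exposing `decision_function` and a feature extractor over frame ranges); the exact layering and aggregation in the paper may differ.

```python
import numpy as np


def binary_tree_segments(num_frames, num_layers=3):
    """Halve [0, num_frames) recursively into a binary tree of subsegments.

    Returns a list of layers; layer k holds 2**k equal segments
    (layer 0 is the full clip). One plausible reading of the
    three-layer structure described in the abstract.
    """
    layers = []
    for k in range(num_layers):
        n = 2 ** k
        bounds = np.linspace(0, num_frames, n + 1, dtype=int)
        layers.append([(int(bounds[i]), int(bounds[i + 1])) for i in range(n)])
    return layers


def decision_value_matrix(layers, classifiers, features):
    """Stack per-segment decision values into one aggregated DVM vector.

    `classifiers[k]` is assumed to expose `decision_function` (e.g. an
    SVM trained on layer-k segments); `features(a, b)` returns a feature
    vector for frames [a, b). Both are placeholders for illustration.
    """
    rows = []
    for k, segments in enumerate(layers):
        for a, b in segments:
            rows.append(np.ravel(classifiers[k].decision_function(features(a, b))))
    # Concatenated decision values form the representation fed
    # to the final full-video classifier.
    return np.concatenate(rows)
```

For a 120-frame video, `binary_tree_segments(120)` yields the full clip, its two halves, and four quarters, on which the layer-specific classifiers would be evaluated.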
Acknowledgments
This work was supported in part by the National Science Foundation of China (No. 61472103) and an Australian Research Council (ARC) grant (DP150104645). We would especially like to thank the China Scholarship Council (CSC) for funding the first author to conduct part of this project at the Australian National University.
Cite this article
Zheng, Y., Yao, H., Sun, X. et al. Breaking video into pieces for action recognition. Multimed Tools Appl 76, 22195–22212 (2017). https://doi.org/10.1007/s11042-017-5038-6