DOI: 10.1145/3271553.3271563
Research Article

Action Recognition by Jointly Using Video Proposal and Trajectory

Published: 27 August 2018

Abstract

Human action recognition in videos is a popular yet challenging task in the computer vision community. In recent years, trajectory-based methods have proven effective for action recognition. However, because trajectories are generated around motion regions, trajectory-based methods tend to attend only to regions of high motion salience in a video and to ignore motionless but semantically meaningful objects. To compensate for this shortcoming, this paper exploits video proposals for their ability to discover semantic objects. In the proposed method, video proposals and trajectories are extracted simultaneously to capture both object and motion information. The method consists of three steps: 1) trajectories and video proposals are extracted from the video to capture motion and object information, respectively; 2) a trained Convolutional Neural Network (CNN) model is employed to describe the extracted trajectories and video proposals; 3) a holistic video representation is constructed with the Fisher Vector model and fed to a classifier to obtain the action label. The complementarity between trajectories and video proposals gives the proposed method discriminative power across a wide range of actions. The method is evaluated on UCF101 and HMDB51, where promising results demonstrate its effectiveness.
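Step 3 of the pipeline above (encoding a video's local descriptors into one holistic Fisher Vector) can be sketched as follows. This is a minimal illustration under common assumptions (a diagonal-covariance GMM, gradients with respect to means and standard deviations, power and L2 normalization, as in Sánchez et al., IJCV 2013), using random stand-in arrays in place of the CNN descriptors; it is not the authors' actual implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(descriptors, gmm):
    """Encode local descriptors (N x D) as a Fisher Vector of length 2*K*D,
    given a fitted diagonal-covariance GMM with K components."""
    n = descriptors.shape[0]
    gamma = gmm.predict_proba(descriptors)            # (N, K) soft assignments
    w, mu, var = gmm.weights_, gmm.means_, gmm.covariances_
    sigma = np.sqrt(var)                              # (K, D) per-dim std-devs

    # Normalized deviations of each descriptor from each component mean
    diff = (descriptors[:, None, :] - mu[None, :, :]) / sigma          # (N, K, D)
    # Gradients w.r.t. means and standard deviations, averaged over descriptors
    g_mu = (gamma[:, :, None] * diff).sum(0) / (n * np.sqrt(w)[:, None])
    g_sig = (gamma[:, :, None] * (diff ** 2 - 1)).sum(0) / (n * np.sqrt(2 * w)[:, None])

    fv = np.hstack([g_mu.ravel(), g_sig.ravel()])     # length 2*K*D
    fv = np.sign(fv) * np.sqrt(np.abs(fv))            # power (signed-sqrt) normalization
    return fv / (np.linalg.norm(fv) + 1e-12)          # L2 normalization

rng = np.random.default_rng(0)
train = rng.normal(size=(500, 16))                    # stand-in for CNN descriptors
gmm = GaussianMixture(n_components=8, covariance_type="diag",
                      random_state=0).fit(train)
video_fv = fisher_vector(rng.normal(size=(40, 16)), gmm)
print(video_fv.shape)                                 # 2 * 8 components * 16 dims = (256,)
```

The resulting fixed-length vector can then be passed to a linear classifier (e.g. a linear SVM) regardless of how many trajectories or proposals a video contains, which is what makes this encoding suitable as the holistic video representation.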


Cited By

  • (2020) Posture Recognition Technology Based on Kinect. IEICE Transactions on Information and Systems, E103.D(3):621-630. DOI: 10.1587/transinf.2019EDP7221. Online publication date: 1 March 2020.
  • (2020) Spatial attention based visual semantic learning for action recognition in still images. Neurocomputing, 413:383-396. DOI: 10.1016/j.neucom.2020.07.016. Online publication date: November 2020.
  • (2019) Unsupervised Learning of Human Action Categories in Still Images with Deep Representations. ACM Transactions on Multimedia Computing, Communications, and Applications, 15(4):1-20. DOI: 10.1145/3362161. Online publication date: 16 December 2019.

Published In

ICVISP 2018: Proceedings of the 2nd International Conference on Vision, Image and Signal Processing
August 2018
402 pages
ISBN:9781450365291
DOI:10.1145/3271553
© 2018 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Action recognition
  2. CNN
  3. Trajectory
  4. Video proposal

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICVISP 2018

Acceptance Rates

Overall Acceptance Rate 186 of 424 submissions, 44%

