Complex event detection via attention-based video representation and classification

Published in: Multimedia Tools and Applications

Abstract

As an important task in managing unconstrained web videos, multimedia event detection (MED) has recently attracted wide attention. However, MED is quite challenging due to complexities such as the high abstraction of events, varied scenes, and frequent interactions among individuals. In this paper, we propose a novel MED algorithm based on attention-based video representation and classification. First, inspired by the human selective attention mechanism, an attention-based saliency localization network (ASLN) is constructed to quickly predict the semantically salient objects in video frames. Second, to represent salient objects and their surroundings in a complementary way, two Convolutional Neural Network (CNN) features, a local saliency feature and a global feature, are extracted from the salient objects and the whole feature map, respectively. Third, after concatenating the two features, the Vector of Locally Aggregated Descriptors (VLAD) is applied to encode them into the video representation. Finally, linear Support Vector Machine (SVM) classifiers are trained for event classification. We extensively evaluate performance on the TRECVID MED14_10Ex, MED14_100Ex and Columbia Consumer Video (CCV) datasets. Experimental results show that the proposed single model outperforms state-of-the-art approaches on all three real-world video datasets, demonstrating its effectiveness.
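
To make the pipeline concrete, below is a minimal sketch in Python (NumPy and scikit-learn) of the two stages the abstract states explicitly: VLAD encoding of per-frame descriptors into a single video vector, followed by a linear SVM classifier. The ASLN saliency network and the CNN feature extraction are out of scope here; `frame_features` stands in for the concatenated local-saliency and global CNN descriptors, and the descriptor dimension, codebook size and toy data are illustrative assumptions rather than the paper's actual settings.

```python
# Hedged sketch of the VLAD-encoding + linear-SVM stages described in the
# abstract. The saliency/CNN front end is omitted; shapes are assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC


def vlad_encode(frame_features: np.ndarray, codebook: KMeans) -> np.ndarray:
    """Encode per-frame descriptors (N x D) into one K*D VLAD vector."""
    centers = codebook.cluster_centers_             # (K, D) visual words
    assignments = codebook.predict(frame_features)  # nearest center per frame
    vlad = np.zeros_like(centers)
    for k in range(centers.shape[0]):
        members = frame_features[assignments == k]
        if len(members):
            # accumulate residuals between descriptors and their center
            vlad[k] = (members - centers[k]).sum(axis=0)
    # intra-normalize each residual block, then L2-normalize globally
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12
    vlad = vlad.ravel()
    return vlad / (np.linalg.norm(vlad) + 1e-12)


# Toy usage: random stand-ins for the CNN frame descriptors of 20 videos.
rng = np.random.default_rng(0)
descriptors = [rng.normal(size=(30, 128)) for _ in range(20)]  # 30 frames, D=128
labels = rng.integers(0, 2, size=20)                           # binary event label

codebook = KMeans(n_clusters=16, n_init=10, random_state=0)
codebook.fit(np.vstack(descriptors))                # learn K=16 VLAD centers

X = np.stack([vlad_encode(d, codebook) for d in descriptors])
clf = LinearSVC(C=1.0).fit(X, labels)               # linear SVM on video vectors
print(clf.predict(X[:3]))
```

Intra-normalizing each residual block before the global L2 normalization is the common VLAD practice for suppressing bursty components; the paper's exact normalization scheme may differ.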


Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grants 61471049, 61372169 and 61532018.

Author information

Corresponding author

Correspondence to Zhicheng Zhao.

About this article

Cite this article

Zhao, Z., Xiang, R. & Su, F. Complex event detection via attention-based video representation and classification. Multimed Tools Appl 77, 3209–3227 (2018). https://doi.org/10.1007/s11042-017-5058-2
