Complex event detection via attention-based video representation and classification

Published in: Multimedia Tools and Applications

Abstract

As an important task in managing unconstrained web videos, multimedia event detection (MED) has recently attracted wide attention. However, MED is quite challenging due to complexities such as the high abstraction of events, varied scenes, and frequent interactions among individuals. In this paper, we propose a novel MED algorithm based on attention-based video representation and classification. First, inspired by the human selective attention mechanism, an attention-based saliency localization network (ASLN) is constructed to quickly predict the semantically salient objects in video frames. Second, to represent salient objects and their surroundings in a complementary way, two Convolutional Neural Network (CNN) features, a local saliency feature and a global feature, are extracted from the salient objects and the whole feature map, respectively. Third, after concatenating the two features, the Vector of Locally Aggregated Descriptors (VLAD) is applied to encode them into the video representation. Finally, linear Support Vector Machine (SVM) classifiers are trained for event classification. We extensively evaluate performance on the TRECVID MED14_10Ex, MED14_100Ex and Columbia Consumer Video (CCV) datasets. Experimental results show that the proposed single model outperforms state-of-the-art approaches on all three real-world video datasets, demonstrating its effectiveness.
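
To make the pipeline concrete, below is a minimal sketch in Python (NumPy and scikit-learn) of the two stages the abstract states explicitly: VLAD encoding of per-frame descriptors into a single video vector, followed by a linear SVM classifier. The ASLN saliency network and the CNN feature extraction are out of scope here; `frame_features` stands in for the concatenated local-saliency and global CNN descriptors, and the descriptor dimension, codebook size and toy data are illustrative assumptions rather than the paper's actual settings.

```python
# Hedged sketch of the VLAD-encoding + linear-SVM stages described in the
# abstract. The saliency/CNN front end is omitted; shapes are assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC


def vlad_encode(frame_features: np.ndarray, codebook: KMeans) -> np.ndarray:
    """Encode per-frame descriptors (N x D) into one K*D VLAD vector."""
    centers = codebook.cluster_centers_             # (K, D) visual words
    assignments = codebook.predict(frame_features)  # nearest center per frame
    vlad = np.zeros_like(centers)
    for k in range(centers.shape[0]):
        members = frame_features[assignments == k]
        if len(members):
            # accumulate residuals between descriptors and their center
            vlad[k] = (members - centers[k]).sum(axis=0)
    # intra-normalize each residual block, then L2-normalize globally
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12
    vlad = vlad.ravel()
    return vlad / (np.linalg.norm(vlad) + 1e-12)


# Toy usage: random stand-ins for the CNN frame descriptors of 20 videos.
rng = np.random.default_rng(0)
descriptors = [rng.normal(size=(30, 128)) for _ in range(20)]  # 30 frames, D=128
labels = rng.integers(0, 2, size=20)                           # binary event label

codebook = KMeans(n_clusters=16, n_init=10, random_state=0)
codebook.fit(np.vstack(descriptors))                # learn K=16 VLAD centers

X = np.stack([vlad_encode(d, codebook) for d in descriptors])
clf = LinearSVC(C=1.0).fit(X, labels)               # linear SVM on video vectors
print(clf.predict(X[:3]))
```

Intra-normalizing each residual block before the global L2 normalization is the common VLAD practice for suppressing bursty components; the paper's exact normalization scheme may differ.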


Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grants 61471049, 61372169 and 61532018.

Author information

Corresponding author

Correspondence to Zhicheng Zhao.

About this article

Cite this article

Zhao, Z., Xiang, R. & Su, F. Complex event detection via attention-based video representation and classification. Multimed Tools Appl 77, 3209–3227 (2018). https://doi.org/10.1007/s11042-017-5058-2
