Abstract
In this paper, we propose a multi-modal event recognition framework that integrates feature fusion, deep learning, scene classification, and decision fusion. The video decomposition process identifies frames, shots, and scenes, and events are modeled using the features of these physical video parts and the relations between them. Event modeling is achieved through visual concept learning, scene segmentation, and association rule mining. Visual concept learning is employed to bridge the semantic gap between the visual content and the textual descriptors of the events. Association rules are discovered by a specialized association rule mining algorithm in which the proposed strategy integrates temporality into the rule discovery process. In addition to frames, shots, and scenes, the concept of a scene segment is introduced to define and extract the elements of association rules. Diverse feature sources, such as audio, motion, keypoint descriptors, temporal occurrence characteristics, and the fully connected layer outputs of a CNN model, are combined through feature fusion. The proposed decision fusion approach employs logistic regression to express the relation between the dependent variable (the event type) and the independent variables (the classifiers' outputs) in terms of decision weights. Multi-modal fusion-based scene classifiers are employed for event recognition. Rule-based event modeling and multi-modal fusion are shown to be promising approaches to event recognition; the decision fusion results are encouraging, and the proposed algorithm remains open to the fusion of new feature sources and to the integration of new event types. The accuracy of the proposed methodology is evaluated on the CCV and Hollywood2 datasets for event recognition, and the results are compared with benchmark implementations from the literature.
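The decision fusion idea described above — logistic regression relating the event label to the individual classifiers' outputs, with the learned coefficients acting as decision weights — can be illustrated with a minimal sketch. The data here are synthetic and the three "classifiers" are stand-ins (e.g. audio-, motion-, and CNN-based scene classifiers); this is not the paper's implementation, only an assumption-laden toy of the fusion step.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic setup: 200 scenes with a binary event label, and confidence
# scores from three hypothetical classifiers (audio, motion, CNN).
rng = np.random.default_rng(0)
n = 200
labels = rng.integers(0, 2, n)

# Each classifier's score tracks the true label with a different noise
# level, so the classifiers are informative but imperfect.
scores = np.column_stack(
    [labels + rng.normal(0.0, sigma, n) for sigma in (0.6, 0.8, 1.0)]
)

# Decision fusion: logistic regression learns one decision weight per
# classifier, relating the event type to the classifiers' outputs.
fuser = LogisticRegression().fit(scores, labels)
decision_weights = fuser.coef_[0]   # one weight per input classifier
fused_predictions = fuser.predict(scores)
```

In this formulation, a more reliable classifier (lower noise) naturally receives a larger decision weight, and adding a new feature source only means appending another score column — which is the sense in which the fusion scheme stays open to new sources.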
Communicated by C. Xu.
Güder, M., Çiçekli, N.K. Multi-modal video event recognition based on association rules and decision fusion. Multimedia Systems 24, 55–72 (2018). https://doi.org/10.1007/s00530-017-0535-z