Abstract
In this paper, we propose a multi-modal event recognition framework that integrates feature fusion, deep learning, scene classification, and decision fusion. The video decomposition process identifies frames, shots, and scenes, and events are modeled using the features of these physical video parts and the relations between them. Event modeling is achieved through visual concept learning, scene segmentation, and association rule mining. Visual concept learning is employed to bridge the semantic gap between the visual content and the textual descriptors of the events. Association rules are discovered by a specialized association rule mining algorithm in which the proposed strategy integrates temporality into the rule discovery process. In addition to frames, shots, and scenes, the concept of a scene segment is introduced to define and extract the elements of association rules. Diverse feature sources, such as audio, motion, keypoint descriptors, temporal occurrence characteristics, and the fully connected layer outputs of a CNN model, are combined through feature fusion. The proposed decision fusion approach employs logistic regression to express the relation between the dependent variable (the event type) and the independent variables (the classifiers' outputs) in terms of decision weights. Multi-modal fusion-based scene classifiers are employed for event recognition. Rule-based event modeling and multi-modal fusion are shown to be promising approaches to event recognition; the decision fusion results are encouraging, and the proposed algorithm remains open to the fusion of new feature sources and to the integration of new event types. The accuracy of the proposed methodology is evaluated on the CCV and Hollywood2 datasets for event recognition, and the results are compared with benchmark implementations from the literature.
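The decision fusion idea described above — logistic regression relating the event label to the individual classifiers' outputs, with the learned coefficients acting as decision weights — can be illustrated with a minimal sketch. The data here are synthetic and the three "classifiers" are stand-ins (e.g. audio-, motion-, and CNN-based scene classifiers); this is not the paper's implementation, only an assumption-laden toy of the fusion step.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic setup: 200 scenes with a binary event label, and confidence
# scores from three hypothetical classifiers (audio, motion, CNN).
rng = np.random.default_rng(0)
n = 200
labels = rng.integers(0, 2, n)

# Each classifier's score tracks the true label with a different noise
# level, so the classifiers are informative but imperfect.
scores = np.column_stack(
    [labels + rng.normal(0.0, sigma, n) for sigma in (0.6, 0.8, 1.0)]
)

# Decision fusion: logistic regression learns one decision weight per
# classifier, relating the event type to the classifiers' outputs.
fuser = LogisticRegression().fit(scores, labels)
decision_weights = fuser.coef_[0]   # one weight per input classifier
fused_predictions = fuser.predict(scores)
```

In this formulation, a more reliable classifier (lower noise) naturally receives a larger decision weight, and adding a new feature source only means appending another score column — which is the sense in which the fusion scheme stays open to new sources.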
Communicated by C. Xu.
Güder, M., Çiçekli, N.K. Multi-modal video event recognition based on association rules and decision fusion. Multimedia Systems 24, 55–72 (2018). https://doi.org/10.1007/s00530-017-0535-z