Multi-modal video event recognition based on association rules and decision fusion

  • Regular Paper
  • Published in Multimedia Systems 24, 55–72 (2018)

Abstract

In this paper, we propose a multi-modal event recognition framework based on the integration of feature fusion, deep learning, scene classification and decision fusion. Frames, shots, and scenes are identified through the video decomposition process. Events are modeled using the features of, and relations between, these physical video parts. Event modeling is achieved through visual concept learning, scene segmentation and association rule mining. Visual concept learning is employed to bridge the semantic gap between the visual content and the textual descriptors of the events. Association rules are discovered by a specialized association rule mining algorithm whose strategy integrates temporality into the rule discovery process. In addition to frames, shots and scenes, the concept of a scene segment is proposed to define and extract the elements of association rules. Various feature sources, such as audio, motion, keypoint descriptors, temporal occurrence characteristics and the fully connected layer outputs of a CNN model, are combined in the feature fusion step. The proposed decision fusion approach employs logistic regression to formulate the relation between the dependent variable (event type) and the independent variables (classifiers' outputs) in terms of decision weights. Multi-modal fusion-based scene classifiers are employed in event recognition. Rule-based event modeling and multi-modal fusion are shown to be promising approaches for event recognition. The decision fusion results are encouraging, and the proposed algorithm is open both to the fusion of new sources and to the integration of new event types for further improvement. The accuracy of the proposed methodology is evaluated on the CCV and Hollywood2 datasets, and the results are compared with benchmark implementations from the literature.
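To make the rule-mining step concrete, the following is a minimal sketch of how association rules might be extracted from the visual concepts detected in scene segments. The concept names, transactions and thresholds are invented for illustration, and the temporal extension the paper proposes is omitted: this is a plain Apriori-style pass, not the authors' specialized algorithm.

```python
"""Illustrative only: mine 1-item -> 1-item association rules from the
visual concepts observed in scene segments. Concepts and transactions
are made-up examples, not data from the paper."""
from itertools import combinations
from collections import Counter

# Each "transaction" is the set of concepts detected in one scene segment.
segments = [
    {"crowd", "music", "stage"},
    {"crowd", "music"},
    {"crowd", "stage"},
    {"dog", "grass"},
    {"crowd", "music", "stage"},
]

min_support, min_confidence = 0.4, 0.7
n = len(segments)

# Support counts for single concepts and for concept pairs.
single = Counter(c for s in segments for c in s)
pair = Counter(p for s in segments for p in combinations(sorted(s), 2))

for (a, b), count in pair.items():
    if count / n < min_support:
        continue  # pair is not frequent enough
    for lhs, rhs in ((a, b), (b, a)):
        conf = count / single[lhs]
        if conf >= min_confidence:
            print(f"{lhs} -> {rhs}  support={count/n:.2f}  confidence={conf:.2f}")
```

On this toy data the pass emits rules such as "music -> crowd" with confidence 1.0; the paper's contribution is to constrain such rules with the temporal ordering of scene segments, which a plain itemset miner like this one ignores.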
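The decision fusion step can likewise be sketched as stacked generalization: per-modality classifiers produce probabilistic outputs, and a logistic regression learns decision weights over them. Everything below (the synthetic data, the feature dimensions, the choice of SVM base classifiers) is an assumption made for illustration, not the authors' implementation.

```python
"""Hypothetical sketch of logistic-regression decision fusion over
per-modality classifier outputs, with synthetic stand-in data."""
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples = 600

# Assumed stand-ins for the feature sources named in the abstract
# (CNN fc-layer dimension reduced from a typical 4096-d for speed).
features = {
    "audio": rng.normal(size=(n_samples, 40)),
    "motion": rng.normal(size=(n_samples, 64)),
    "keypoints": rng.normal(size=(n_samples, 128)),
    "cnn_fc": rng.normal(size=(n_samples, 256)),
}
y = rng.integers(0, 2, size=n_samples)  # toy binary event label

idx_train, idx_test = train_test_split(np.arange(n_samples), random_state=0)

# One base classifier per modality; its probability output becomes an
# independent variable for the fusion model. In practice these outputs
# should come from held-out or cross-validated predictions to avoid
# overfitting the fusion stage.
outputs_train, outputs_test = [], []
for name, X in features.items():
    clf = SVC(probability=True, random_state=0).fit(X[idx_train], y[idx_train])
    outputs_train.append(clf.predict_proba(X[idx_train])[:, 1])
    outputs_test.append(clf.predict_proba(X[idx_test])[:, 1])

Z_train = np.column_stack(outputs_train)
Z_test = np.column_stack(outputs_test)

# Decision fusion: logistic regression relates the event label (dependent
# variable) to the classifiers' outputs; its coefficients act as the
# learned decision weights.
fusion = LogisticRegression().fit(Z_train, y[idx_train])
print("decision weights per modality:", dict(zip(features, fusion.coef_[0])))
print("fused accuracy:", fusion.score(Z_test, y[idx_test]))
```

Logistic regression is a natural choice for this stage because its coefficients are directly interpretable as per-classifier decision weights, and adding a new feature source only requires appending one more column of classifier outputs.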


Author information

Correspondence to Mennan Güder.

Communicated by C. Xu.


About this article


Cite this article

Güder, M., Çiçekli, N.K. Multi-modal video event recognition based on association rules and decision fusion. Multimedia Systems 24, 55–72 (2018). https://doi.org/10.1007/s00530-017-0535-z

