
Multi-label semantic concept detection in videos using fusion of asymmetrically trained deep convolutional neural networks and foreground driven concept co-occurrence matrix

Applied Intelligence

Abstract

Describing the visual content of videos with semantic concepts is an effective and practical approach for video applications such as annotation, indexing, retrieval and ranking. In these applications, video data must be labelled with a known set of labels or concepts. Assigning semantic concepts manually is infeasible given the large and ever-growing volume of video data, so automatic semantic concept detection in videos is an active research area. Deep convolutional neural networks (CNNs) have recently shown remarkable performance in computer vision tasks. In this paper, we present a novel approach to automatic semantic video concept detection using a deep CNN and a foreground driven concept co-occurrence matrix (FDCCM), which stores foreground-to-background concept co-occurrence values. The FDCCM is built by exploiting concept co-occurrence relationships in the pre-labelled TRECVID video dataset and in a collection of random images extracted from Google Images. To deal with the dataset imbalance problem, we extend this approach by fusing two asymmetrically trained deep CNNs and use the FDCCM to further improve concept detection. The proposed approach is compared with state-of-the-art video concept detection approaches on the widely used TRECVID dataset and is found to outperform them.
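The pipeline outlined in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual formulas, so everything below is an assumption made for illustration: the co-occurrence values are estimated as conditional frequencies of a background concept given a foreground concept, the two CNN outputs are fused by a weighted average, and background scores are refined by blending them with the co-occurrence value of the top-scoring foreground concept. The function names (`build_fdccm`, `fuse_scores`, `refine_with_fdccm`) and the parameters `weight_a` and `alpha` are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Sequence, Set, Tuple

Scores = Dict[str, float]
CoMatrix = Dict[Tuple[str, str], float]


def build_fdccm(annotations: List[Set[str]],
                foreground: Sequence[str],
                background: Sequence[str]) -> CoMatrix:
    """Estimate foreground-to-background co-occurrence values.

    ASSUMPTION: each value is the fraction of keyframes labelled with
    foreground concept f that are also labelled with background concept b.
    """
    fg_counts = {f: 0 for f in foreground}
    pair_counts = {(f, b): 0 for f in foreground for b in background}
    for labels in annotations:
        for f in foreground:
            if f in labels:
                fg_counts[f] += 1
                for b in background:
                    if b in labels:
                        pair_counts[(f, b)] += 1
    return {
        (f, b): (pair_counts[(f, b)] / fg_counts[f]) if fg_counts[f] else 0.0
        for f in foreground
        for b in background
    }


def fuse_scores(scores_a: Scores, scores_b: Scores,
                weight_a: float = 0.5) -> Scores:
    """Fuse per-concept scores of two networks by a weighted average.

    ASSUMPTION: a simple late fusion; the paper's fusion rule may differ.
    """
    return {c: weight_a * scores_a[c] + (1.0 - weight_a) * scores_b[c]
            for c in scores_a}


def refine_with_fdccm(scores: Scores, fdccm: CoMatrix,
                      foreground: Sequence[str],
                      background: Sequence[str],
                      alpha: float = 0.5) -> Scores:
    """Blend background scores with the co-occurrence values of the
    top-scoring foreground concept (ASSUMPTION: illustrative rule only)."""
    top_fg = max(foreground, key=lambda f: scores[f])
    refined = dict(scores)
    for b in background:
        refined[b] = (1.0 - alpha) * scores[b] + alpha * fdccm[(top_fg, b)]
    return refined
```

In this sketch, a keyframe whose strongest foreground concept is, say, a car would have its score for a frequently co-occurring background concept such as a road pulled upward, while rarely co-occurring background concepts would be pulled downward. The concept names here are illustrative, not drawn from the TRECVID concept list.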



Abbreviations

CCM: concept co-occurrence matrix

FDCCM: foreground driven concept co-occurrence matrix

CNN: convolutional neural network


Author information

Corresponding author

Correspondence to Nitin J. Janwe.

About this article

Cite this article

Janwe, N.J., Bhoyar, K.K. Multi-label semantic concept detection in videos using fusion of asymmetrically trained deep convolutional neural networks and foreground driven concept co-occurrence matrix. Appl Intell 48, 2047–2066 (2018). https://doi.org/10.1007/s10489-017-1033-x

DOI: https://doi.org/10.1007/s10489-017-1033-x