Abstract
Describing the visual content of videos with semantic concepts is an effective and practical approach for video applications such as annotation, indexing, retrieval and ranking. In these applications, video data must be labelled with concepts drawn from a known set. Assigning semantic concepts manually is not feasible because of the large and ever-growing volume of video data, so automatic semantic concept detection in videos is an active research area. Deep convolutional neural networks (CNNs) have recently shown remarkable performance on computer vision tasks. In this paper, we present a novel approach to automatic semantic video concept detection that uses a deep CNN together with a foreground driven concept co-occurrence matrix (FDCCM). The FDCCM stores foreground-to-background concept co-occurrence values and is built by exploiting concept co-occurrence relationships in the pre-labelled TRECVID video dataset and in a collection of random images extracted from Google Images. To deal with the dataset imbalance problem, we extend this approach by fusing two asymmetrically trained deep CNNs and use the FDCCM to further improve concept detection. The performance of the proposed approach is compared with state-of-the-art approaches to video concept detection on the widely used TRECVID dataset and is found to be superior to existing approaches.
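The abstract does not give the exact construction of the FDCCM or the fusion rule, so the following is only a minimal illustrative sketch of the general idea: estimate a concept co-occurrence matrix from multi-label annotations, fuse the scores of two classifiers, and adjust the fused scores with the co-occurrence values of a chosen foreground concept. The function names, the row normalisation, the weighted-average fusion and the blending parameter `alpha` are all assumptions made for illustration and are not taken from the paper.

```python
from typing import List


def cooccurrence_matrix(labels: List[List[int]]) -> List[List[float]]:
    """Row-normalised co-occurrence (an assumed normalisation).

    matrix[i][j] is the fraction of samples containing concept i
    that also contain concept j.
    """
    n = len(labels[0])
    counts = [[0] * n for _ in range(n)]
    for row in labels:
        present = [k for k, v in enumerate(row) if v]
        for i in present:
            for j in present:
                counts[i][j] += 1
    matrix = []
    for i in range(n):
        total = counts[i][i]  # number of samples that contain concept i
        matrix.append([counts[i][j] / total if total else 0.0 for j in range(n)])
    return matrix


def fuse_scores(scores_a: List[float], scores_b: List[float],
                weight_a: float = 0.5) -> List[float]:
    """Weighted average of two classifiers' per-concept scores (assumed fusion rule)."""
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(scores_a, scores_b)]


def refine_with_cooccurrence(scores: List[float], matrix: List[List[float]],
                             foreground: int, alpha: float = 0.5) -> List[float]:
    """Blend each non-foreground score with a co-occurrence prior.

    Hypothetical rule: the prior for concept j is the foreground score
    multiplied by matrix[foreground][j]. The foreground score is left unchanged.
    """
    refined = []
    for j, s in enumerate(scores):
        if j == foreground:
            refined.append(s)
        else:
            prior = scores[foreground] * matrix[foreground][j]
            refined.append((1.0 - alpha) * s + alpha * prior)
    return refined


if __name__ == "__main__":
    # Toy annotations: 4 samples, 3 concepts (1 = concept present).
    toy_labels = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [1, 1, 0]]
    ccm = cooccurrence_matrix(toy_labels)
    fused = fuse_scores([0.8, 0.2, 0.4], [0.6, 0.4, 0.2])
    print(refine_with_cooccurrence(fused, ccm, foreground=0))
```

In this sketch, a concept that frequently co-occurs with the foreground concept has its score pulled towards the foreground score, while a concept that never co-occurs with it is damped. How the foreground concept is selected, and how the two networks are trained asymmetrically, is left open here.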
Abbreviations
- CCM: concept co-occurrence matrix
- FDCCM: foreground driven concept co-occurrence matrix
- CNN: convolutional neural network
About this article
Cite this article
Janwe, N.J., Bhoyar, K.K. Multi-label semantic concept detection in videos using fusion of asymmetrically trained deep convolutional neural networks and foreground driven concept co-occurrence matrix. Appl Intell 48, 2047–2066 (2018). https://doi.org/10.1007/s10489-017-1033-x
DOI: https://doi.org/10.1007/s10489-017-1033-x