
Using semantic context for multiple concepts detection in still images

  • Theoretical Advances
  • Published in Pattern Analysis and Applications

Abstract

The performance of multimedia document indexing systems has improved significantly in recent years, especially since the adoption of deep learning approaches. However, this progress remains insufficient given the evolution of users' needs: queries have become more complex, both in their semantics and in the number of words they contain. It is therefore important to index images by groups of concepts simultaneously (multi-concepts) rather than by single concepts alone, allowing systems to better answer queries composed of several terms. This task is considerably more difficult than indexing images by single concepts, and multi-concept detection has received little attention in the state of the art compared to the detection of single visual concepts. On the other hand, the use of context has proved effective in the field of multimedia semantic indexing. In this work, we propose two approaches that exploit semantic context for multi-concept detection in still images. We tested and evaluated our proposal on the standard international Pascal VOC corpus for the detection of pairs and triplets of concepts. Our contributions show that context is useful and improves multi-concept detection in images. Combining semantic context with deep learning-based features yielded results well above the state of the art, with a relative gain in mean average precision reaching +70% for pairs of concepts and +34% for triplets of concepts.
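The abstract involves two mechanics worth unpacking: fusing single-concept detector scores with a semantic-context term to score a group of concepts, and reporting performance as a relative gain in mean average precision. The paper's two approaches are not reproduced here; the sketch below is only a minimal illustration of the general idea, in which the detector scores, the ground-truth labels, the co-occurrence context model, and the fusion weight `alpha` are all hypothetical stand-ins.

```python
import numpy as np

# Minimal sketch of context-aided pair detection (not the paper's method).
# Scores and labels are synthetic stand-ins: rows = images, columns =
# single concepts, as would be produced by per-concept classifiers.
rng = np.random.default_rng(0)
n_images, n_concepts = 200, 20
scores = rng.random((n_images, n_concepts))            # single-concept scores
labels = rng.random((n_images, n_concepts)) > 0.7      # synthetic ground truth

# Semantic context as a concept co-occurrence matrix estimated from labels:
# cooc[i, j] approximates P(concept j present | concept i present).
L = labels.astype(float)
cooc = (L.T @ L) / np.maximum(L.sum(axis=0), 1.0)[:, None]

def pair_score(img, i, j, alpha=0.5):
    """Fuse the two single-concept scores with a semantic-context term."""
    base = scores[img, i] * scores[img, j]
    context = 0.5 * (cooc[i, j] + cooc[j, i])
    return (1.0 - alpha) * base + alpha * base * context

def average_precision(relevance):
    """AP of a relevance array ordered by decreasing score."""
    total = relevance.sum()
    if total == 0:
        return 0.0
    hits = np.cumsum(relevance)
    ranks = np.arange(1, len(relevance) + 1)
    return float((hits[relevance] / ranks[relevance]).sum() / total)

# Rank images for the pair (0, 1) with and without context, then report
# the relative gain in AP, analogous to the paper's relative MAP gains.
pair_rel = labels[:, 0] & labels[:, 1]
with_ctx = np.array([pair_score(k, 0, 1) for k in range(n_images)])
baseline = scores[:, 0] * scores[:, 1]
ap_ctx = average_precision(pair_rel[np.argsort(-with_ctx)])
ap_base = average_precision(pair_rel[np.argsort(-baseline)])
if ap_base > 0:
    print(f"relative AP gain: {100.0 * (ap_ctx - ap_base) / ap_base:+.1f}%")
```

On synthetic random data the printed gain is of course meaningless; the sketch only shows the mechanics of context-based re-scoring and of the relative-gain computation, not the paper's results.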

Notes

  1. http://host.robots.ox.ac.uk/pascal/VOC/voc2012/. Last check: November 29th, 2017.

  2. https://www.tensorflow.org/. Last check: November 29th, 2017.

Author information

Correspondence to Abdelkader Hamadi.

About this article

Cite this article

Hamadi, A., Lattar, H., Khoussa, M.E.B. et al. Using semantic context for multiple concepts detection in still images. Pattern Anal Applic 23, 27–44 (2020). https://doi.org/10.1007/s10044-018-0761-9
