
Using deep features for video scene detection and annotation

  • Original Paper
  • Published in: Signal, Image and Video Processing

Abstract

The problem of semantic video indexing remains underexplored; solutions would significantly enrich video search, monitoring, and surveillance. This paper addresses scene detection and annotation, and specifically the task of video structure mining for video indexing using deep features. We propose and implement a pipeline consisting of feature extraction and filtering, shot clustering, and labeling stages, with a deep convolutional network serving as the source of features. The pipeline is evaluated with metrics for both scene detection and annotation, and the results show high quality on both tasks under several metrics. We also review and analyze contemporary segmentation and annotation metrics. The outcome of this work can be applied to semantic video annotation in real time.
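The shot-clustering and labeling stages of the pipeline described above can be sketched in simplified form as follows. This is an illustrative assumption, not the authors' implementation: the function names, the greedy cosine-similarity merging rule, and the majority-vote scene labeling are hypothetical stand-ins for the paper's actual clustering and annotation steps, and the feature vectors here are toy values rather than deep CNN activations.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def group_shots(shot_features, threshold=0.8):
    """Greedily merge temporally adjacent shots into one scene
    when their feature vectors are similar enough."""
    scenes = [[0]]
    for i in range(1, len(shot_features)):
        if cosine(shot_features[i - 1], shot_features[i]) >= threshold:
            scenes[-1].append(i)
        else:
            scenes.append([i])
    return scenes

def label_scene(shot_labels, scene):
    """Annotate a scene with the most frequent per-shot label."""
    return Counter(shot_labels[i] for i in scene).most_common(1)[0][0]

# Toy example: four shots forming two visually distinct scenes.
feats = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
labels = ["office", "office", "street", "street"]
scenes = group_shots(feats, threshold=0.8)
print(scenes)                                    # [[0, 1], [2, 3]]
print([label_scene(labels, s) for s in scenes])  # ['office', 'street']
```

In a real setting the per-shot vectors would come from a pretrained scene-recognition CNN (such as the Places model referenced in the notes below), and the threshold would be tuned on held-out data.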


Notes

  1. The trained model can be downloaded from http://places.csail.mit.edu/downloadCNN.html.

  2. Source code is available at https://bitbucket.org/compvisioniu/human-activity-recognition/src.


Author information

Corresponding author: Stanislav Protasov.


About this article


Cite this article

Protasov, S., Khan, A.M., Sozykin, K. et al. Using deep features for video scene detection and annotation. SIViP 12, 991–999 (2018). https://doi.org/10.1007/s11760-018-1244-6

