Abstract
The semantic video indexing problem remains underexplored. Solutions to it would significantly enrich video search, monitoring, and surveillance. This paper concerns scene detection and annotation, and specifically the task of video structure mining for video indexing using deep features. We propose and implement a pipeline consisting of feature extraction and filtering, shot clustering, and labeling stages, with a deep convolutional network serving as the source of features. The pipeline is evaluated with metrics for both scene detection and annotation and achieves high quality on both tasks. We also provide an overview and analysis of contemporary segmentation and annotation metrics. The outcome of this work can be applied to semantic video annotation in real time.
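The pipeline stages named in the abstract (per-shot deep features, shot clustering into scenes, scene labeling) can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's exact method: the thresholded cosine-distance grouping of adjacent shots and the majority-vote labeling are assumptions made for the example.

```python
import numpy as np

def segment_shots(features, threshold=0.5):
    """Group consecutive shots into scenes: start a new scene whenever the
    cosine distance between adjacent shot feature vectors exceeds threshold."""
    scenes = [[0]]
    for i in range(1, len(features)):
        a, b = features[i - 1], features[i]
        cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
        if 1.0 - cos > threshold:
            scenes.append([i])      # abrupt feature change: new scene
        else:
            scenes[-1].append(i)    # similar shots stay in the same scene
    return scenes

def label_scene(shot_labels, scene):
    """Label a scene by majority vote over its shots' predicted classes."""
    labels = [shot_labels[i] for i in scene]
    return max(set(labels), key=labels.count)

# Toy example: four shots in a 2-D feature space, two visually distinct groups.
feats = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.95]])
print(segment_shots(feats))  # two scenes: [[0, 1], [2, 3]]
print(label_scene(["office", "office", "street"], [0, 1, 2]))  # "office"
```

In the paper's actual setting the feature vectors would come from a scene-recognition CNN (e.g. one trained on the Places database referenced in the Notes), and the per-shot class predictions from the same network's output layer.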
Notes
The trained model can be downloaded from http://places.csail.mit.edu/downloadCNN.html.
Source code is available at https://bitbucket.org/compvisioniu/human-activity-recognition/src.
Cite this article
Protasov, S., Khan, A.M., Sozykin, K. et al. Using deep features for video scene detection and annotation. SIViP 12, 991–999 (2018). https://doi.org/10.1007/s11760-018-1244-6