Abstract
The semantic video indexing problem remains underexplored. Solutions to it would significantly enrich video search, monitoring, and surveillance. This paper concerns scene detection and annotation, and specifically the task of video structure mining for video indexing using deep features. We propose and implement a pipeline consisting of feature extraction and filtering, shot clustering, and labeling stages, with a deep convolutional network serving as the source of features. The pipeline is evaluated with metrics for both scene detection and annotation and achieves high quality on both tasks. We also provide an overview and analysis of contemporary segmentation and annotation metrics. The outcome of this work can be applied to semantic video annotation in real time.
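The pipeline stages named in the abstract (per-shot deep features, shot clustering into scenes, scene labeling) can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's exact method: the thresholded cosine-distance grouping of adjacent shots and the majority-vote labeling are assumptions made for the example.

```python
import numpy as np

def segment_shots(features, threshold=0.5):
    """Group consecutive shots into scenes: start a new scene whenever the
    cosine distance between adjacent shot feature vectors exceeds threshold."""
    scenes = [[0]]
    for i in range(1, len(features)):
        a, b = features[i - 1], features[i]
        cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
        if 1.0 - cos > threshold:
            scenes.append([i])      # abrupt feature change: new scene
        else:
            scenes[-1].append(i)    # similar shots stay in the same scene
    return scenes

def label_scene(shot_labels, scene):
    """Label a scene by majority vote over its shots' predicted classes."""
    labels = [shot_labels[i] for i in scene]
    return max(set(labels), key=labels.count)

# Toy example: four shots in a 2-D feature space, two visually distinct groups.
feats = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.95]])
print(segment_shots(feats))  # two scenes: [[0, 1], [2, 3]]
print(label_scene(["office", "office", "street"], [0, 1, 2]))  # "office"
```

In the paper's actual setting the feature vectors would come from a scene-recognition CNN (e.g. one trained on the Places database referenced in the Notes), and the per-shot class predictions from the same network's output layer.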
Notes
The trained model can be downloaded from http://places.csail.mit.edu/downloadCNN.html.
Source code is available at https://bitbucket.org/compvisioniu/human-activity-recognition/src.
Cite this article
Protasov, S., Khan, A.M., Sozykin, K. et al. Using deep features for video scene detection and annotation. SIViP 12, 991–999 (2018). https://doi.org/10.1007/s11760-018-1244-6