
Video Scene Segmentation Based on Triplet Loss Ranking

  • Conference paper
Advances in Computational Intelligence (IWANN 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14134)


Abstract

Scene segmentation is the task of dividing a video into groups of frames with a high degree of semantic similarity. In this paper, we contribute to the task of video scene segmentation with a novel dataset for temporal scene segmentation. In addition, we propose a combination of two deep models to classify whether two video frames belong to the same scene or to different scenes. The first model is a triplet network composed of three instances of the same 2D convolutional network; these instances form a multi-scale net that embeds frames efficiently according to their similarity, and the network is trained with an efficient triplet sampling algorithm. The second model, obtained by fine-tuning a Siamese network, classifies whether a pair of embeddings corresponds to frames from different scenes.
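The ranking objective behind the first model can be illustrated with a minimal sketch of the standard triplet margin loss: given an anchor frame embedding, an embedding from the same scene (positive), and one from a different scene (negative), the loss is zero once the negative is farther from the anchor than the positive by at least a margin. The function names and toy embeddings below are purely illustrative and do not reproduce the paper's multi-scale network or its sampling algorithm.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Triplet ranking loss: pull the anchor toward the positive
    (same scene) and push it away from the negative (different
    scene) until their distance gap is at least `margin`."""
    return max(0.0, euclidean(anchor, positive)
                    - euclidean(anchor, negative) + margin)

# Toy 2D embeddings: anchor and positive come from the same scene,
# the negative from a different one.
anchor   = [0.0, 0.0]
positive = [0.1, 0.0]   # close to the anchor -> small first term
negative = [3.0, 4.0]   # far from the anchor -> large second term

loss = triplet_margin_loss(anchor, positive, negative)
print(loss)  # 0.0: this triplet already satisfies the margin
```

During training, only triplets with a nonzero loss contribute gradients, which is why an efficient triplet sampling strategy (mining informative, hard triplets) matters as much as the loss itself.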



Acknowledgment

We would first like to thank Telefonica I+D for supporting the Industrial PhD of Miguel Esteve Brotons. We would also like to thank the "A way of making Europe" European Regional Development Fund (ERDF) and MCIN/AEI/10.13039/501100011033 for supporting this work under the TED2021-130890B (CHAN-TWIN) research project funded by MCIN/AEI/10.13039/501100011033 and European Union NextGenerationEU/PRTR, and the AICARE project (grant SPID202200X139779IV0), as well as the HORIZON-MSCA-2021-SE-0 action number 101086387, REMARKABLE (Rural Environmental Monitoring via ultra wide-ARea networKs And distriButed federated Learning). Finally, we would like to thank Nvidia for the generous hardware donations that made these experiments possible.

Author information


Corresponding author

Correspondence to José García-Rodríguez.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Esteve Brotons, M.J., Carmona Blanco, J., Lucendo, F.J., García-Rodríguez, J. (2023). Video Scene Segmentation Based on Triplet Loss Ranking. In: Rojas, I., Joya, G., Catala, A. (eds) Advances in Computational Intelligence. IWANN 2023. Lecture Notes in Computer Science, vol 14134. Springer, Cham. https://doi.org/10.1007/978-3-031-43085-5_24


  • DOI: https://doi.org/10.1007/978-3-031-43085-5_24

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-43084-8

  • Online ISBN: 978-3-031-43085-5

  • eBook Packages: Computer Science (R0)
