
Video Scene Segmentation Based on Triplet Loss Ranking

  • Conference paper
Advances in Computational Intelligence (IWANN 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14134)


Abstract

Scene segmentation is the task of dividing a video into groups of frames with a high degree of semantic similarity. In this paper, we contribute to the task of video scene segmentation with a novel dataset for temporal scene segmentation. In addition, we propose a combination of two deep models to classify whether two video frames belong to the same scene or to different scenes. The first model is a triplet network composed of three instances of the same 2D convolutional network; these instances form a multi-scale net that embeds frames efficiently according to their similarity, and the network is trained with an efficient triplet sampling algorithm. The second model, obtained by fine-tuning a Siamese network, classifies whether a pair of embeddings corresponds to frames from different scenes.
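The ranking objective behind the first model can be illustrated with a minimal sketch of the standard triplet margin loss: given an anchor frame embedding, an embedding from the same scene (positive), and one from a different scene (negative), the loss is zero once the negative is farther from the anchor than the positive by at least a margin. The function names and toy embeddings below are purely illustrative and do not reproduce the paper's multi-scale network or its sampling algorithm.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Triplet ranking loss: pull the anchor toward the positive
    (same scene) and push it away from the negative (different
    scene) until their distance gap is at least `margin`."""
    return max(0.0, euclidean(anchor, positive)
                    - euclidean(anchor, negative) + margin)

# Toy 2D embeddings: anchor and positive come from the same scene,
# the negative from a different one.
anchor   = [0.0, 0.0]
positive = [0.1, 0.0]   # close to the anchor -> small first term
negative = [3.0, 4.0]   # far from the anchor -> large second term

loss = triplet_margin_loss(anchor, positive, negative)
print(loss)  # 0.0: this triplet already satisfies the margin
```

During training, only triplets with a nonzero loss contribute gradients, which is why an efficient triplet sampling strategy (mining informative, hard triplets) matters as much as the loss itself.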



Acknowledgment

We would first like to thank Telefonica I+D for supporting the Industrial PhD of Miguel Esteve Brotons. We would also like to thank the "A way of making Europe" European Regional Development Fund (ERDF) and MCIN/AEI/10.13039/501100011033 for supporting this work under the TED2021-130890B (CHAN-TWIN) research project funded by MCIN/AEI/10.13039/501100011033 and European Union NextGenerationEU/PRTR, and the AICARE project (grant SPID202200X139779IV0), as well as the HORIZON-MSCA-2021-SE-0 action number 101086387, REMARKABLE (Rural Environmental Monitoring via ultra wide-ARea networKs And distriButed federated Learning). Finally, we would like to thank Nvidia for the generous hardware donations that made these experiments possible.

Author information


Corresponding author

Correspondence to José García-Rodríguez.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Esteve Brotons, M.J., Carmona Blanco, J., Lucendo, F.J., García-Rodríguez, J. (2023). Video Scene Segmentation Based on Triplet Loss Ranking. In: Rojas, I., Joya, G., Catala, A. (eds) Advances in Computational Intelligence. IWANN 2023. Lecture Notes in Computer Science, vol 14134. Springer, Cham. https://doi.org/10.1007/978-3-031-43085-5_24


  • DOI: https://doi.org/10.1007/978-3-031-43085-5_24

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-43084-8

  • Online ISBN: 978-3-031-43085-5

  • eBook Packages: Computer Science (R0)
