Unsupervised Learning of Visual Representations via Rotation and Future Frame Prediction for Video Retrieval

  • Conference paper
  • Advances in Computing and Data Sciences (ICACDS 2021)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1440)
Abstract

Due to rapid technological advancements, the number of videos uploaded to the internet has grown exponentially. Most of these videos carry no semantic tags, which makes indexing and retrieval challenging and calls for effective content-based analysis techniques. Supervised representation learning from large-scale labeled datasets has demonstrated great success in the image domain; however, creating a comparably large labeled database for videos is expensive and time-consuming. To this end, we propose an unsupervised visual representation learning framework that learns spatiotemporal features by exploiting two pretext tasks, i.e., rotation prediction and future frame prediction. The quality of the learned features is analyzed through a nearest-neighbor task (video retrieval), for which we experiment on the UCF-101 dataset. The experimental results show that our method achieves competitive performance.
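The two components named in the abstract can be illustrated with a minimal sketch. The function and parameter names below are hypothetical, not from the paper: `make_rotation_batch` produces the free supervisory labels used by a rotation-prediction pretext task (each clip rotated by a multiple of 90°, with the rotation index as the label), and `retrieve` performs the nearest-neighbor video retrieval step by ranking gallery embeddings by cosine similarity to a query embedding.

```python
import numpy as np

def make_rotation_batch(clip):
    """Rotation-prediction pretext data: return the four spatially rotated
    copies of `clip` (frames x H x W) with rotation indices 0..3 as labels."""
    rotations = [np.rot90(clip, k=k, axes=(1, 2)) for k in range(4)]
    labels = np.arange(4)
    return rotations, labels

def retrieve(query, gallery, top_k=5):
    """Nearest-neighbor retrieval over learned features: rank gallery rows
    by cosine similarity to the query embedding, highest first."""
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = g @ q
    return np.argsort(-sims)[:top_k]
```

In practice the embeddings passed to `retrieve` would come from the network trained on the pretext tasks; the sketch only shows the label construction and the ranking, not the model itself.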




Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Kumar, V., Tripathi, V., Pant, B. (2021). Unsupervised Learning of Visual Representations via Rotation and Future Frame Prediction for Video Retrieval. In: Singh, M., Tyagi, V., Gupta, P.K., Flusser, J., Ören, T., Sonawane, V.R. (eds) Advances in Computing and Data Sciences. ICACDS 2021. Communications in Computer and Information Science, vol 1440. Springer, Cham. https://doi.org/10.1007/978-3-030-81462-5_61

  • DOI: https://doi.org/10.1007/978-3-030-81462-5_61

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-81461-8

  • Online ISBN: 978-3-030-81462-5

  • eBook Packages: Computer Science (R0)
