Abstract
Due to rapid technological advancements, the number of videos uploaded to the internet has grown exponentially. Most of these videos carry no semantic tags, which makes indexing and retrieval challenging and calls for effective content-based analysis techniques. Meanwhile, supervised representation learning from large-scale labeled datasets has demonstrated great success in the image domain. However, creating such a large-scale labeled dataset for videos is expensive and time-consuming. To this end, we propose an unsupervised visual representation learning framework that learns spatiotemporal features by exploiting two pretext tasks, i.e., rotation prediction and future frame prediction. The quality of the learned features is analyzed via a nearest-neighbor task (video retrieval), for which we experiment on the UCF-101 dataset. The experimental results show that our method achieves competitive performance.
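The rotation-prediction pretext task mentioned above can be illustrated with a minimal sketch: each training clip is rotated by 0°, 90°, 180°, or 270°, and the network is trained to classify which rotation was applied, so no manual labels are needed. The helper below is a hypothetical illustration (the paper's actual pipeline and function names are not given here), assuming square frames stored as a NumPy array of shape (T, H, W, C).

```python
import numpy as np

def make_rotation_batch(clip):
    """Build self-supervised rotation-prediction samples from one clip.

    clip: array of shape (T, H, W, C) -- a short stack of square video frames.
    Returns (inputs, labels): the clip rotated by 0/90/180/270 degrees,
    with the rotation index serving as the free classification target.
    """
    rotations = [0, 1, 2, 3]  # multiples of 90 degrees
    # Rotate every frame in the spatial (H, W) plane; temporal order is untouched.
    inputs = np.stack([np.rot90(clip, k=k, axes=(1, 2)) for k in rotations])
    labels = np.array(rotations)
    return inputs, labels

# Example: a dummy 8-frame, 16x16 RGB clip yields 4 labeled training samples.
clip = np.random.rand(8, 16, 16, 3)
x, y = make_rotation_batch(clip)
print(x.shape, y.tolist())  # (4, 8, 16, 16, 3) [0, 1, 2, 3]
```

For retrieval evaluation, the trained network's features would then be compared by nearest-neighbor search: a query clip's feature vector is matched against the features of all database clips, and the closest clips are returned.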
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Kumar, V., Tripathi, V., Pant, B. (2021). Unsupervised Learning of Visual Representations via Rotation and Future Frame Prediction for Video Retrieval. In: Singh, M., Tyagi, V., Gupta, P.K., Flusser, J., Ören, T., Sonawane, V.R. (eds) Advances in Computing and Data Sciences. ICACDS 2021. Communications in Computer and Information Science, vol 1440. Springer, Cham. https://doi.org/10.1007/978-3-030-81462-5_61
Print ISBN: 978-3-030-81461-8
Online ISBN: 978-3-030-81462-5