Abstract
Due to rapid technological advancements, the number of videos uploaded to the internet has grown exponentially. Most of these videos carry no semantic tags, which makes indexing and retrieval challenging and calls for effective content-based analysis techniques. Meanwhile, supervised representation learning from large-scale labeled datasets has demonstrated great success in the image domain. However, creating such a large-scale labeled dataset for videos is expensive and time-consuming. To this end, we propose an unsupervised visual representation learning framework that learns spatiotemporal features by exploiting two pretext tasks, i.e., rotation prediction and future frame prediction. The quality of the learned features is analyzed via a nearest-neighbor task (video retrieval), for which we experiment on the UCF-101 dataset. The experimental results show that our method achieves competitive performance.
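The rotation-prediction pretext task mentioned above can be illustrated with a minimal sketch: each training clip is rotated by 0°, 90°, 180°, or 270°, and the network is trained to classify which rotation was applied, so no manual labels are needed. The helper below is a hypothetical illustration (the paper's actual pipeline and function names are not given here), assuming square frames stored as a NumPy array of shape (T, H, W, C).

```python
import numpy as np

def make_rotation_batch(clip):
    """Build self-supervised rotation-prediction samples from one clip.

    clip: array of shape (T, H, W, C) -- a short stack of square video frames.
    Returns (inputs, labels): the clip rotated by 0/90/180/270 degrees,
    with the rotation index serving as the free classification target.
    """
    rotations = [0, 1, 2, 3]  # multiples of 90 degrees
    # Rotate every frame in the spatial (H, W) plane; temporal order is untouched.
    inputs = np.stack([np.rot90(clip, k=k, axes=(1, 2)) for k in rotations])
    labels = np.array(rotations)
    return inputs, labels

# Example: a dummy 8-frame, 16x16 RGB clip yields 4 labeled training samples.
clip = np.random.rand(8, 16, 16, 3)
x, y = make_rotation_batch(clip)
print(x.shape, y.tolist())  # (4, 8, 16, 16, 3) [0, 1, 2, 3]
```

For retrieval evaluation, the trained network's features would then be compared by nearest-neighbor search: a query clip's feature vector is matched against the features of all database clips, and the closest clips are returned.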
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Kumar, V., Tripathi, V., Pant, B. (2021). Unsupervised Learning of Visual Representations via Rotation and Future Frame Prediction for Video Retrieval. In: Singh, M., Tyagi, V., Gupta, P.K., Flusser, J., Ören, T., Sonawane, V.R. (eds) Advances in Computing and Data Sciences. ICACDS 2021. Communications in Computer and Information Science, vol 1440. Springer, Cham. https://doi.org/10.1007/978-3-030-81462-5_61
Print ISBN: 978-3-030-81461-8
Online ISBN: 978-3-030-81462-5