DOI: 10.1145/3477495.3532010
research-article

Learn from Unlabeled Videos for Near-duplicate Video Retrieval

Published: 07 July 2022

Abstract

Near-duplicate video retrieval (NDVR) aims to find copies or transformations of a query video in a massive video database. It plays an important role in many video-related applications, including copyright protection, tracing, and filtering. Video representation and similarity search are crucial to any video retrieval system. To derive effective video representations, most video retrieval systems require a large amount of manually annotated training data, which is costly and inefficient. In addition, most retrieval systems rely on frame-level features for video similarity search, which is expensive in both storage and search. To address these shortcomings, we propose a video representation learning (VRL) approach. It first learns video representations from unlabeled videos via contrastive learning, avoiding the cost of manual annotation. It then uses a transformer structure to aggregate frame-level features into clip-level features, reducing both storage space and search complexity. The model learns complementary and discriminative information from the interactions among clip frames, and it acquires invariance to frame permutation and missing frames, which supports more flexible retrieval. Comprehensive experiments on two challenging near-duplicate video retrieval datasets, FIVR-200K and SVD, verify the effectiveness of the proposed VRL approach, which achieves the best retrieval performance in both accuracy and efficiency.
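The two ideas in the abstract, aggregating frame-level features into one clip-level vector and training with a contrastive objective on unlabeled clips, can be illustrated with a small NumPy sketch. This is not the paper's implementation. The function names (`aggregate_clip`, `info_nce`), the single attention layer with mean pooling, the random projection weights, the additive-noise "augmentation", and the temperature of 0.1 are all assumptions made for illustration. The abstract does not specify the loss, so an InfoNCE-style loss is used here as a common choice for contrastive learning.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit L2 norm along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def aggregate_clip(frames, w_q, w_k, w_v):
    """Pool frame features (T, D) into one unit-norm clip vector (D,).

    One self-attention layer WITHOUT positional encoding, followed by
    mean pooling. A simplified stand-in, not the paper's transformer.
    """
    q, k, v = frames @ w_q, frames @ w_k, frames @ w_v
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]), axis=-1)  # (T, T)
    return l2_normalize((attn @ v).mean(axis=0))

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE-style loss over unit-norm clip vectors of shape (B, D).

    Row i of `positives` is the positive for row i of `anchors`;
    the other rows in the batch act as negatives.
    """
    logits = anchors @ positives.T / temperature             # (B, B)
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

# Toy data (hypothetical): random vectors stand in for CNN frame features.
rng = np.random.default_rng(0)
T, D, B = 8, 16, 4  # frames per clip, feature dimension, batch size
w_q, w_k, w_v = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))
clips = rng.normal(size=(B, T, D))
views = clips + 0.01 * rng.normal(size=clips.shape)  # toy "augmented" views

anchors = np.stack([aggregate_clip(c, w_q, w_k, w_v) for c in clips])
positives = np.stack([aggregate_clip(v, w_q, w_k, w_v) for v in views])
print("InfoNCE loss:", info_nce(anchors, positives))

# Shuffling the frames of a clip leaves its clip-level vector unchanged.
shuffled = clips[0][rng.permutation(T)]
same = np.allclose(anchors[0], aggregate_clip(shuffled, w_q, w_k, w_v))
print("permutation invariant:", same)
```

In this sketch each clip of T frames is stored as a single D-dimensional vector rather than T of them, which is the source of the storage and search savings the abstract describes. Permutation invariance holds here because attention without positional encoding is permutation-equivariant and mean pooling discards order; whether the paper's model obtains it the same way is not stated in the abstract.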

Published In

SIGIR '22: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2022
3569 pages
ISBN: 9781450387323
DOI: 10.1145/3477495
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. near-duplicate video retrieval
  2. similarity search
  3. video representation learning

Qualifiers

  • Research-article

Conference

SIGIR '22

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Article Metrics

  • Downloads (Last 12 months): 60
  • Downloads (Last 6 weeks): 12
Reflects downloads up to 30 Jan 2025

Cited By

  • (2025) Extremely compact video representation for efficient near-duplicates detection. Pattern Recognition 158 (111016). DOI: 10.1016/j.patcog.2024.111016. Online publication date: Feb-2025
  • (2025) Balancing Efficiency and Accuracy: An Analysis of Sampling for Video Copy Detection. MultiMedia Modeling (111-124). DOI: 10.1007/978-981-96-2054-8_9. Online publication date: 3-Jan-2025
  • (2024) Not All Pairs are Equal: Hierarchical Learning for Average-Precision-Oriented Video Retrieval. Proceedings of the 32nd ACM International Conference on Multimedia (3828-3837). DOI: 10.1145/3664647.3681110. Online publication date: 28-Oct-2024
  • (2024) A Survey on Self-Supervised Learning: Algorithms, Applications, and Future Trends. IEEE Transactions on Pattern Analysis and Machine Intelligence 46:12 (9052-9071). DOI: 10.1109/TPAMI.2024.3415112. Online publication date: Dec-2024
  • (2024) DRM-SN: Detecting Reused Multimedia Content on Social Networks. 2024 IEEE 7th International Conference on Multimedia Information Processing and Retrieval (MIPR) (169-175). DOI: 10.1109/MIPR62202.2024.00033. Online publication date: 7-Aug-2024
  • (2024) Differentiable Resolution Compression and Alignment for Efficient Video Classification and Retrieval. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (3200-3204). DOI: 10.1109/ICASSP48485.2024.10446442. Online publication date: 14-Apr-2024
  • (2024) The 2023 video similarity dataset and challenge. Computer Vision and Image Understanding 243:C. DOI: 10.1016/j.cviu.2024.103997. Online publication date: 1-Jun-2024
  • (2024) Similarity-based ranking of videos from fixed-size one-dimensional video signature. Discover Computing 27:1. DOI: 10.1007/s10791-024-09459-0. Online publication date: 14-Aug-2024
  • (2024) RaSTFormer: region-aware spatiotemporal transformer for visual homogenization recognition in short videos. Neural Computing and Applications 36:18 (10713-10732). DOI: 10.1007/s00521-024-09633-x. Online publication date: 27-Mar-2024
  • (2023) A Near-Duplicate Video Cleaning Method Based on AFENet Adaptive Clustering. 2023 8th International Conference on Intelligent Computing and Signal Processing (ICSP) (689-695). DOI: 10.1109/ICSP58490.2023.10248727. Online publication date: 21-Apr-2023
