DOI:10.1145/3628797.3628950
research-article

DoppelSearch: A Novel Approach to Content-Based Video Retrieval for AI Challenge HCMC 2023

Published: 07 December 2023

Abstract

Video retrieval has recently become a critical task in computer vision and pattern recognition, with applications in education, entertainment, security, and healthcare. It remains challenging because of the complexity of video data, the instability of feature extraction methods, and the semantic disparities between videos and text. In this paper, we present DoppelSearch, a novel approach to content-based video retrieval that leverages the CLIP (Contrastive Language-Image Pre-training) model architecture to classify and label video segments, so that users can search for videos by specific content. Our method uses the ViT-B/32 model for feature extraction and combines the resulting feature embeddings with the Faiss library to improve search efficiency. Experimental results show that our model achieves high accuracy and fast retrieval times, opening new opportunities in content-based video retrieval for researchers, developers, and end-users. Beyond introducing the application of the CLIP and ViT-B/32 models, this paper details the feature extraction process and the use of Faiss to optimize video retrieval. DoppelSearch represents a significant step forward in video retrieval and holds promise for applications across various industries.
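
The retrieval step described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the abstract does not state which Faiss index type, frame sampling scheme, or similarity measure DoppelSearch uses, so cosine similarity over L2-normalized vectors is an assumption here. The random vectors stand in for CLIP ViT-B/32 embeddings (which are 512-dimensional), and the names `normalize`, `search`, and `frame_index` are hypothetical. An exact inner-product search of this kind is what Faiss's `IndexFlatIP` performs at much larger scale.

```python
import math
import random

DIM = 512  # embedding width of CLIP ViT-B/32


def normalize(vector):
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def search(index, query, k):
    """Exact top-k search by inner product; returns (frame_id, score) pairs."""
    scored = [
        (sum(a * b for a, b in zip(vector, query)), frame_id)
        for frame_id, vector in enumerate(index)
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [(frame_id, score) for score, frame_id in scored[:k]]


random.seed(0)

# Stand-in for image embeddings of 200 video keyframes (assumed, not real CLIP output).
frame_index = [
    normalize([random.gauss(0.0, 1.0) for _ in range(DIM)]) for _ in range(200)
]

# Stand-in for a text-query embedding that lies close to keyframe 42.
query = normalize([x + 0.01 * random.gauss(0.0, 1.0) for x in frame_index[42]])

results = search(frame_index, query, k=5)
for frame_id, score in results:
    print(f"frame {frame_id}: similarity {score:.3f}")
```

In a real pipeline, the stand-in vectors would be replaced by embeddings from the CLIP image and text encoders, and the linear scan by a Faiss index so that query time stays low as the number of indexed frames grows.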



Information

Published In

SOICT '23: Proceedings of the 12th International Symposium on Information and Communication Technology
December 2023
1058 pages
ISBN:9798400708916
DOI:10.1145/3628797
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. CLIP model
  2. image-text retrieval
  3. in-video information search
  4. lifelog events
  5. search engine
  6. searching speed optimization
  7. user experience enhancement

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SOICT 2023

Acceptance Rates

Overall Acceptance Rate 147 of 318 submissions, 46%


Bibliometrics

Article Metrics

  • Total Citations: 0
  • Total Downloads: 38
  • Downloads (Last 12 months): 24
  • Downloads (Last 6 weeks): 3

Reflects downloads up to 01 Mar 2025
