DoppelSearch: A Novel Approach to Content-Based Video Retrieval for AI Challenge HCMC 2023

Authors:

Phong Phan Nguyen Huu,

Khoa Tran Dinh,

Ngan Tran Kim Ngoc,

Luong Tran Duc,

Quyen Nguyen Huu,

Van-Hau Pham

SOICT '23: Proceedings of the 12th International Symposium on Information and Communication Technology

Pages 916 - 922

https://doi.org/10.1145/3628797.3628950

Published: 07 December 2023

Abstract

Video retrieval has recently become a critical task in computer vision and pattern recognition, with applications in education, entertainment, security, and healthcare. However, it remains challenging because of the complexity of video data, the instability of feature extraction methods, and the semantic gap between videos and text. In this paper, we present DoppelSearch, a novel approach to content-based video retrieval that leverages the CLIP (Contrastive Language-Image Pre-training) model architecture to classify and label video segments, allowing users to search for videos by specific content. Our method uses the ViT-B/32 model for feature extraction and combines the resulting feature embeddings with the Faiss library to improve search efficiency. Experimental results demonstrate our model’s high accuracy and fast retrieval times, opening new opportunities in content-based video retrieval for researchers, developers, and end users. Beyond introducing the application of the CLIP and ViT-B/32 models, this paper details the feature extraction process and the use of Faiss to optimize video retrieval. DoppelSearch represents a significant step forward in video retrieval and holds promise for applications across various industries.
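
The retrieval step described in the abstract, matching a text query against stored frame embeddings, can be illustrated with a minimal sketch. This is not the authors' implementation: it uses plain Python in place of Faiss, and the 4-dimensional vectors are made-up toy values standing in for CLIP embeddings (real CLIP ViT-B/32 embeddings have 512 dimensions). The function names `l2_normalize` and `search` are hypothetical. The sketch shows exact cosine-similarity search, which is what an exact inner-product index computes over unit-length vectors.

```python
import math
from typing import List, Sequence, Tuple


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def search(
    query: Sequence[float],
    frame_embeddings: Sequence[Sequence[float]],
    k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (frame index, cosine similarity) for the k best-matching frames."""
    q = l2_normalize(query)
    scored = []
    for idx, emb in enumerate(frame_embeddings):
        e = l2_normalize(emb)
        scored.append((idx, sum(a * b for a, b in zip(q, e))))
    # Highest similarity first; brute force over every stored frame.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings (hypothetical values, not real CLIP outputs).
frames = [
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.7, 0.7, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
text_query = [1.0, 0.0, 0.0, 0.0]
top = search(text_query, frames, k=2)
```

In a real system the embeddings would come from the CLIP image and text encoders, and a library such as Faiss would replace the brute-force loop to keep retrieval fast on large keyframe collections.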

Index Terms

1. Human-centered computing
  1. Human computer interaction (HCI)
    1. Interactive systems and tools
2. Information systems
  1. Information retrieval
    1. Users and interactive retrieval
      1. Search interfaces
  2. Information systems applications
    1. Multimedia information systems
      1. Multimedia databases

Published In

SOICT '23: Proceedings of the 12th International Symposium on Information and Communication Technology

December 2023

1058 pages

ISBN:9798400708916

DOI:10.1145/3628797

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

SOICT 2023

SOICT 2023: The 12th International Symposium on Information and Communication Technology

December 7 - 8, 2023

Ho Chi Minh, Vietnam

Acceptance Rates

Overall Acceptance Rate 147 of 318 submissions, 46%
