
Enhancing Video Retrieval with Robust CLIP-Based Multimodal System

Authors:
Minh-Dung Le-Quynh, Lazada Vietnam, Viet Nam (ORCID 0009-0004-0972-4109)
Anh-Tuan Nguyen, University of Science, Viet Nam (ORCID 0009-0004-8382-1206)
Anh-Tuan Quang-Hoang, Ford Motor, United States (ORCID 0009-0006-0209-9288)
Van-Huy Dinh, HUTECH University, Viet Nam (ORCID 0009-0004-1374-5236)
Tien-Huy Nguyen, University of Information Technology, Viet Nam (ORCID 0009-0000-0196-6083)
Hoang-Bach Ngo, University of Science, Viet Nam (ORCID 0009-0002-2290-1187)
Minh-Hung An, FPT Telecom, Vietnam (ORCID 0009-0001-0394-4731)

SOICT '23: Proceedings of the 12th International Symposium on Information and Communication Technology, December 2023, Pages 972–979
https://doi.org/10.1145/3628797.3629011

Published: 07 December 2023

ABSTRACT

In the rapidly evolving landscape of multimedia data, efficient content-based video retrieval has become increasingly vital. To address this need, we introduce an interactive video retrieval system designed to search vast online video collections efficiently. Our solution combines rich textual-to-visual descriptions, advanced human detection capabilities, and a novel Sketch-Text retrieval mechanism, making the search process comprehensive and precise. At its core, the system leverages the Contrastive Language-Image Pretraining (CLIP) model, which is known for bridging the gap between visual and textual data. Our user-friendly web application lets users create queries, explore top results, find similar images, preview short video clips, and select and export pertinent data, improving the effectiveness and accessibility of content-based video retrieval.
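
The abstract does not spell out how queries are matched to video frames. As a minimal sketch of the general CLIP-style idea, assuming text queries and keyframes have already been mapped into a shared embedding space, ranking can be done by cosine similarity. Everything below is illustrative rather than the authors' implementation: the function names `l2_normalize` and `rank_frames`, the frame identifiers, and the toy 3-dimensional vectors are hypothetical stand-ins for real CLIP embeddings.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def rank_frames(query_embedding, frame_embeddings, top_k=3):
    """Return (frame_id, cosine similarity) pairs, best match first."""
    q = l2_normalize(query_embedding)
    scored = []
    for frame_id, emb in frame_embeddings.items():
        f = l2_normalize(emb)
        scored.append((frame_id, sum(a * b for a, b in zip(q, f))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy embeddings; a real system would obtain these from
# an image encoder (keyframes) and a text encoder (the query).
frames = {
    "video1_frame010": [0.9, 0.1, 0.0],
    "video1_frame250": [0.0, 1.0, 0.0],
    "video2_frame042": [0.7, 0.7, 0.1],
}
query = [1.0, 0.0, 0.0]

results = rank_frames(query, frames, top_k=2)
for frame_id, score in results:
    print(f"{frame_id}: {score:.3f}")
```

In this sketch the exhaustive loop stands in for whatever index a production system would use; for large collections an approximate nearest-neighbor index would normally replace it.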

Index Terms

1. Information systems
  1. Information retrieval
    1. Retrieval models and ranking
      1. Information retrieval diversity

Published in

SOICT '23: Proceedings of the 12th International Symposium on Information and Communication Technology
December 2023
1058 pages
ISBN:9798400708916
DOI:10.1145/3628797

Copyright © 2023 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
interactive video retrieval
multimodal retrieval
sketch-based image retrieval
text-based image retrieval
Qualifiers
- research-article
- Research
- Refereed limited
Acceptance Rates
Overall Acceptance Rate: 147 of 318 submissions, 46%