ABSTRACT
As multimedia collections grow, efficient content-based video retrieval becomes increasingly important. We introduce an interactive video retrieval system designed to search vast online video collections efficiently. The system combines rich textual-to-visual descriptions, human detection, and a novel Sketch-Text retrieval mechanism, making the search process both comprehensive and precise. At its core is the Contrastive Language-Image Pretraining (CLIP) model, which bridges the gap between visual and textual data. A web application lets users create queries, explore top results, find similar images, preview short video clips, and select and export relevant data, making content-based video retrieval more effective and accessible.
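The abstract does not spell out how queries are matched against video content, so the following is only a minimal sketch of the general idea behind CLIP-style retrieval: text and images are embedded in a shared vector space, and keyframes are ranked by cosine similarity to the query. The vectors, frame identifiers, and the helper names `cosine` and `top_k` are hypothetical stand-ins chosen for illustration; a real system would obtain embeddings from a CLIP encoder and would likely use an approximate nearest-neighbour index rather than a linear scan.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k(query_vec, frame_index, k=3):
    """Return the k (frame_id, score) pairs most similar to the query, best first."""
    scored = [(frame_id, cosine(query_vec, vec)) for frame_id, vec in frame_index.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]

# Toy 3-dimensional vectors standing in for keyframe embeddings (assumption:
# real CLIP embeddings have hundreds of dimensions and come from an image encoder).
frame_index = {
    "video01_frame0042": [0.9, 0.1, 0.0],
    "video01_frame0108": [0.1, 0.9, 0.1],
    "video02_frame0007": [0.7, 0.3, 0.1],
}

# Stand-in for the embedding of a text query (assumption: produced by a text encoder).
query_vec = [1.0, 0.0, 0.0]

results = top_k(query_vec, frame_index, k=2)
for frame_id, score in results:
    print(f"{frame_id}: {score:.3f}")
```

Because scoring is a plain similarity in a shared space, the same ranking step could in principle serve text queries, "find similar images" queries, or a combined sketch and text query, depending on which encoder produces `query_vec`.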
Index Terms
- Enhancing Video Retrieval with Robust CLIP-Based Multimodal System