DOI: 10.1145/3628797.3629011
research-article

Enhancing Video Retrieval with Robust CLIP-Based Multimodal System

Published: 07 December 2023

ABSTRACT

As multimedia data grows rapidly, efficient content-based video retrieval has become increasingly important. To address this need, we introduce an interactive video retrieval system designed to search vast online video collections efficiently. Our solution combines rich textual-to-visual descriptions, human detection, and a novel Sketch-Text retrieval mechanism, making the search process comprehensive and precise. At its core, the system uses the Contrastive Language-Image Pretraining (CLIP) model, which is known for bridging the gap between visual and textual data. Our web application lets users create queries, explore top results, find similar images, preview short video clips, and select and export relevant data, making content-based video retrieval more effective and accessible.
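To illustrate the general idea behind CLIP-style retrieval, the sketch below ranks keyframes by cosine similarity between a query embedding and precomputed frame embeddings. It is not the authors' implementation: the function names (`normalize`, `rank_frames`), the frame identifiers, and the 3-dimensional toy vectors are all hypothetical stand-ins. A real system would obtain the embeddings from CLIP's text and image encoders.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def rank_frames(query_embedding, frame_embeddings, top_k=3):
    """Return the top_k (frame_id, score) pairs, highest cosine similarity first."""
    q = normalize(query_embedding)
    scored = []
    for frame_id, emb in frame_embeddings.items():
        f = normalize(emb)
        score = sum(a * b for a, b in zip(q, f))
        scored.append((frame_id, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for real CLIP image embeddings (hypothetical data).
frames = {
    "video1_frame010": [0.9, 0.1, 0.0],
    "video1_frame250": [0.1, 0.9, 0.1],
    "video2_frame042": [0.7, 0.6, 0.1],
}

# Stand-in for the embedding a text encoder would produce for the user's query.
text_query = [1.0, 0.2, 0.0]

results = rank_frames(text_query, frames, top_k=2)
for frame_id, score in results:
    print(f"{frame_id}: {score:.3f}")
```

Because the text and image embeddings are assumed to share one space, the same ranking function could also serve a "find similar images" query by passing a frame's embedding as the query vector.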

    • Published in

      SOICT '23: Proceedings of the 12th International Symposium on Information and Communication Technology
      December 2023
      1058 pages
      ISBN:9798400708916
      DOI:10.1145/3628797

      Copyright © 2023 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Qualifiers

      • research-article
      • Research
      • Refereed limited

      Acceptance Rates

      Overall Acceptance Rate: 147 of 318 submissions, 46%