Efficient Video Retrieval with Advanced Deep Learning Models

Authors:

Tuấn Mạnh Hùng Võ

SOICT '23: Proceedings of the 12th International Symposium on Information and Communication Technology

Pages 945 - 952

https://doi.org/10.1145/3628797.3628995

Published: 07 December 2023

Abstract

Video retrieval is the task of finding specific video content in a large database, a central challenge in the age of digital multimedia. This article proposes an approach to video retrieval that uses deep learning models to extract features and then performs retrieval over those features. The method combines several feature extraction steps: keyframe extraction, OpenAI CLIP [7] feature extraction, object detection, and automatic speech recognition (ASR). The transcripts produced by ASR are encoded with BERT [3] embeddings, and the extracted features are stored in JSON and binary file formats. The system indexes and retrieves videos by their visual, audio, textual, and contextual attributes, and it accepts either a single text description or multiple text descriptions of a sequence of events as a query. We conducted extensive tests on diverse video data provided by the organizers of the Ho Chi Minh City AI Challenge 2023 to validate the approach. The results indicate that the proposed system outperforms other methods in both retrieval accuracy and speed, making it well suited to real-time applications.
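The abstract does not spell out how queries are scored against the stored features, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that keyframes and text queries have already been embedded into a shared vector space (as CLIP would do), uses small hand-written vectors in place of real embeddings, and ranks by brute-force cosine similarity rather than an approximate index. The function names `cosine`, `search`, and `match_event_sequence`, and the dynamic-programming rule for ordered event queries, are hypothetical choices made for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query: Vector, index: Dict[str, Vector], k: int = 3) -> List[Tuple[str, float]]:
    """Single-description query: rank every keyframe by similarity, keep the top k."""
    scored = [(frame_id, cosine(query, vec)) for frame_id, vec in index.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


def match_event_sequence(
    queries: List[Vector], frames: List[Tuple[str, Vector]]
) -> Tuple[float, List[str]]:
    """Multi-description query (assumed scoring rule, not from the paper).

    `frames` holds one video's keyframes in temporal order. Each query is
    assigned to a distinct keyframe so that the assignments appear in the
    same order as the queries, maximizing the sum of similarities.
    """
    n, m = len(queries), len(frames)
    if n == 0 or n > m:
        return float("-inf"), []
    neg = float("-inf")
    sim = [[cosine(q, vec) for _, vec in frames] for q in queries]
    # best[i][j]: best total when query i is matched to frame j.
    best = [[neg] * m for _ in range(n)]
    back = [[-1] * m for _ in range(n)]
    best[0] = list(sim[0])
    for i in range(1, n):
        run_best, run_arg = neg, -1
        for j in range(i, m):
            # Frame j-1 becomes a valid predecessor for query i-1.
            if best[i - 1][j - 1] > run_best:
                run_best, run_arg = best[i - 1][j - 1], j - 1
            best[i][j] = run_best + sim[i][j]
            back[i][j] = run_arg
    j = max(range(m), key=lambda c: best[n - 1][c])
    total = best[n - 1][j]
    path: List[str] = []
    for i in range(n - 1, -1, -1):
        path.append(frames[j][0])
        j = back[i][j]
    path.reverse()
    return total, path


if __name__ == "__main__":
    # Toy 3-dimensional vectors standing in for real keyframe embeddings.
    video = [("f0", [1, 0, 0]), ("f1", [0, 1, 0]), ("f2", [0, 0, 1]), ("f3", [1, 0, 0])]
    print(search([0, 1, 0], dict(video), k=2))
    print(match_event_sequence([[0, 1, 0], [1, 0, 0]], video))
```

In a real system the brute-force loop in `search` would typically be replaced by a vector index such as the GPU similarity search library cited in [5]; the sketch keeps the exhaustive version so that the ranking logic stays visible.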

References

[1] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A Simple Framework for Contrastive Learning of Visual Representations. arXiv:2002.05709 [cs.LG]

[2] Karan Desai and Justin Johnson. 2021. VirTex: Learning Visual Representations from Textual Annotations. arXiv:2006.06666 [cs.CV]

[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. CoRR abs/1810.04805 (2018). arXiv:1810.04805 http://arxiv.org/abs/1810.04805

[4] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020. Momentum Contrast for Unsupervised Visual Representation Learning. arXiv:1911.05722 [cs.CV]

[5] Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2017. Billion-scale similarity search with GPUs. CoRR abs/1702.08734 (2017). arXiv:1702.08734 http://arxiv.org/abs/1702.08734

[6] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781 [cs.CL]

[7] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. arXiv:2103.00020 [cs.CV]

[8] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems, C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett (Eds.). Vol. 28. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2015/file/14bfa6bb14875e45bba028a21ed38046-Paper.pdf

[9] Mert Bulent Sariyildiz, Julien Perez, and Diane Larlus. 2020. Learning Visual Representations with Caption Annotations. arXiv:2008.01392 [cs.CV]

[10] Tomáš Souček, Jaroslav Moravec, and Jakub Lokoč. 2019. TransNet: A deep network for fast detection of common shot transitions. CoRR abs/1906.03363 (2019). arXiv:1906.03363 http://arxiv.org/abs/1906.03363

[11] Yuhao Zhang, Hang Jiang, Yasuhide Miura, Christopher D. Manning, and Curtis P. Langlotz. 2022. Contrastive Learning of Medical Visual Representations from Paired Images and Text. arXiv:2010.00747 [cs.CV]

Index Terms

Efficient Video Retrieval with Advanced Deep Learning Models
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision tasks
        Visual content-based indexing and retrieval
2. Information systems
  1. Information retrieval

Published In

SOICT '23: Proceedings of the 12th International Symposium on Information and Communication Technology

December 2023

1058 pages

ISBN:9798400708916

DOI:10.1145/3628797

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

SOICT 2023

SOICT 2023: The 12th International Symposium on Information and Communication Technology

December 7 - 8, 2023

Ho Chi Minh, Vietnam

Acceptance Rates

Overall Acceptance Rate 147 of 318 submissions, 46%
