DOI: 10.1145/3628797.3629021

Diverse Search Methods and Multi-Modal Fusion for High-Performance Video Retrieval

Published: 07 December 2023

Abstract

Querying events within extensive video datasets is currently a prominent research focus in multimedia information retrieval. High-performance retrieval in this setting requires efficient extraction and effective storage of information from videos to expedite the retrieval process, and these challenges become especially pronounced when handling substantial datasets. In this paper, we introduce a system tailored for event querying within video data. The system is designed to optimize retrieval speed and organize storage efficiently, harnessing FAISS and Elasticsearch. It can process diverse forms of input, including textual video descriptions, Optical Character Recognition (OCR) results, Automatic Speech Recognition (ASR) transcriptions, visually similar images, and details about objects within videos, such as their color and quantity. Moreover, the system can extract information about the temporal sequence of events within videos, a particularly challenging task when working from individual video frames. By combining these various input types, the system delivers optimal results.
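The abstract describes combining keyword-style search over OCR/ASR text (as Elasticsearch provides) with similarity search over visual embeddings (as FAISS provides). A minimal late-fusion sketch of that idea follows; the function names, fusion weights, and toy data are illustrative assumptions, not the paper's actual implementation, and the two search backends are emulated in pure Python.

```python
# Illustrative sketch only: keyword matching stands in for an
# Elasticsearch-style inverted index, and cosine similarity stands in
# for a FAISS-style nearest-neighbour search; scores are merged with
# an assumed weighted sum.
from math import sqrt

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def keyword_score(query_terms, doc_text):
    # Crude stand-in for an inverted-index relevance score:
    # fraction of query terms present in the frame's OCR/ASR text.
    terms = doc_text.lower().split()
    return sum(t in terms for t in query_terms) / max(len(query_terms), 1)

def fuse(query_terms, query_emb, frames, w_text=0.4, w_visual=0.6):
    # Weighted late fusion of the per-modality scores for each frame.
    ranked = []
    for fid, (text, emb) in frames.items():
        score = (w_text * keyword_score(query_terms, text)
                 + w_visual * cosine(query_emb, emb))
        ranked.append((fid, round(score, 3)))
    return sorted(ranked, key=lambda p: p[1], reverse=True)

# Toy index: frame id -> (OCR/ASR text, visual embedding)
frames = {
    "v1_f010": ("breaking news flood in the city", [0.9, 0.1, 0.0]),
    "v2_f230": ("cooking show pasta recipe",       [0.1, 0.8, 0.1]),
}
print(fuse(["flood", "news"], [1.0, 0.0, 0.0], frames))
```

In a production system each modality would be indexed once offline and queried independently, with only the top-k candidates from each backend entering the fusion step.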


Published In

SOICT '23: Proceedings of the 12th International Symposium on Information and Communication Technology
December 2023
1058 pages
ISBN:9798400708916
DOI:10.1145/3628797

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. interactive retrieval system
  2. multimedia information retrieval
  3. video event retrieval

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SOICT 2023

Acceptance Rates

Overall Acceptance Rate 147 of 318 submissions, 46%

