DOI: 10.1145/3628797.3629022
research-article

Integrating Multiple Models For Effective Video Retrieval and Multi-stage Search

Published: 07 December 2023

Abstract

Video is one of the most prevalent forms of data, owing to the widespread availability of recording devices. This makes video retrieval systems essential, since they help locate the video segment in a dataset that most closely matches a given query. One difficulty of video querying is the processing of multimedia data (images, audio, and text). It is also important to integrate temporal information, as queries frequently describe events occurring within a specific time frame. This study therefore introduces an innovative system that not only integrates various types of models but also handles temporal searches effectively through multi-stage processing. The efficacy of the system was demonstrated at the AI Challenge 2023 in Ho Chi Minh City, where our team achieved the highest accuracy of all contestants in the qualifying phase and ranked first among the 60 participating teams.

Published In

SOICT '23: Proceedings of the 12th International Symposium on Information and Communication Technology
December 2023
1058 pages
ISBN: 9798400708916
DOI: 10.1145/3628797
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Content-Based Image Retrieval
  2. Lifelog Event Retrieval
  3. Video Retrieval

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SOICT 2023

Acceptance Rates

Overall Acceptance Rate 147 of 318 submissions, 46%

Bibliometrics

  • Total Citations: 0
  • Total Downloads: 60
  • Downloads (Last 12 months): 40
  • Downloads (Last 6 weeks): 0

Reflects downloads up to 01 Mar 2025
