Integrating Multiple Models For Effective Video Retrieval and Multi-stage Search

Authors:

Tuong Bui Cong Khanh,

Khoa Tran Nhat,

Kien Luu Trung,

Thuyen Tran Doan,

Khiem Le Tran Trong,

Thanh Duc Ngo

SOICT '23: Proceedings of the 12th International Symposium on Information and Communication Technology

Pages 1003 - 1010

https://doi.org/10.1145/3628797.3629022

Published: 07 December 2023

Abstract

Video is one of the most prevalent forms of data due to the widespread availability of recording devices. This makes video retrieval systems essential, since they help locate the video segment within a dataset that most closely matches a given query. One difficulty of video querying is the processing of multimedia data (images, audio, and text). In addition, it is important to integrate temporal information, because queries frequently describe events occurring within a specific time frame. This study therefore introduces a system capable of not only integrating various types of models but also effectively managing temporal searches through multi-stage processes. The efficacy of the system was demonstrated at the AI Challenge 2023, held in Ho Chi Minh City, where our team achieved the highest accuracy of all contestants in the qualifying phase and ranked first among the 60 participating teams.

Index Terms

1. Human-centered computing
  1. Human computer interaction (HCI)
    1. Interactive systems and tools
2. Information systems
  1. Information retrieval
    1. Users and interactive retrieval
      1. Search interfaces

Published In

SOICT '23: Proceedings of the 12th International Symposium on Information and Communication Technology

December 2023

1058 pages

ISBN:9798400708916

DOI:10.1145/3628797

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

SOICT 2023

SOICT 2023: The 12th International Symposium on Information and Communication Technology

December 7 - 8, 2023

Ho Chi Minh, Vietnam

Acceptance Rates

Overall Acceptance Rate 147 of 318 submissions, 46%
