Multi-User Video Search: Bridging the Gap Between Text and Embedding Queries

Authors:

Khai Trinh Xuan,

Nguyen Nguyen Khoi,

Huy Luong-Quang,

Anh Nguyen-Luong-Nam,

Hong-Phuc Nguyen

SOICT '23: Proceedings of the 12th International Symposium on Information and Communication Technology

Pages 923 - 930

https://doi.org/10.1145/3628797.3628957

Published: 07 December 2023

Abstract

Video search is a crucial task in the modern era, as the rapid growth of video platforms has led to an exponential increase in the number of videos on the internet. Effective video management is therefore essential. Significant research has been conducted on video search, with most approaches leveraging image-text retrieval or searching by object, speech, color, and text in images. However, these approaches can be inefficient when multiple users search for the same query simultaneously, as they may overlap in their search spaces. Additionally, most video search systems do not support complex queries that require information from multiple frames in a video. In this paper, we propose a solution to these problems by splitting the search space for different users and ignoring images that have already been considered by other users to avoid redundant searches. To address complex queries, we split the query and apply a technique called forward and backward search.
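The two ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no algorithmic detail, so the class `SharedSearchSession`, the function `forward_backward`, the round-robin partitioning, and the fixed frame `window` are all assumptions made for illustration. The sketch assumes each user receives a disjoint slice of a ranked result list, that a shared set records frames any user has already reviewed, and that a complex query has been split into two sub-queries whose hits are matched by looking forward and backward in time within the same video.

```python
from dataclasses import dataclass, field


@dataclass
class SharedSearchSession:
    """Hypothetical coordinator for several users working on the same query."""
    num_users: int
    seen: set[int] = field(default_factory=set)  # frame ids reviewed by any user

    def partition(self, ranked_frames: list[int]) -> list[list[int]]:
        """Deal ranked frame ids out round-robin so users get disjoint slices."""
        return [ranked_frames[u::self.num_users] for u in range(self.num_users)]

    def next_batch(self, user: int, ranked_frames: list[int], size: int) -> list[int]:
        """Return up to `size` unseen frames from this user's slice, marking them seen."""
        own = self.partition(ranked_frames)[user]
        batch = [f for f in own if f not in self.seen][:size]
        self.seen.update(batch)
        return batch


Frame = tuple[str, int]  # (video id, frame index); an assumed representation


def forward_backward(anchors: list[Frame], partner_hits: list[Frame],
                     window: int) -> list[tuple[Frame, Frame, str]]:
    """Pair each anchor hit with partner hits at most `window` frames away.

    `anchors` are hits for one sub-query, `partner_hits` are hits for the other.
    A partner later in the same video is a "forward" match, an earlier one a
    "backward" match.
    """
    partner = set(partner_hits)
    matches = []
    for video, idx in anchors:
        for offset in range(1, window + 1):
            if (video, idx + offset) in partner:
                matches.append(((video, idx), (video, idx + offset), "forward"))
            if (video, idx - offset) in partner:
                matches.append(((video, idx), (video, idx - offset), "backward"))
    return matches


session = SharedSearchSession(num_users=2)
ranked = [10, 11, 12, 13, 14, 15]
first_for_user0 = session.next_batch(0, ranked, size=2)
first_for_user1 = session.next_batch(1, ranked, size=2)
# A later, different ranking for user 1: frame 13 was already reviewed, so it is skipped.
second_for_user1 = session.next_batch(1, [10, 13, 20, 21], size=2)

pairs = forward_backward([("v1", 5)], [("v1", 7), ("v1", 3), ("v2", 6)], window=2)
```

In this sketch the shared `seen` set is what prevents redundant review across users, while the round-robin split only decides who looks at which frame first. A real system would presumably need to synchronise `seen` between clients, which is outside the scope of the example.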

Index Terms

1. Information systems
  1. Information retrieval
    1. Users and interactive retrieval
      1. Collaborative search
      2. Search interfaces

Published In

SOICT '23: Proceedings of the 12th International Symposium on Information and Communication Technology

December 2023

1058 pages

ISBN:9798400708916

DOI:10.1145/3628797

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

SOICT 2023

SOICT 2023: The 12th International Symposium on Information and Communication Technology

December 7 - 8, 2023

Ho Chi Minh, Vietnam

Acceptance Rates

Overall Acceptance Rate 147 of 318 submissions, 46%
