Leveraging LLMs and Generative Models for Interactive Known-Item Video Search

Ma, Zhixin; Wu, Jiaxin; Ngo, Chong Wah

doi:10.1007/978-3-031-53302-0_35

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14557)

Included in the following conference series:

International Conference on Multimedia Modeling

Abstract

While embedding techniques such as CLIP have considerably boosted search performance, user strategies in interactive video search still largely operate on a trial-and-error basis. Users often have to adjust their queries manually and inspect the search results carefully, so the outcome depends heavily on each user's skill and proficiency. Recent advances in large language models (LLMs) and generative models offer promising avenues for enhancing interactivity in video retrieval and reducing personal bias in query interpretation, particularly in known-item search. Specifically, LLMs can expand and diversify the semantics of a query while avoiding grammatical mistakes and language barriers. In addition, generative models can imagine or visualize a verbose query as images. We integrate these new capabilities into our existing system and evaluate their effectiveness on the V3C1 and V3C2 datasets.
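
As a rough illustration of how several variants of one query might be combined, the sketch below ranks video shots against each variant and merges the rankings with reciprocal rank fusion. It is not the method described in the paper: the fusion scheme, the function names (`cosine`, `rank_shots`, `fuse_query_variants`), and the three-dimensional toy vectors are all assumptions made for illustration. In a real system the variant vectors would be embeddings of LLM paraphrases or of generated images from a CLIP-like model, and the shot vectors would be keyframe embeddings.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_shots(query_vec: Sequence[float],
               shot_vecs: Dict[str, Sequence[float]]) -> List[str]:
    """Return shot ids sorted by descending similarity to one query embedding."""
    return sorted(shot_vecs,
                  key=lambda sid: cosine(query_vec, shot_vecs[sid]),
                  reverse=True)


def fuse_query_variants(variant_vecs: Sequence[Sequence[float]],
                        shot_vecs: Dict[str, Sequence[float]],
                        k: int = 60) -> List[str]:
    """Merge the per-variant rankings with reciprocal rank fusion.

    Each shot receives 1 / (k + rank) from every variant's ranking;
    k = 60 is a commonly used default, assumed here rather than tuned.
    """
    fused = defaultdict(float)
    for vec in variant_vecs:
        for rank, sid in enumerate(rank_shots(vec, shot_vecs), start=1):
            fused[sid] += 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)


# Hypothetical toy embeddings standing in for keyframe features.
SHOTS = {
    "caravan": [0.9, 0.1, 0.0],
    "beach": [0.0, 0.2, 0.9],
    "forest": [0.3, 0.9, 0.1],
}

# Hypothetical embeddings of three variants of the same query, e.g. the
# original text, an LLM paraphrase, and an image generated from the text.
VARIANTS = [
    [1.0, 0.0, 0.0],
    [0.8, 0.3, 0.0],
    [0.7, 0.2, 0.1],
]

if __name__ == "__main__":
    print(fuse_query_variants(VARIANTS, SHOTS))
```

Fusing ranks rather than raw similarity scores keeps variants from different modalities comparable, since text-to-image and image-to-image similarities usually live on different scales.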

Notes

1. VBS23-KIS-t3: Almost static shot of a brown-white caravan and a horse on a meadow. The caravan is in the center, the horse in the back to its right, and there is a large tree on the right. The camera is slightly shaky, and there is a forested hill in the background.

Acknowledgments

This research was supported by the Singapore Ministry of Education (MOE) Academic Research Fund (AcRF) Tier 1 grant.

Author information

Authors and Affiliations

School of Computing and Information Systems, Singapore Management University, Singapore, Singapore
Zhixin Ma, Jiaxin Wu & Chong Wah Ngo
Department of Computer Science, City University of Hong Kong, Hong Kong, China
Jiaxin Wu

Corresponding author

Correspondence to Zhixin Ma.

Editor information

Editors and Affiliations

University of Amsterdam, Amsterdam, The Netherlands
Stevan Rudinac
Delft University of Technology, Delft, The Netherlands
Alan Hanjalic
Delft University of Technology, Delft, The Netherlands
Cynthia Liem
University of Amsterdam, Amsterdam, The Netherlands
Marcel Worring
Reykjavik University, Reykjavik, Iceland
Björn Þór Jónsson
Microsoft Research Lab – Asia, Beijing, China
Bei Liu
The University of Tokyo, Tokyo, Japan
Yoko Yamakata

About this paper

Cite this paper

Ma, Z., Wu, J., Ngo, C.W. (2024). Leveraging LLMs and Generative Models for Interactive Known-Item Video Search. In: Rudinac, S., et al. MultiMedia Modeling. MMM 2024. Lecture Notes in Computer Science, vol 14557. Springer, Cham. https://doi.org/10.1007/978-3-031-53302-0_35

DOI: https://doi.org/10.1007/978-3-031-53302-0_35
Published: 29 January 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-53301-3
Online ISBN: 978-3-031-53302-0
eBook Packages: Computer Science, Computer Science (R0)
