TalkSee: Interactive Video Retrieval Engine Using Large Language Model

Gu, Guihe; Wu, Zhengqian; He, Jiangshan; Song, Lin; Wang, Zhongyuan; Liang, Chao

doi:10.1007/978-3-031-53302-0_36

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14557)

Included in the following conference series:

International Conference on Multimedia Modeling

Abstract

Current interactive retrieval systems mostly rely on collecting users' positive and negative feedback and updating the retrieved content based on that feedback. However, such feedback alone is not always sufficient to accurately express users' retrieval intent. Inspired by the powerful language understanding capability of large language models (LLMs), we propose TalkSee, an interactive video retrieval engine that uses an LLM for interaction in order to better capture users' latent retrieval intentions. We use the LLM to turn positive and negative feedback into natural language interactions. Specifically, combined with the feedback, we leverage the LLM to generate questions, update the queries, and conduct re-ranking. Finally, we design a tailored interactive user interface (UI) in conjunction with the above method for more efficient and effective video retrieval.
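
The three LLM-assisted steps named in the abstract (generate a question, update the query, re-rank) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the prompt wording, the stub `fake_llm`, and the word-overlap score are all hypothetical stand-ins chosen so the example runs without any model or video index. A real system would presumably score videos with learned text-video embeddings rather than caption overlap.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for an LLM: any callable mapping a prompt to text.
LLM = Callable[[str], str]


def overlap_score(query: str, caption: str) -> float:
    """Toy relevance: fraction of query words that occur in the caption."""
    q = set(query.lower().split())
    c = set(caption.lower().split())
    return len(q & c) / len(q) if q else 0.0


def rank(query: str, captions: Dict[str, str]) -> List[str]:
    """Video ids by descending toy score; ties keep insertion order."""
    return sorted(captions, key=lambda vid: -overlap_score(query, captions[vid]))


def interaction_round(
    query: str,
    captions: Dict[str, str],
    positives: List[str],
    negatives: List[str],
    llm: LLM,
    answer_fn: Callable[[str], str],
) -> Tuple[str, str, List[str]]:
    """One feedback round: question generation, query update, re-ranking."""
    liked = "; ".join(captions[v] for v in positives)
    disliked = "; ".join(captions[v] for v in negatives)
    # Step 1: turn positive/negative feedback into a clarifying question.
    question = llm(
        f"Query: {query}\nLiked: {liked}\nDisliked: {disliked}\n"
        "Ask one clarifying question."
    )
    # The user (simulated here by answer_fn) replies in natural language.
    answer = answer_fn(question)
    # Step 2: fold the answer back into an updated query.
    new_query = llm(
        f"Rewrite the query '{query}' using the answer '{answer}'. "
        "Return only the query."
    )
    # Step 3: re-rank the candidates against the updated query.
    return question, new_query, rank(new_query, captions)


def fake_llm(prompt: str) -> str:
    """Canned responses so the sketch runs offline (assumption, not a model)."""
    if "Rewrite" in prompt:
        return "dog running on a beach"
    return "Is the dog indoors or outdoors?"


CAPTIONS = {
    "v1": "a dog sleeping on a sofa",
    "v2": "a dog running on a beach",
    "v3": "a cat on a beach",
}

question, new_query, reranked = interaction_round(
    "dog", CAPTIONS, positives=["v2"], negatives=["v1"],
    llm=fake_llm, answer_fn=lambda q: "outdoors, on a beach",
)
print(question)
print(new_query)
print(reranked)
```

In this toy run the initial query "dog" cannot separate the first two clips, while the updated query moves the positively marked clip to the top. Passing the LLM as a plain callable keeps the loop independent of any particular model or API.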

Acknowledgement

This work is supported by the National Natural Science Foundation of China (Nos. U1903214, 62372339, 62371350, 61876135) and the Ministry of Education Industry-University Cooperative Education Project (Nos. 202102246004, 220800006041043). The numerical calculations in this paper were performed on the supercomputing system in the Supercomputing Center of Wuhan University. We would like to express our sincere gratitude to Zhiyu Zhou for his previous contribution to this work.

Author information

Authors and Affiliations

National Engineering Research Center for Multimedia Software (NERCMS), Wuhan, China
Guihe Gu, Zhengqian Wu, Jiangshan He, Lin Song, Zhongyuan Wang & Chao Liang
Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan, China
Guihe Gu, Zhengqian Wu, Jiangshan He, Lin Song, Zhongyuan Wang & Chao Liang
School of Computer Science, Wuhan University, Wuhan, China
Guihe Gu, Zhengqian Wu, Jiangshan He, Lin Song, Zhongyuan Wang & Chao Liang

Authors

Guihe Gu
Zhengqian Wu
Jiangshan He
Lin Song
Zhongyuan Wang
Chao Liang

Corresponding authors

Correspondence to Lin Song or Chao Liang.

Editor information

Editors and Affiliations

University of Amsterdam, Amsterdam, The Netherlands
Stevan Rudinac
Delft University of Technology, Delft, The Netherlands
Alan Hanjalic
Delft University of Technology, Delft, The Netherlands
Cynthia Liem
University of Amsterdam, Amsterdam, The Netherlands
Marcel Worring
Reykjavik University, Reykjavik, Iceland
Björn Þór Jónsson
Microsoft Research Lab – Asia, Beijing, China
Bei Liu
The University of Tokyo, Tokyo, Japan
Yoko Yamakata

Cite this paper

Gu, G., Wu, Z., He, J., Song, L., Wang, Z., Liang, C. (2024). TalkSee: Interactive Video Retrieval Engine Using Large Language Model. In: Rudinac, S., et al. MultiMedia Modeling. MMM 2024. Lecture Notes in Computer Science, vol 14557. Springer, Cham. https://doi.org/10.1007/978-3-031-53302-0_36

DOI: https://doi.org/10.1007/978-3-031-53302-0_36
Published: 29 January 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-53301-3
Online ISBN: 978-3-031-53302-0
eBook Packages: Computer Science, Computer Science (R0)
