Abstract
The current interactive retrieval system mostly relies on collecting user’s positive and negative feedback and updating the retrieval content based on this feedback. However, this method is not always sufficient to accurately express users’ retrieval intent. Inspired by the powerful language understanding capability of the Large Language Model (LLM), we propose TalkSee, an interactive video retrieval engine using LLM for interaction in order to better capture users’ latent retrieval intentions. We use the large language model for processing positive and negative feedback into natural language interactions. Specifically, combined with feedback, we leverage LLM to generate questions, update the queries, and conduct re-ranking. Last but not least, we design a tailored interactive user interface (UI) in conjunction with the above method for more efficient and effective video retrieval.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Amato, G., et al.: VISIONE at video browser showdown 2023. In: Dang-Nguyen, DT., et al. (eds.) MultiMedia Modeling, MMM 2023. LNCS, vol. 13833, pp. 615–621. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-27077-2_48
Jónsson, B.Þ., Khan, O.S., Koelma, D.C., Rudinac, S., Worring, M., Zahálka, J.: Exquisitor at the video browser showdown 2020. In: Ro, Y.M., et al. (eds.) MMM 2020, Part II 26. LNCS, vol. 11962, pp. 796–802. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-37734-2_72
Lee, Y., Choi, H., Park, S., Ro, Y.M.: IVIST: interactive video search tool in VBS 2021. In: Lokoč, J., et al. (eds.) MMM 2021, Part II 27. LNCS, vol. 12573, pp. 423–428. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-67835-7_39
Li, J., Li, D., Savarese, S., Hoi, S.: BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597 (2023)
Lokoč, J., et al.: Interactive video retrieval in the age of effective joint embedding deep models: lessons from the 11th VBS. Multimedia Syst. 29(10), 1–24 (2023)
Schall, K., Hezel, N., Jung, K., Barthel, K.U.: Vibro: video browsing with semantic and visual image embeddings. In: Dang-Nguyen, D.T., et al. (eds.) MultiMedia Modeling, MMM 2023. LNCS, vol. 13833, pp. 665–670. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-27077-2_56
Song, W., He, J., Li, X., Feng, S., Liang, C.: QIVISE: a quantum-inspired interactive video search engine in VBS2023. In: Dang-Nguyen, D.T., et al. (eds.) International Conference on Multimedia Modeling, vol. 13833, pp. 640–645. Springer, Heidelberg (2023). https://doi.org/10.1007/978-3-031-27077-2_52
Sun, W., Yan, L., Ma, X., Ren, P., Yin, D., Ren, Z.: Is ChatGPT good at search? Investigating large language models as re-ranking agent. arXiv preprint arXiv:2304.09542 (2023)
Thomee, B., Lew, M.S.: Interactive search in image retrieval: a survey. Int. J. Multimedia Inf. Retriev. 1, 71–86 (2012)
Xu, H., et al.: mPLUG-2: a modularized multi-modal foundation model across text, image and video. arXiv preprint arXiv:2302.00402 (2023)
Acknowledgement
This work is supported by the National Natural Science Foundation of China (No. U1903214, 62372339, 62371350, 61876135), and the Ministry of Education Industry-University Cooperative Education Project (No. 202102246004, 220800006041043). The numerical calculations in this paper have been done on the supercomputing system in the Supercomputing Center of Wuhan University. We would like to express our sincere gratitude to Zhiyu Zhou for his previous contribution to this work.
Author information
Authors and Affiliations
Corresponding authors
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Gu, G., Wu, Z., He, J., Song, L., Wang, Z., Liang, C. (2024). TalkSee: Interactive Video Retrieval Engine Using Large Language Model. In: Rudinac, S., et al. MultiMedia Modeling. MMM 2024. Lecture Notes in Computer Science, vol 14557. Springer, Cham. https://doi.org/10.1007/978-3-031-53302-0_36
Download citation
DOI: https://doi.org/10.1007/978-3-031-53302-0_36
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-53301-3
Online ISBN: 978-3-031-53302-0
eBook Packages: Computer ScienceComputer Science (R0)