TalkSee: Interactive Video Retrieval Engine Using Large Language Model

  • Conference paper
MultiMedia Modeling (MMM 2024)

Abstract

Current interactive retrieval systems mostly rely on collecting users’ positive and negative feedback and updating the retrieved results based on that feedback. However, this approach is not always sufficient to express users’ retrieval intent accurately. Inspired by the powerful language understanding capability of Large Language Models (LLMs), we propose TalkSee, an interactive video retrieval engine that uses an LLM for interaction in order to better capture users’ latent retrieval intentions. We use the LLM to turn positive and negative feedback into natural language interactions. Specifically, combining the feedback, we leverage the LLM to generate questions, update the queries, and conduct re-ranking. Finally, we design a tailored interactive user interface (UI) in conjunction with the above method for more efficient and effective video retrieval.
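
The interaction loop described in the abstract (feedback, then a generated question, then an updated query, then re-ranking) can be illustrated with a minimal sketch. This is not the TalkSee implementation: the `LLM` callable interface, the `Candidate` type, the prompts, and all function names are hypothetical, the LLM is replaced by a deterministic stub so the code runs offline, and re-ranking is approximated by simple word overlap rather than by an LLM.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt to generated text.
LLM = Callable[[str], str]


@dataclass
class Candidate:
    video_id: str
    caption: str  # assumed text description of the video


def generate_question(llm: LLM, query: str,
                      liked: List[Candidate], disliked: List[Candidate]) -> str:
    """Turn positive/negative feedback into one clarifying question."""
    prompt = (
        f"Query: {query}\n"
        f"Liked: {[c.caption for c in liked]}\n"
        f"Disliked: {[c.caption for c in disliked]}\n"
        "Ask one question that separates the liked from the disliked results."
    )
    return llm(prompt)


def update_query(llm: LLM, query: str, question: str, answer: str) -> str:
    """Fold the user's answer back into the text query."""
    prompt = (
        f"Query: {query}\nQuestion: {question}\nAnswer: {answer}\n"
        "Rewrite the query so that it includes the answer."
    )
    return llm(prompt)


def rerank(query: str, candidates: List[Candidate]) -> List[Candidate]:
    """Stand-in re-ranker: order candidates by word overlap with the query."""
    words = set(query.lower().split())

    def overlap(c: Candidate) -> int:
        return len(words & set(c.caption.lower().split()))

    return sorted(candidates, key=overlap, reverse=True)


def toy_llm(prompt: str) -> str:
    """Deterministic stub used in place of a real model (assumption)."""
    if "Rewrite the query" in prompt:
        fields = dict(line.split(": ", 1)
                      for line in prompt.splitlines() if ": " in line)
        return f"{fields['Query']} {fields['Answer']}"
    return "Is the scene indoors or outdoors?"


candidates = [
    Candidate("v1", "a dog running in a living room"),
    Candidate("v2", "a dog running on a beach outdoors"),
    Candidate("v3", "a cat sleeping on a sofa"),
]
query = "a dog running"

# One interaction round: the user liked v2 and disliked v1.
question = generate_question(toy_llm, query,
                             liked=[candidates[1]], disliked=[candidates[0]])
new_query = update_query(toy_llm, query, question, answer="outdoors on a beach")
ranked = rerank(new_query, candidates)
print(question)
print(new_query)
print([c.video_id for c in ranked])
```

In a real system the stub would be replaced by a call to an actual language model, and the overlap score by whatever retrieval or re-ranking model the engine uses; the sketch only shows how the three steps chain together.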

Acknowledgement

This work is supported by the National Natural Science Foundation of China (No. U1903214, 62372339, 62371350, 61876135), and the Ministry of Education Industry-University Cooperative Education Project (No. 202102246004, 220800006041043). The numerical calculations in this paper have been done on the supercomputing system in the Supercomputing Center of Wuhan University. We would like to express our sincere gratitude to Zhiyu Zhou for his previous contribution to this work.

Author information

Corresponding authors

Correspondence to Lin Song or Chao Liang.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Gu, G., Wu, Z., He, J., Song, L., Wang, Z., Liang, C. (2024). TalkSee: Interactive Video Retrieval Engine Using Large Language Model. In: Rudinac, S., et al. MultiMedia Modeling. MMM 2024. Lecture Notes in Computer Science, vol 14557. Springer, Cham. https://doi.org/10.1007/978-3-031-53302-0_36

  • DOI: https://doi.org/10.1007/978-3-031-53302-0_36

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-53301-3

  • Online ISBN: 978-3-031-53302-0

  • eBook Packages: Computer Science, Computer Science (R0)
