VISIONE 5.0: Enhanced User Interface and AI Models for VBS2024

  • Conference paper
  • MultiMedia Modeling (MMM 2024)

Abstract

In this paper, we introduce the fifth release of VISIONE, an advanced video retrieval system offering diverse search functionalities. Users can search for a target video with free-text prompts, by sketching objects and colors that appear in the target scenes on a canvas, or by providing example images to retrieve video keyframes with similar content. Compared to the previous version of our system, which was the runner-up at VBS 2023, the forthcoming release, set to participate in VBS 2024, features a refined user interface that improves usability and updated AI models for more effective video content analysis.
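
As a rough illustration of the query-by-text functionality mentioned above, the following sketch matches a free-text prompt against precomputed keyframe embeddings using a joint text-image embedding model and a FAISS index (FAISS is the library referenced in the notes below). This is not the VISIONE implementation: the model choice (OpenCLIP ViT-B-32), the feature file name, and the dimensionality are assumptions made purely for illustration.

    # Minimal sketch of text-to-keyframe retrieval (NOT the VISIONE implementation).
    # Assumptions: keyframe embeddings were extracted offline with the same OpenCLIP
    # model, L2-normalized, and stored in "keyframe_features.npy" (N x 512).
    import faiss
    import numpy as np
    import open_clip
    import torch

    model, _, _ = open_clip.create_model_and_transforms(
        "ViT-B-32", pretrained="laion2b_s34b_b79k")
    tokenizer = open_clip.get_tokenizer("ViT-B-32")

    features = np.load("keyframe_features.npy").astype("float32")
    index = faiss.IndexFlatIP(features.shape[1])  # inner product = cosine on unit vectors
    index.add(features)

    def search(query: str, k: int = 10):
        """Return the ids and scores of the k keyframes most similar to the prompt."""
        with torch.no_grad():
            q = model.encode_text(tokenizer([query]))
            q = q / q.norm(dim=-1, keepdim=True)
        scores, ids = index.search(q.cpu().numpy().astype("float32"), k)
        return list(zip(ids[0].tolist(), scores[0].tolist()))

    print(search("a scuba diver filming a coral reef"))

In a full system, the returned keyframe ids would be mapped back to video segments and fused with the results of the other query modalities (sketches, example images) before being shown in the browsing interface.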

Notes

  1. https://github.com/facebookresearch/faiss
  2. https://lucene.apache.org/

Acknowledgements

This work was partially funded by AI4Media - A European Excellence Centre for Media, Society and Democracy (EC, H2020 n. 951911), the PNRR-National Centre for HPC, Big Data and Quantum Computing project CUP B93C22000620006, and by the Horizon Europe Research & Innovation Programme under Grant agreement N. 101092612 (Social and hUman ceNtered XR - SUN project). Views and opinions expressed in this paper are those of the authors only and do not necessarily reflect those of the European Union. Neither the European Union nor the European Commission can be held responsible for them.

Author information

Corresponding author

Correspondence to Lucia Vadicamo.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Amato, G. et al. (2024). VISIONE 5.0: Enhanced User Interface and AI Models for VBS2024. In: Rudinac, S., et al. MultiMedia Modeling. MMM 2024. Lecture Notes in Computer Science, vol 14557. Springer, Cham. https://doi.org/10.1007/978-3-031-53302-0_29

  • DOI: https://doi.org/10.1007/978-3-031-53302-0_29

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-53301-3

  • Online ISBN: 978-3-031-53302-0

  • eBook Packages: Computer Science, Computer Science (R0)
