
V-FIRST: A Flexible Interactive Retrieval System for Video at VBS 2022

  • Conference paper
Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13142)

Abstract

Video retrieval systems have a wide range of applications across many domains; user-friendly and efficient systems are therefore essential. For VBS 2022, we developed V-FIRST, a flexible interactive video retrieval system that supports two usage scenarios: querying with text descriptions and querying with visual examples. We exploit both the visual and the temporal information in videos to extract concepts related to entities, events, scenes, activities, and motion trajectories for video indexing. The system supports keyword and sentence queries, as V-FIRST can evaluate the semantic similarity between visual and textual embedding vectors. V-FIRST also lets users express queries as visual impressions, such as sketches and 2D spatial maps of dominant colors. Query expansion, elastic temporal video navigation, and intellisense hints further boost the performance of our system.
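The text-to-visual matching described in the abstract can be illustrated with a minimal sketch: text and keyframe embeddings that live in a shared space are compared by cosine similarity, and keyframes are ranked by their score. The function names and the NumPy-based scoring below are illustrative assumptions for exposition, not the actual V-FIRST implementation.

```python
import numpy as np

def cosine_similarity(text_emb: np.ndarray, visual_emb: np.ndarray) -> float:
    """Cosine similarity between a text embedding and a visual embedding.

    Assumes both vectors come from a shared (cross-modal) embedding space.
    """
    return float(np.dot(text_emb, visual_emb) /
                 (np.linalg.norm(text_emb) * np.linalg.norm(visual_emb)))

def rank_keyframes(query_emb: np.ndarray,
                   keyframe_embs: np.ndarray,
                   top_k: int = 5) -> list[tuple[int, float]]:
    """Rank keyframe embeddings (one per row) against a query embedding.

    Returns the top_k (keyframe index, similarity score) pairs,
    highest score first.
    """
    norms = np.linalg.norm(keyframe_embs, axis=1) * np.linalg.norm(query_emb)
    scores = keyframe_embs @ query_emb / norms
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order]
```

In a real system the embeddings would come from a joint vision-language model; the ranking step stays the same regardless of which encoder produces the vectors.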



Acknowledgement

The team would like to thank Vinh-Hung Tran and Trong-Thang Pham for the enhanced captioning module; Trong-Tung Nguyen and Huu-Nghia Nguyen-Ho for the human-object interaction module; and Tien-Phat Nguyen and Ba-Thinh Tran-Le for the moving-trajectory retrieval method.

Hoang-Phuc Trang-Trung, Thanh-Cong Le, and Mai-Khiem Tran were funded by Vingroup Joint Stock Company and supported by the Domestic Master/PhD Scholarship Programme of Vingroup Innovation Foundation (VINIF), Vingroup Big Data Institute (VINBIGDATA), under codes VINIF.2020.ThS.JVN.03, VINIF.2020.ThS.JVN.05, and VINIF.2020.ThS.JVN.06, respectively.

The work was funded by Gia Lam Urban Development and Investment Company Limited, Vingroup and supported by Vingroup Innovation Foundation (VINIF) under project code VINIF.2019.DA19.

Author information

Corresponding authors

Correspondence to Minh-Triet Tran or Nhat Hoang-Xuan.


Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Tran, M.-T., et al. (2022). V-FIRST: A Flexible Interactive Retrieval System for Video at VBS 2022. In: Þór Jónsson, B., et al. (eds.) MultiMedia Modeling. MMM 2022. Lecture Notes in Computer Science, vol. 13142. Springer, Cham. https://doi.org/10.1007/978-3-030-98355-0_55

  • DOI: https://doi.org/10.1007/978-3-030-98355-0_55

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-98354-3

  • Online ISBN: 978-3-030-98355-0

  • eBook Packages: Computer Science; Computer Science (R0)
