
Less Is More: Similarity Models for Content-Based Video Retrieval

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13834)

Abstract

The concept of object-to-object similarity plays a crucial role in interactive content-based video retrieval tools. Similarity (or distance) models are core components of several retrieval concepts, e.g., Query by Example or relevance feedback. In these scenarios, the common approach is to apply a feature extractor that transforms the object into a vector of features, i.e., positions it in an induced latent space. The similarity is then based on a distance metric in this space.

Historically, feature extractors were mostly based on color histograms or hand-crafted descriptors such as SIFT, but nowadays state-of-the-art tools mostly rely on deep learning (DL) approaches. However, there has so far been no systematic study of how suitable individual feature extractors are in the video retrieval domain, or, in other words, to what extent human-perceived and model-based similarities are concordant. To fill this gap, we conducted a user study with over 4000 similarity judgements comparing over 20 variants of feature extractors. The results corroborate the dominance of deep learning approaches, but surprisingly favor smaller and simpler DL models over larger ones.
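The retrieval pipeline described in the abstract is straightforward to express in code. The following is a minimal sketch, not the paper's implementation: `extract_features` is a hypothetical stand-in for any of the studied extractors (here a toy 64-bin RGB histogram), and Query by Example reduces to nearest-neighbor ranking under a distance metric in the induced latent space.

```python
import numpy as np

def extract_features(image: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for any feature extractor studied in the
    paper (color histogram, SIFT-based descriptor, DL embedding).
    Here: a normalized joint RGB histogram with 4 bins per channel
    (4**3 = 64 dimensions), purely for illustration."""
    hist, _ = np.histogramdd(
        image.reshape(-1, 3), bins=(4, 4, 4), range=[(0, 256)] * 3
    )
    vec = hist.ravel().astype(np.float64)
    return vec / (vec.sum() + 1e-12)

def query_by_example(query: np.ndarray, database: list[np.ndarray], k: int = 5):
    """Rank database frames by Euclidean distance to the query frame
    in the induced feature space; return the indices of the k nearest."""
    q = extract_features(query)
    feats = np.stack([extract_features(img) for img in database])
    dists = np.linalg.norm(feats - q, axis=1)  # L2 distance metric
    return np.argsort(dists)[:k]
```

Swapping `extract_features` for a deep embedding leaves the ranking loop unchanged, which is why the choice of extractor, and hence the induced space, dominates retrieval quality.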


Notes

  1. https://www.oberlo.com/blog/youtube-statistics.

  2. https://huggingface.co/.

  3. https://otrok.ms.mff.cuni.cz:8030/user.

  4. The exact prompt was “Which image is more similar to the one on the top?”.

  5. All mentioned differences were statistically significant at \(p<0.05\) according to Fisher's exact test (see the significance-test sketch after this list).

  6. The first group included RGB Histogram 256, LAB Positional 8×8, ImageGPT medium, EfficientNetB7, ViT large and ResNetV2 152. The second group included RGB Histogram 64, LAB Positional 2×2, ImageGPT small, EfficientNetB0, ViT base and ResNetV2 50 (see the descriptor sketch after this list).
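Regarding note 5, the sketch below shows how such a test could be run; the contingency counts are invented for illustration and are not the study's data. Fisher's exact test on a 2×2 table compares how often two extractors' similarity rankings agreed with the human judgements.

```python
from scipy.stats import fisher_exact

# Hypothetical counts of how often each extractor's nearest-neighbor
# choice agreed with the human similarity judgement (illustrative only).
model_a = {"agree": 310, "disagree": 90}   # e.g., a smaller DL model
model_b = {"agree": 260, "disagree": 140}  # e.g., a larger DL model

table = [
    [model_a["agree"], model_a["disagree"]],
    [model_b["agree"], model_b["disagree"]],
]
odds_ratio, p_value = fisher_exact(table)
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Difference is statistically significant at p < 0.05.")
```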
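Regarding note 6, the hand-crafted baselines can be sketched as follows. This is an assumed reading of the descriptor names, not the paper's exact definitions: "RGB Histogram N" is taken as a joint color histogram with N total bins, and "LAB Positional g×g" as the per-cell mean CIELAB color over a g×g grid of image regions.

```python
import numpy as np
from skimage.color import rgb2lab  # CIELAB conversion from scikit-image

def rgb_histogram(image: np.ndarray, bins: tuple[int, int, int]) -> np.ndarray:
    """Joint RGB histogram; e.g., bins=(4, 4, 4) yields a 64-dim descriptor.
    How the paper's '64' and '256' variants split bins across channels is
    an assumption here."""
    hist, _ = np.histogramdd(image.reshape(-1, 3), bins=bins, range=[(0, 256)] * 3)
    vec = hist.ravel().astype(np.float64)
    return vec / (vec.sum() + 1e-12)

def lab_positional(image: np.ndarray, grid: int) -> np.ndarray:
    """One plausible reading of 'LAB Positional g×g': the mean CIELAB color
    of each cell in a g×g grid, concatenated, so that positional
    information is preserved in the descriptor."""
    lab = rgb2lab(image / 255.0)  # rgb2lab expects floats in [0, 1]
    h, w = lab.shape[:2]
    cells = []
    for i in range(grid):
        for j in range(grid):
            cell = lab[i * h // grid:(i + 1) * h // grid,
                       j * w // grid:(j + 1) * w // grid]
            cells.append(cell.reshape(-1, 3).mean(axis=0))
    return np.concatenate(cells)  # grid * grid * 3 dimensions
```

Under this reading, the "64" vs. "256" and "2×2" vs. "8×8" variants differ only in descriptor dimensionality, mirroring the smaller-vs-larger comparison made for the DL models.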


Acknowledgments

This paper has been supported by the Czech Science Foundation (GAČR), project 22-21696S, and by Charles University grant SVV-260588. Computational resources were supplied by the project “e-Infrastruktura CZ” (e-INFRA CZ LM2018140), supported by the Ministry of Education, Youth and Sports of the Czech Republic. Source code and raw data are available at https://github.com/Anophel/image_similarity_study.

Author information

Correspondence to Ladislav Peška.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Veselý, P., Peška, L. (2023). Less Is More: Similarity Models for Content-Based Video Retrieval. In: Dang-Nguyen, DT., et al. MultiMedia Modeling. MMM 2023. Lecture Notes in Computer Science, vol 13834. Springer, Cham. https://doi.org/10.1007/978-3-031-27818-1_5


  • DOI: https://doi.org/10.1007/978-3-031-27818-1_5


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-27817-4

  • Online ISBN: 978-3-031-27818-1

  • eBook Packages: Computer Science (R0)
