Abstract
Video retrieval methods, e.g., for visual concept classification, person recognition, and similarity search, are essential for fine-grained semantic search in large video archives. However, such retrieval methods often have to be adapted to the users’ changing search requirements: Which concepts or persons are frequently searched for? Which research topics are currently important or will be relevant in the future? In this paper, we present VIVA, a software tool for building content-based video retrieval methods based on deep learning models. VIVA allows non-expert users to conduct visual information retrieval for concepts and persons in video archives and to add new persons or concepts to the underlying deep learning models as new requirements arise. For this purpose, VIVA provides a novel semi-automatic data acquisition workflow that includes a web crawler, image similarity search, and review and user feedback components, reducing the time-consuming manual effort of collecting training samples. We present experimental retrieval results using VIVA for four use cases in the context of a historical video collection of the German Broadcasting Archive based on about 34,000 hours of television recordings from the former German Democratic Republic (GDR). We evaluate the performance of deep learning models built using VIVA for 91 GDR-specific concepts and 98 personalities from the former GDR, as well as the performance of the image and person similarity search approaches.
Acknowledgements
This work is financially supported by the German Research Foundation (DFG project number 388420599).
About this article
Cite this article
Mühling, M., Korfhage, N., Pustu-Iren, K. et al. VIVA: visual information retrieval in video archives. Int J Digit Libr 23, 319–333 (2022). https://doi.org/10.1007/s00799-022-00337-y
DOI: https://doi.org/10.1007/s00799-022-00337-y