ABSTRACT
Multimedia retrieval is the process of obtaining digital text, images, videos, and audio segments relevant to an information need from a collection of such resources. As the amount of data keeps growing, scalable and interactive retrieval systems that work efficiently on extensive data collections while maintaining high precision are in high demand in both industry and research. This paper presents the Pumpkin system, an interactive multimedia retrieval system first used in The AI Challenge Ho Chi Minh City 2023, an annual video event and moment retrieval competition. The system is built to handle retrieval in a video collection of considerable size and complexity through three primary methods: visual-text association search, object-based search, and audio speech search. Additionally, the system has an integrated temporal workflow for searching conceptually related shots in sequence, which filters out out-of-context results and promotes suitable ones as the user supplies more details. Our system also puts great emphasis on user experience by incorporating a clean and intuitive interface design with simplified user-side functionality, allowing more efficient information retrieval, whether simple or complex, in a huge collection of multimedia data.
Index Terms
- An Interactive System for Multimedia Retrieval in Video Collection with Temporal Integration