Abstract
Person-Action instance search (P-A INS) aims to retrieve instances of a specific person performing a specific action. The task appeared in the 2019–2021 INS tasks of the TREC Video Retrieval Evaluation (TRECVID). Most top-ranking solutions can be summarized by a Division-Fusion-Optimization (DFO) framework, in which person and action recognition scores are obtained separately, then fused, and optionally further optimized to generate the final ranking. However, TRECVID evaluates only the final ranking results, ignoring the effects of the intermediate steps and their implementation methods. We argue that fine-grained evaluation of the intermediate steps of the DFO framework will (1) provide a quantitative analysis of how different methods perform at each intermediate step; (2) identify better design choices that improve retrieval performance; and (3) inspire new ideas for future research through a limitation analysis of current techniques. In particular, we propose an indirect evaluation method motivated by the leave-one-out strategy, which finds an optimal solution that surpasses the champion teams in the 2020–2021 INS tasks. Moreover, to validate the generalizability and robustness of the proposed solution under various scenarios, we construct a new large-scale P-A INS dataset and conduct comparative experiments with both the leading NIST TRECVID INS solution and the state-of-the-art P-A INS method. Finally, we discuss the limitations of our evaluation work and suggest future research directions.
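To make the division and fusion steps concrete, the following is a minimal sketch, not the method evaluated in the paper. It assumes that each branch has already produced one score per shot, that fusion is a simple weighted sum, and that ranking quality is measured with average precision. The function names, the `weight` parameter, the shot identifiers, and all score values are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set


def fuse_scores(person: Dict[str, float],
                action: Dict[str, float],
                weight: float = 0.5) -> Dict[str, float]:
    """Weighted-sum fusion (an assumed fusion rule, for illustration only).

    A shot missing from one branch contributes 0.0 for that branch.
    """
    shots = set(person) | set(action)
    return {
        s: weight * person.get(s, 0.0) + (1.0 - weight) * action.get(s, 0.0)
        for s in shots
    }


def rank_shots(fused: Dict[str, float]) -> List[str]:
    """Sort shots by descending fused score; ties are broken by shot id."""
    return sorted(fused, key=lambda s: (-fused[s], s))


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Average precision of a ranked list against a set of relevant shots."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for position, shot in enumerate(ranking, start=1):
        if shot in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant)


# Hypothetical per-shot scores from the two separately run branches.
person_scores = {"shot1": 0.9, "shot2": 0.2, "shot3": 0.6}
action_scores = {"shot1": 0.3, "shot2": 0.8, "shot3": 0.7}

ranking = rank_shots(fuse_scores(person_scores, action_scores, weight=0.5))
ap = average_precision(ranking, relevant={"shot2", "shot3"})
print(ranking, round(ap, 4))
```

Swapping `fuse_scores` for another rule (for example, a product of scores) while holding the two branches fixed is one way to compare fusion choices in isolation, which is the kind of intermediate-step comparison the abstract argues for.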
Index Terms
- Person-action Instance Search in Story Videos: An Experimental Study