Abstract
Human-Centric Spatio-Temporal Video Grounding (HC-STVG) is a recently emerging task that aims to localize, in both space and time, the target person described by a natural language query. To tackle this task, we propose a simple yet effective two-stage framework based on a matching-and-localizing paradigm. In the first stage, we conduct cross-modal matching between the query and candidate moments to determine the temporal boundaries. Specifically, we develop an Augmented 2D Temporal Adjacent Network (Aug-2D-TAN) as our temporal matching module, which improves 2D-TAN [7] in two aspects: 1) a Temporal-Aware Context Aggregation (TACA) module that jointly aggregates past contexts in the forward direction and future contexts in the backward direction, yielding more discriminative moment representations for cross-modal matching; and 2) a Random Concatenation Augmentation (RCA) mechanism that combats overfitting and reduces the risk of learning a query-independent saliency prior, which can be mistakenly induced by training videos that contain only a single salient event. In the second stage, we use the pretrained MDETR [4] model to associate the language query with candidate bounding boxes, and then apply a query-based denoising procedure to these language-aware boxes to obtain frame-wise predictions for spatial localization. Experiments show that our simple yet effective framework achieves promising performance on the challenging HC-STVG task.
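As an illustration of the Random Concatenation Augmentation (RCA) idea described above, the following minimal sketch concatenates the clip features of a distractor video to a training video and shifts the ground-truth moment accordingly, so the target moment no longer covers the whole (or the only salient part of the) video. All names, the dict layout, and the exact label-adjustment details here are our assumptions for exposition, not the paper's released implementation.

```python
import random

def random_concat_augment(sample_a, sample_b):
    """Hedged sketch of Random Concatenation Augmentation (RCA).

    Each sample is a dict with:
      'feats'  -- list of per-clip features (any objects)
      'moment' -- (start, end) inclusive clip indices of the ground-truth
                  moment for the query attached to sample_a
      'query'  -- the natural language query
    The distractor clips from sample_b are prepended or appended at random,
    and the moment indices of sample_a are shifted accordingly.
    """
    feats_a, (s, e) = sample_a["feats"], sample_a["moment"]
    feats_b = sample_b["feats"]
    if random.random() < 0.5:
        # Prepend distractor clips: the target moment shifts right.
        feats = feats_b + feats_a
        moment = (s + len(feats_b), e + len(feats_b))
    else:
        # Append distractor clips: the target moment keeps its indices.
        feats = feats_a + feats_b
        moment = (s, e)
    return {"feats": feats, "moment": moment, "query": sample_a["query"]}
```

Because the query only matches a sub-segment of the concatenated video, the model can no longer succeed by predicting a query-independent salient span, which is the failure mode RCA is meant to suppress.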
References
Carreira, J., Noland, E., Banki-Horvath, A., Hillier, C., Zisserman, A.: A short note about Kinetics-600. arXiv preprint arXiv:1808.01340 (2018)
Feichtenhofer, C., Fan, H., Malik, J., He, K.: SlowFast networks for video recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 6202–6211 (2019)
Gu, C., et al.: AVA: a video dataset of spatio-temporally localized atomic visual actions. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 6047–6056 (2018)
Kamath, A., Singh, M., LeCun, Y., Misra, I., Synnaeve, G., Carion, N.: MDETR: modulated detection for end-to-end multi-modal understanding. arXiv preprint arXiv:2104.12763 (2021)
Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Tang, Z., et al.: Human-centric spatio-temporal video grounding with visual transformers. IEEE Trans. Circuits Syst. Video Technol. (2021)
Zhang, S., Peng, H., Fu, J., Luo, J.: Learning 2D temporal adjacent networks for moment localization with natural language. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 12870–12877 (2020)
Zhang, Y., Li, Z., Min, Z.: Efficient second-order TreeCRF for neural dependency parsing. In: Proceedings of ACL, pp. 3295–3305 (2020). https://www.aclweb.org/anthology/2020.acl-main.302
Anne Hendricks, L., Wang, O., Shechtman, E., Sivic, J., Darrell, T., Russell, B.: Localizing moments in video with natural language. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 5803–5812 (2017)
Gao, J., Sun, C., Yang, Z., Nevatia, R.: TALL: temporal activity localization via language query. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 5267–5275 (2017)
Ge, R., Gao, J., Chen, K., Nevatia, R.: MAC: mining activity concepts for language-based temporal localization. In: 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 245–253. IEEE (2019)
Nagaraja, V.K., Morariu, V.I., Davis, L.S.: Modeling context between objects for referring expression understanding. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 792–807. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_48
Wu, J., Li, G., Liu, S., Lin, L.: Tree-structured policy based progressive reinforcement learning for temporally language grounding in video. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 12386–12393 (2020)
Yu, L., Poirson, P., Yang, S., Berg, A.C., Berg, T.L.: Modeling context in referring expressions. In: European Conference on Computer Vision, pp. 69–85. Springer (2016)
Mao, J., Huang, J., Toshev, A., Camburu, O., Yuille, A.L., Murphy, K.: Generation and comprehension of unambiguous object descriptions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11–20 (2016)
Hu, R., Rohrbach, M., Andreas, J., Darrell, T., Saenko, K.: Modeling relationships in referential expressions with compositional modular networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1115–1124 (2017)
Yang, S., Li, G., Yu, Y.: Dynamic graph attention for referring expression comprehension. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4644–4653 (2019)
Qiu, H., et al.: Language-aware fine-grained object representation for referring expression comprehension. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 4171–4180 (2020)
Chen, Z., Ma, L., Luo, W., Wong, K.Y.K.: Weakly-supervised spatio-temporally grounding natural sentence in video. arXiv preprint arXiv:1906.02549 (2019)
Su, R., Yu, Q., Xu, D.: STVGBert: a visual-linguistic transformer based framework for spatio-temporal video grounding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1533–1542 (2021)
Wang, Z., Wang, L., Wu, T., Li, T., Wu, G.: Negative sample matters: A renaissance of metric learning for temporal grounding. arXiv preprint arXiv:2109.04872 (2021)
Li, K., Guo, D., Wang, M.: Proposal-free video grounding with contextual pyramid network. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 1902–1910 (2021)
Mun, J., Cho, M., Han, B.: Local-global video-text interactions for temporal grounding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10810–10819 (2020)
Wang, H., Zha, Z.J., Chen, X., Xiong, Z., Luo, J.: Dual path interaction network for video moment localization. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 4116–4124 (2020)
Xiao, S., Chen, L., Zhang, S., Ji, W., Shao, J., Ye, L., Xiao, J.: Boundary proposal network for two-stage natural language video localization. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2986–2994 (2021)
Zhang, H., Sun, A., Jing, W., Zhen, L., Zhou, J.T., Goh, R.S.M.: Natural language video localization: a revisit in span-based question answering framework. IEEE Trans. Pattern Anal. Mach. Intell. (2021)
Acknowledgements
This work was supported partially by the NSFC (U1911401, U1811461, 62076260, 61772570), Guangdong Natural Science Funds Project (2020B1515120085), Guangdong NSF for Distinguished Young Scholar (2022B1515020009), and the Key-Area Research and Development Program of Guangzhou (202007030004).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Tan, C., Hu, JF., Zheng, WS. (2022). Matching and Localizing: A Simple yet Effective Framework for Human-Centric Spatio-Temporal Video Grounding. In: Fang, L., Povey, D., Zhai, G., Mei, T., Wang, R. (eds) Artificial Intelligence. CICAI 2022. Lecture Notes in Computer Science(), vol 13604. Springer, Cham. https://doi.org/10.1007/978-3-031-20497-5_25
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20496-8
Online ISBN: 978-3-031-20497-5