Abstract
Human-Centric Spatio-Temporal Video Grounding (HC-STVG) is a recently emerging task that aims to localize, in both space and time, the target person described by a natural language query. To tackle this task, we propose a simple yet effective two-stage framework based on a matching-and-localizing paradigm. In the first stage, we conduct cross-modal matching between the query and candidate moments to determine the temporal boundaries. Specifically, we develop an Augmented 2D Temporal Adjacent Network (Aug-2D-TAN) as our temporal matching module, which improves 2D-TAN [7] in two aspects: 1) a Temporal-Aware Context Aggregation (TACA) module that jointly aggregates past contexts in the forward direction and future contexts in the backward direction, yielding more discriminative moment representations for cross-modal matching; and 2) a Random Concatenation Augmentation (RCA) mechanism that combats overfitting and reduces the risk of learning a query-independent saliency prior, which can be mistakenly induced by training videos that contain only a single salient event. In the second stage, we use the pretrained MDETR [4] model to associate the language query with candidate bounding boxes, and then apply a query-based denoising procedure to these language-aware boxes to obtain frame-wise predictions for spatial localization. Experiments show that our simple yet effective framework achieves promising performance on the challenging HC-STVG task.
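As an illustration of the Random Concatenation Augmentation (RCA) idea described above, the following minimal sketch concatenates the clip features of a distractor video to a training video and shifts the ground-truth moment accordingly, so the target moment no longer covers the whole (or the only salient part of the) video. All names, the dict layout, and the exact label-adjustment details here are our assumptions for exposition, not the paper's released implementation.

```python
import random

def random_concat_augment(sample_a, sample_b):
    """Hedged sketch of Random Concatenation Augmentation (RCA).

    Each sample is a dict with:
      'feats'  -- list of per-clip features (any objects)
      'moment' -- (start, end) inclusive clip indices of the ground-truth
                  moment for the query attached to sample_a
      'query'  -- the natural language query
    The distractor clips from sample_b are prepended or appended at random,
    and the moment indices of sample_a are shifted accordingly.
    """
    feats_a, (s, e) = sample_a["feats"], sample_a["moment"]
    feats_b = sample_b["feats"]
    if random.random() < 0.5:
        # Prepend distractor clips: the target moment shifts right.
        feats = feats_b + feats_a
        moment = (s + len(feats_b), e + len(feats_b))
    else:
        # Append distractor clips: the target moment keeps its indices.
        feats = feats_a + feats_b
        moment = (s, e)
    return {"feats": feats, "moment": moment, "query": sample_a["query"]}
```

Because the query only matches a sub-segment of the concatenated video, the model can no longer succeed by predicting a query-independent salient span, which is the failure mode RCA is meant to suppress.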
References
Carreira, J., Noland, E., Banki-Horvath, A., Hillier, C., Zisserman, A.: A short note about Kinetics-600. arXiv preprint arXiv:1808.01340 (2018)
Feichtenhofer, C., Fan, H., Malik, J., He, K.: SlowFast networks for video recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 6202–6211 (2019)
Gu, C., et al.: AVA: a video dataset of spatio-temporally localized atomic visual actions. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 6047–6056 (2018)
Kamath, A., Singh, M., LeCun, Y., Misra, I., Synnaeve, G., Carion, N.: MDETR: modulated detection for end-to-end multi-modal understanding. arXiv preprint arXiv:2104.12763 (2021)
Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Tang, Z., et al.: Human-centric spatio-temporal video grounding with visual transformers. IEEE Trans. Circuits Syst. Video Technol. (2021)
Zhang, S., Peng, H., Fu, J., Luo, J.: Learning 2D temporal adjacent networks for moment localization with natural language. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 12870–12877 (2020)
Zhang, Y., Li, Z., Min, Z.: Efficient second-order TreeCRF for neural dependency parsing. In: Proceedings of ACL, pp. 3295–3305 (2020). https://www.aclweb.org/anthology/2020.acl-main.302
Anne Hendricks, L., Wang, O., Shechtman, E., Sivic, J., Darrell, T., Russell, B.: Localizing moments in video with natural language. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 5803–5812 (2017)
Gao, J., Sun, C., Yang, Z., Nevatia, R.: TALL: temporal activity localization via language query. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 5267–5275 (2017)
Ge, R., Gao, J., Chen, K., Nevatia, R.: MAC: mining activity concepts for language-based temporal localization. In: 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 245–253. IEEE (2019)
Nagaraja, V.K., Morariu, V.I., Davis, L.S.: Modeling context between objects for referring expression understanding. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 792–807. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_48
Wu, J., Li, G., Liu, S., Lin, L.: Tree-structured policy based progressive reinforcement learning for temporally language grounding in video. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 12386–12393 (2020)
Yu, L., Poirson, P., Yang, S., Berg, A.C., Berg, T.L.: Modeling context in referring expressions. In: European Conference on Computer Vision, pp. 69–85. Springer (2016)
Mao, J., Huang, J., Toshev, A., Camburu, O., Yuille, A.L., Murphy, K.: Generation and comprehension of unambiguous object descriptions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11–20 (2016)
Hu, R., Rohrbach, M., Andreas, J., Darrell, T., Saenko, K.: Modeling relationships in referential expressions with compositional modular networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1115–1124 (2017)
Yang, S., Li, G., Yu, Y.: Dynamic graph attention for referring expression comprehension. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4644–4653 (2019)
Qiu, H., et al.: Language-aware fine-grained object representation for referring expression comprehension. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 4171–4180 (2020)
Chen, Z., Ma, L., Luo, W., Wong, K.Y.K.: Weakly-supervised spatio-temporally grounding natural sentence in video. arXiv preprint arXiv:1906.02549 (2019)
Su, R., Yu, Q., Xu, D.: STVGBert: a visual-linguistic transformer based framework for spatio-temporal video grounding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1533–1542 (2021)
Wang, Z., Wang, L., Wu, T., Li, T., Wu, G.: Negative sample matters: A renaissance of metric learning for temporal grounding. arXiv preprint arXiv:2109.04872 (2021)
Li, K., Guo, D., Wang, M.: Proposal-free video grounding with contextual pyramid network. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 1902–1910 (2021)
Mun, J., Cho, M., Han, B.: Local-global video-text interactions for temporal grounding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10810–10819 (2020)
Wang, H., Zha, Z.J., Chen, X., Xiong, Z., Luo, J.: Dual path interaction network for video moment localization. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 4116–4124 (2020)
Xiao, S., Chen, L., Zhang, S., Ji, W., Shao, J., Ye, L., Xiao, J.: Boundary proposal network for two-stage natural language video localization. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2986–2994 (2021)
Zhang, H., Sun, A., Jing, W., Zhen, L., Zhou, J.T., Goh, R.S.M.: Natural language video localization: a revisit in span-based question answering framework. IEEE Trans. Pattern Anal. Mach. Intell. (2021)
Acknowledgements
This work was supported partially by the NSFC (U1911401, U1811461, 62076260, 61772570), Guangdong Natural Science Funds Project (2020B1515120085), Guangdong NSF for Distinguished Young Scholar (2022B1515020009), and the Key-Area Research and Development Program of Guangzhou (202007030004).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Tan, C., Hu, JF., Zheng, WS. (2022). Matching and Localizing: A Simple yet Effective Framework for Human-Centric Spatio-Temporal Video Grounding. In: Fang, L., Povey, D., Zhai, G., Mei, T., Wang, R. (eds) Artificial Intelligence. CICAI 2022. Lecture Notes in Computer Science(), vol 13604. Springer, Cham. https://doi.org/10.1007/978-3-031-20497-5_25
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20496-8
Online ISBN: 978-3-031-20497-5