
Matching and Localizing: A Simple yet Effective Framework for Human-Centric Spatio-Temporal Video Grounding

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13604)

Abstract

Human-Centric Spatio-Temporal Video Grounding (HC-STVG) is a recently emerging task that aims to localize, in both space and time, the target person described by a natural language query. To tackle this task, we propose a simple yet effective two-stage framework based on a Matching and Localizing paradigm. In the first stage, we conduct cross-modal matching between the query and candidate moments to determine the temporal boundaries. Specifically, we develop an Augmented 2D Temporal Adjacent Network (Aug-2D-TAN) as our temporal matching module, which improves 2D-TAN [7] in two respects: (1) a Temporal-Aware Context Aggregation (TACA) module that jointly aggregates past contexts in the forward direction and future contexts in the backward direction, helping to learn more discriminative moment representations for cross-modal matching; and (2) a Random Concatenation Augmentation (RCA) mechanism that combats overfitting and reduces the risk of learning a query-independent saliency prior, which training videos containing only a single salient event would otherwise mistakenly induce. In the second stage, we utilize the pretrained MDETR [4] model to associate the language query with candidate bounding boxes, and then apply a query-based denoising procedure to the language-aware boxes to obtain frame-wise predictions for spatial localization. Experiments show that our simple yet effective framework achieves promising performance on the challenging HC-STVG task.
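To make the TACA idea concrete, below is a minimal sketch of bidirectional context aggregation over clip-level features: past context is accumulated in the forward direction, future context in the backward direction, and both are fused with the original features. The GRU encoders and the linear fusion layer are illustrative assumptions on our part; the abstract does not specify the aggregation architecture.

```python
# Minimal sketch of Temporal-Aware Context Aggregation (TACA).
# Assumptions (not from the paper): GRU encoders and a linear fusion layer.
import torch
import torch.nn as nn

class TACASketch(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.fwd = nn.GRU(dim, dim, batch_first=True)   # aggregates past context
        self.bwd = nn.GRU(dim, dim, batch_first=True)   # aggregates future context
        self.fuse = nn.Linear(3 * dim, dim)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, num_clips, dim) clip-level video features
        past, _ = self.fwd(clips)                        # forward pass over time
        future, _ = self.bwd(clips.flip(1))              # backward pass over reversed time
        future = future.flip(1)                          # re-align to original clip order
        return self.fuse(torch.cat([clips, past, future], dim=-1))

# Usage: context-aware clip features for 2D-TAN-style moment matching
feats = torch.randn(2, 16, 256)       # 2 videos, 16 clips, 256-d features
out = TACASketch(256)(feats)          # same shape, context-enriched
```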
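RCA can likewise be sketched as a data-augmentation step: concatenating the clips of two training videos makes the annotated moment no longer the only salient event, so the matcher must actually read the query. The 50/50 ordering choice and the tuple-based sample layout below are assumptions for illustration, not the authors' exact recipe.

```python
# Minimal sketch of Random Concatenation Augmentation (RCA).
# Assumptions (not from the paper): sample format and 50/50 ordering.
import random
import torch

def rca(sample, distractor):
    """sample/distractor: (clips, query, (start, end)), clips of shape (T, D)."""
    clips, query, (start, end) = sample
    d_clips, _, _ = distractor
    if random.random() < 0.5:
        # annotated video placed first: ground-truth boundaries unchanged
        new_clips = torch.cat([clips, d_clips], dim=0)
        new_span = (start, end)
    else:
        # annotated video placed second: boundaries shift by distractor length
        offset = d_clips.shape[0]
        new_clips = torch.cat([d_clips, clips], dim=0)
        new_span = (start + offset, end + offset)
    return new_clips, query, new_span
```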
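The second-stage denoising can be pictured as follows: keep each frame's highest-scoring language-aware box from MDETR, then correct frames whose box jumps far from the temporally smoothed track. The median-filter smoothing and the center-distance outlier test are our assumptions; the paper states only that a query-based denoising procedure is applied to the language-aware boxes.

```python
# Minimal sketch of query-based denoising over per-frame detections.
# Assumptions (not from the paper): median smoothing and center-distance test.
import numpy as np

def denoise_boxes(boxes, scores, window=5, max_jump=0.2):
    """boxes: (T, K, 4) normalized xyxy; scores: (T, K) query-box match scores."""
    best = boxes[np.arange(len(boxes)), scores.argmax(axis=1)]   # (T, 4) top box per frame
    centers = (best[:, :2] + best[:, 2:]) / 2                    # (T, 2) box centers
    smooth = np.stack([
        np.median(centers[max(0, t - window): t + window + 1], axis=0)
        for t in range(len(centers))
    ])
    jump = np.linalg.norm(centers - smooth, axis=1)              # deviation from track
    out = best.copy()
    inliers = np.where(jump <= max_jump)[0]
    for t in np.where(jump > max_jump)[0]:
        if len(inliers) > 0:
            # replace an outlier frame with the box of the nearest inlier frame
            out[t] = best[inliers[np.argmin(np.abs(inliers - t))]]
    return out
```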


References

  1. Carreira, J., Noland, E., Banki-Horvath, A., Hillier, C., Zisserman, A.: A short note about Kinetics-600. arXiv preprint arXiv:1808.01340 (2018)

  2. Feichtenhofer, C., Fan, H., Malik, J., He, K.: SlowFast networks for video recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 6202–6211 (2019)

  3. Gu, C., et al.: AVA: a video dataset of spatio-temporally localized atomic visual actions. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 6047–6056 (2018)

  4. Kamath, A., Singh, M., LeCun, Y., Misra, I., Synnaeve, G., Carion, N.: MDETR - modulated detection for end-to-end multi-modal understanding. arXiv preprint arXiv:2104.12763 (2021)

  5. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

  6. Tang, Z., et al.: Human-centric spatio-temporal video grounding with visual transformers. IEEE Trans. Circuits Syst. Video Technol. (2021)

  7. Zhang, S., Peng, H., Fu, J., Luo, J.: Learning 2D temporal adjacent networks for moment localization with natural language. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 12870–12877 (2020)

  8. Zhang, Y., Li, Z., Min, Z.: Efficient second-order TreeCRF for neural dependency parsing. In: Proceedings of ACL, pp. 3295–3305 (2020). https://www.aclweb.org/anthology/2020.acl-main.302

  9. Anne Hendricks, L., Wang, O., Shechtman, E., Sivic, J., Darrell, T., Russell, B.: Localizing moments in video with natural language. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 5803–5812 (2017)

  10. Gao, J., Sun, C., Yang, Z., Nevatia, R.: TALL: temporal activity localization via language query. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 5267–5275 (2017)

  11. Ge, R., Gao, J., Chen, K., Nevatia, R.: MAC: mining activity concepts for language-based temporal localization. In: IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 245–253 (2019)

  12. Nagaraja, V.K., Morariu, V.I., Davis, L.S.: Modeling context between objects for referring expression understanding. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 792–807. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_48

  13. Wu, J., Li, G., Liu, S., Lin, L.: Tree-structured policy based progressive reinforcement learning for temporally language grounding in video. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 12386–12393 (2020)

  14. Yu, L., Poirson, P., Yang, S., Berg, A.C., Berg, T.L.: Modeling context in referring expressions. In: European Conference on Computer Vision, pp. 69–85. Springer (2016)

  15. Mao, J., Huang, J., Toshev, A., Camburu, O., Yuille, A.L., Murphy, K.: Generation and comprehension of unambiguous object descriptions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11–20 (2016)

  16. Hu, R., Rohrbach, M., Andreas, J., Darrell, T., Saenko, K.: Modeling relationships in referential expressions with compositional modular networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1115–1124 (2017)

  17. Yang, S., Li, G., Yu, Y.: Dynamic graph attention for referring expression comprehension. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4644–4653 (2019)

  18. Qiu, H., et al.: Language-aware fine-grained object representation for referring expression comprehension. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 4171–4180 (2020)

  19. Chen, Z., Ma, L., Luo, W., Wong, K.Y.K.: Weakly-supervised spatio-temporally grounding natural sentence in video. arXiv preprint arXiv:1906.02549 (2019)

  20. Su, R., Yu, Q., Xu, D.: STVGBert: a visual-linguistic transformer based framework for spatio-temporal video grounding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1533–1542 (2021)

  21. Wang, Z., Wang, L., Wu, T., Li, T., Wu, G.: Negative sample matters: a renaissance of metric learning for temporal grounding. arXiv preprint arXiv:2109.04872 (2021)

  22. Li, K., Guo, D., Wang, M.: Proposal-free video grounding with contextual pyramid network. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 1902–1910 (2021)

  23. Mun, J., Cho, M., Han, B.: Local-global video-text interactions for temporal grounding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10810–10819 (2020)

  24. Wang, H., Zha, Z.J., Chen, X., Xiong, Z., Luo, J.: Dual path interaction network for video moment localization. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 4116–4124 (2020)

  25. Xiao, S., Chen, L., Zhang, S., Ji, W., Shao, J., Ye, L., Xiao, J.: Boundary proposal network for two-stage natural language video localization. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 2986–2994 (2021)

  26. Zhang, H., Sun, A., Jing, W., Zhen, L., Zhou, J.T., Goh, R.S.M.: Natural language video localization: a revisit in span-based question answering framework. IEEE Trans. Pattern Anal. Mach. Intell. (2021)


Acknowledgements

This work was supported partially by the NSFC (U1911401, U1811461, 62076260, 61772570), Guangdong Natural Science Funds Project (2020B1515120085), Guangdong NSF for Distinguished Young Scholar (2022B1515020009), and the Key-Area Research and Development Program of Guangzhou (202007030004).

Author information

Corresponding author

Correspondence to Wei-Shi Zheng.


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Tan, C., Hu, J.-F., Zheng, W.-S. (2022). Matching and Localizing: A Simple yet Effective Framework for Human-Centric Spatio-Temporal Video Grounding. In: Fang, L., Povey, D., Zhai, G., Mei, T., Wang, R. (eds.) Artificial Intelligence. CICAI 2022. Lecture Notes in Computer Science, vol. 13604. Springer, Cham. https://doi.org/10.1007/978-3-031-20497-5_25


  • DOI: https://doi.org/10.1007/978-3-031-20497-5_25

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20496-8

  • Online ISBN: 978-3-031-20497-5

  • eBook Packages: Computer Science (R0)
