CAMG: Context-Aware Moment Graph Network for Multimodal Temporal Activity Localization via Language

  • Conference paper

Natural Language Processing and Chinese Computing (NLPCC 2023)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14302)

Abstract

Temporal Activity Localization via Language (TALL) is a challenging task in language-based video understanding, especially when a video contains multiple moments of interest and the language query describes complex context dependencies between those moments. Recent studies have proposed various ways to exploit the temporal context of adjacent moments, but two apparent limitations remain. First, only limited context information is encoded, using RNNs or 2-D convolutions that depend heavily on the pre-sorting of proposals and lack flexibility. Second, semantically correlated content across different moments, i.e., semantic context, is ignored. To address these limitations, we propose a novel GCN-based framework, the Context-Aware Moment Graph (CAMG) network, which jointly models temporal context and semantic context. We also design a multi-step fusion scheme to aggregate object, motion, and textual features, together with a Query-Gated Integration Module that selects queried objects and filters out noisy ones. Our model outperforms state-of-the-art methods on two widely used benchmark datasets.
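
The implementation itself is not reproduced on this page, so the sketch below is only a rough, hypothetical rendering of the two mechanisms the abstract names: a query-gated step that weights detected-object features by the sentence query, and a single graph-convolution layer over moment proposals whose adjacency mixes temporal overlap (IoU of proposal spans) with feature similarity as a stand-in for semantic context. Every class name, dimension, and formula here (QueryGatedIntegration, MomentGraphConv, the alpha mixing weight) is an assumption for illustration, not the authors' code.

```python
# Minimal PyTorch sketch of the two mechanisms named in the abstract.
# All class names, dimensions, and formulas below are illustrative
# assumptions, NOT the CAMG authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class QueryGatedIntegration(nn.Module):
    """Weight per-moment object features by the sentence query (assumed form)."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, 1)

    def forward(self, obj_feats: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # obj_feats: (M, K, D) -- M moments, K detected objects, D channels
        # query:     (D,)      -- pooled sentence embedding
        q = query.expand(obj_feats.size(0), obj_feats.size(1), -1)
        gates = torch.sigmoid(self.gate(torch.cat([obj_feats, q], dim=-1)))
        # Softly select queried objects, suppress noisy ones, pool per moment.
        return (gates * obj_feats).sum(dim=1)  # (M, D)


class MomentGraphConv(nn.Module):
    """One GCN layer over moment proposals with temporal + semantic edges."""

    def __init__(self, dim: int, alpha: float = 0.5):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.alpha = alpha  # assumed balance between the two context types

    def forward(self, feats: torch.Tensor, spans: torch.Tensor) -> torch.Tensor:
        # feats: (M, D) fused moment features; spans: (M, 2) start/end times
        # Temporal context: edge weight = IoU between proposal spans.
        start = torch.maximum(spans[:, None, 0], spans[None, :, 0])
        end = torch.minimum(spans[:, None, 1], spans[None, :, 1])
        inter = (end - start).clamp(min=0)
        length = spans[:, 1] - spans[:, 0]
        union = length[:, None] + length[None, :] - inter
        a_temporal = inter / union.clamp(min=1e-6)
        # Semantic context: edge weight = cosine similarity of moment features.
        normed = F.normalize(feats, dim=-1)
        a_semantic = (normed @ normed.t()).clamp(min=0)
        # Mix the two adjacencies, row-normalize, and propagate with a residual.
        adj = self.alpha * a_temporal + (1 - self.alpha) * a_semantic
        adj = adj / adj.sum(dim=-1, keepdim=True).clamp(min=1e-6)
        return F.relu(self.proj(adj @ feats) + feats)


# Toy usage: 16 proposals, 8 objects each, 256-d features.
moments, objects, dim = 16, 8, 256
obj = torch.randn(moments, objects, dim)
query = torch.randn(dim)
spans = torch.sort(torch.rand(moments, 2), dim=-1).values
fused = QueryGatedIntegration(dim)(obj, query)  # (16, 256)
context = MomentGraphConv(dim)(fused, spans)    # (16, 256)
```

In the full model, several such layers would presumably be stacked and the node features fed to a localization head; this sketch stops at a single propagation step.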


Acknowledgment

This work was supported by National Science and Technology Innovation 2030 - Major Project (No. 2021ZD0114001; No. 2021ZD0114000), National Natural Science Foundation of China (No. 61976057; No. 62172101), and the Science and Technology Commission of Shanghai Municipality (No. 21511101000; No. 22DZ1100101).

Author information

Corresponding author

Correspondence to Yuejie Zhang.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 5610 KB)

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Hu, Y. et al. (2023). CAMG: Context-Aware Moment Graph Network for Multimodal Temporal Activity Localization via Language. In: Liu, F., Duan, N., Xu, Q., Hong, Y. (eds.) Natural Language Processing and Chinese Computing. NLPCC 2023. Lecture Notes in Computer Science, vol. 14302. Springer, Cham. https://doi.org/10.1007/978-3-031-44693-1_34

  • DOI: https://doi.org/10.1007/978-3-031-44693-1_34

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-44692-4

  • Online ISBN: 978-3-031-44693-1

  • eBook Packages: Computer Science, Computer Science (R0)
