Abstract
Temporal Activity Localization via Language (TALL) is a challenging task in language-based video understanding, especially when a video contains multiple moments of interest and the language query describes complex contextual dependencies between them. Recent studies have proposed various ways to exploit the temporal context of adjacent moments, but two notable limitations remain. First, they encode only limited context information with RNNs or 2-D convolutions, which depend heavily on the pre-sorting of proposals and lack flexibility. Second, they ignore semantically correlated content across different moments, i.e., semantic context. To address these limitations, we propose a novel GCN-based framework, the Context-Aware Moment Graph (CAMG) network, which jointly models temporal context and semantic context. We also design a multi-step fusion scheme to aggregate object, motion, and textual features, together with a Query-Gated Integration Module that selects queried objects and filters out noisy ones. Our model outperforms state-of-the-art methods on two widely used benchmark datasets.
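To make the abstract's ideas concrete, the snippet below is a minimal, hypothetical PyTorch sketch, not the authors' released implementation. It assumes a `QueryGatedIntegration` module that gates per-moment object features by query relevance before fusing them with motion features, and a `MomentGraphLayer` that performs one round of message passing over a moment graph whose temporal edges come from span overlap (so no proposal pre-sorting is needed) and whose semantic edges link feature-space nearest neighbors. All module names, dimensions, and wiring are illustrative assumptions.

```python
# Minimal sketch of a context-aware moment graph; all design details are
# assumptions for illustration, not the paper's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class QueryGatedIntegration(nn.Module):
    """Gate per-moment object features by their relevance to the query
    (hypothetical stand-in for the paper's Query-Gated Integration Module)."""

    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, 1)

    def forward(self, obj_feats, query_feat):
        # obj_feats: (N, K, D) object features per moment; query_feat: (D,)
        q = query_feat.expand(obj_feats.size(0), obj_feats.size(1), -1)
        g = torch.sigmoid(self.gate(torch.cat([obj_feats, q], dim=-1)))  # (N, K, 1)
        return (g * obj_feats).sum(dim=1)  # gated pooling over objects -> (N, D)


class MomentGraphLayer(nn.Module):
    """One round of message passing over temporal + semantic moment edges."""

    def __init__(self, dim, knn=4):
        super().__init__()
        self.knn = knn
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, feats, spans):
        # feats: (N, D) fused moment features; spans: (N, 2) [start, end] seconds.
        # Temporal edges: pairwise IoU between spans (order-independent).
        s, e = spans[:, 0], spans[:, 1]
        inter = (torch.min(e[:, None], e[None, :])
                 - torch.max(s[:, None], s[None, :])).clamp(min=0)
        union = (torch.max(e[:, None], e[None, :])
                 - torch.min(s[:, None], s[None, :])).clamp(min=1e-6)
        a_temporal = inter / union                                  # (N, N)
        # Semantic edges: k nearest neighbors in feature space.
        f = F.normalize(feats, dim=-1)
        topk = (f @ f.t()).topk(self.knn, dim=-1).indices
        a_semantic = torch.zeros(len(feats), len(feats)).scatter_(1, topk, 1.0)
        adj = a_temporal + a_semantic
        adj = adj / adj.sum(dim=-1, keepdim=True)                   # row-normalize
        msg = adj @ feats                                           # aggregate neighbors
        return F.relu(self.proj(torch.cat([feats, msg], dim=-1)))


if __name__ == "__main__":
    N, K, D = 16, 5, 256
    obj = torch.randn(N, K, D)    # per-moment object features (e.g., a detector)
    motion = torch.randn(N, D)    # per-moment motion features (e.g., C3D)
    query = torch.randn(D)        # sentence embedding of the language query
    fused = motion + QueryGatedIntegration(D)(obj, query)  # simple fusion stand-in
    spans = torch.sort(torch.rand(N, 2) * 30, dim=-1).values
    out = MomentGraphLayer(D)(fused, spans)
    print(out.shape)  # torch.Size([16, 256])
```

Because the temporal adjacency is built from span overlap rather than a sorted proposal sequence, the layer handles arbitrarily ordered, overlapping moment candidates, which is the flexibility the abstract contrasts against RNN- and 2-D-convolution-based context modeling.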
Acknowledgment
This work was supported by National Science and Technology Innovation 2030 - Major Project (No. 2021ZD0114001; No. 2021ZD0114000), National Natural Science Foundation of China (No. 61976057; No. 62172101), and the Science and Technology Commission of Shanghai Municipality (No. 21511101000; No. 22DZ1100101).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Hu, Y., et al. (2023). CAMG: Context-Aware Moment Graph Network for Multimodal Temporal Activity Localization via Language. In: Liu, F., Duan, N., Xu, Q., Hong, Y. (eds.) Natural Language Processing and Chinese Computing. NLPCC 2023. Lecture Notes in Computer Science, vol. 14302. Springer, Cham. https://doi.org/10.1007/978-3-031-44693-1_34
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-44692-4
Online ISBN: 978-3-031-44693-1
eBook Packages: Computer Science, Computer Science (R0)