CAMG: Context-Aware Moment Graph Network for Multimodal Temporal Activity Localization via Language

  • Conference paper
Natural Language Processing and Chinese Computing (NLPCC 2023)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14302)


Temporal Activity Localization via Language (TALL) is a challenging task in language-based video understanding, especially when a video contains multiple moments of interest and the language query describes complex contextual dependencies between those moments. Recent studies have proposed various ways to exploit the temporal context of adjacent moments, but two limitations remain. First, only limited context information is encoded by RNNs or 2-D convolutions, which depend heavily on the pre-sorting of proposals and lack flexibility. Second, semantic context, i.e., semantically correlated content in different moments, is ignored. To address these limitations, we propose a novel GCN-based framework, the Context-Aware Moment Graph (CAMG) network, which jointly models temporal context and semantic context. We also design a multi-step fusion scheme to aggregate object, motion, and textual features, and a Query-Gated Integration Module to select the queried objects and filter out noisy ones. Our model outperforms state-of-the-art methods on two widely used benchmark datasets.
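The abstract does not give implementation details, so the following is only a minimal illustrative sketch of the two ideas it names, not the CAMG architecture itself. It assumes that temporal context can be approximated by linking moments whose segments overlap (temporal IoU), that semantic context can be approximated by linking moments with similar features (cosine similarity), and that a query gate can be approximated by a sigmoid over a dot product. All function names, thresholds, and the weight-free averaging step are hypothetical choices made for illustration.

```python
import math


def temporal_iou(a, b):
    """Temporal IoU of two (start, end) segments; 0.0 if they do not overlap."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    span = max(a[1], b[1]) - min(a[0], b[0])  # equals the union when they overlap
    return inter / span if inter > 0 else 0.0


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def build_adjacency(segments, feats, iou_thresh=0.1, sim_thresh=0.8):
    """Link moments by a temporal edge (overlap) OR a semantic edge (similarity).

    Thresholds are assumed values, not taken from the paper.
    """
    n = len(segments)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        adj[i][i] = 1.0  # self-loop so each moment keeps its own feature
        for j in range(n):
            if i == j:
                continue
            temporal = temporal_iou(segments[i], segments[j]) > iou_thresh
            semantic = cosine(feats[i], feats[j]) > sim_thresh
            if temporal or semantic:
                adj[i][j] = 1.0
    return adj


def graph_aggregate(adj, feats):
    """One round of row-normalized neighbor averaging.

    This is a graph-convolution step without learned weights or nonlinearity.
    """
    dim = len(feats[0])
    out = []
    for row in adj:
        deg = sum(row)  # at least 1.0 because of the self-loop
        out.append([sum(row[j] * feats[j][d] for j in range(len(feats))) / deg
                    for d in range(dim)])
    return out


def query_gate(object_feats, query_vec):
    """Scale each object feature by sigmoid(object . query) (assumed gate form)."""
    gated = []
    for obj in object_feats:
        score = sum(x * q for x, q in zip(obj, query_vec))
        gate = 1.0 / (1.0 + math.exp(-score))
        gated.append([gate * x for x in obj])
    return gated


# Three candidate moments: 0 and 1 overlap in time; 0 and 2 are far apart
# in time but have similar content.
segments = [(0.0, 10.0), (5.0, 15.0), (40.0, 50.0)]
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]]
adj = build_adjacency(segments, feats)
context_feats = graph_aggregate(adj, feats)
```

In this toy example, moment 0 is connected to moment 1 through a temporal edge and to moment 2 through a semantic edge, while moments 1 and 2 remain unconnected. Because edges come from a graph rather than a fixed ordering, the aggregation does not depend on how the proposals are sorted, which is the flexibility the abstract contrasts with RNNs and 2-D convolutions.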

This work was supported by National Science and Technology Innovation 2030 - Major Project (No. 2021ZD0114001; No. 2021ZD0114000), National Natural Science Foundation of China (No. 61976057; No. 62172101), and the Science and Technology Commission of Shanghai Municipality (No. 21511101000; No. 22DZ1100101).

