Abstract
Temporal Activity Localization via Language (TALL) is a challenging task in language-based video understanding, especially when a video contains multiple moments of interest and the language query describes complex contextual dependencies between them. Recent studies have proposed various ways to exploit the temporal context of adjacent moments, but two notable limitations remain. First, they encode only limited context information with RNNs or 2-D convolutions, which depend heavily on the pre-sorting of proposals and lack flexibility. Second, they ignore semantically correlated content across different moments, i.e., semantic context. To address these limitations, we propose a novel GCN-based framework, the Context-Aware Moment Graph (CAMG) network, which jointly models temporal context and semantic context. We also design a multi-step fusion scheme to aggregate object, motion, and textual features, together with a Query-Gated Integration Module that selects queried objects and filters out noisy ones. Our model outperforms state-of-the-art methods on two widely used benchmark datasets.
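To make the abstract's ideas concrete, the snippet below is a minimal, hypothetical PyTorch sketch, not the authors' released implementation. It assumes a `QueryGatedIntegration` module that gates per-moment object features by query relevance before fusing them with motion features, and a `MomentGraphLayer` that performs one round of message passing over a moment graph whose temporal edges come from span overlap (so no proposal pre-sorting is needed) and whose semantic edges link feature-space nearest neighbors. All module names, dimensions, and wiring are illustrative assumptions.

```python
# Minimal sketch of a context-aware moment graph; all design details are
# assumptions for illustration, not the paper's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class QueryGatedIntegration(nn.Module):
    """Gate per-moment object features by their relevance to the query
    (hypothetical stand-in for the paper's Query-Gated Integration Module)."""

    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, 1)

    def forward(self, obj_feats, query_feat):
        # obj_feats: (N, K, D) object features per moment; query_feat: (D,)
        q = query_feat.expand(obj_feats.size(0), obj_feats.size(1), -1)
        g = torch.sigmoid(self.gate(torch.cat([obj_feats, q], dim=-1)))  # (N, K, 1)
        return (g * obj_feats).sum(dim=1)  # gated pooling over objects -> (N, D)


class MomentGraphLayer(nn.Module):
    """One round of message passing over temporal + semantic moment edges."""

    def __init__(self, dim, knn=4):
        super().__init__()
        self.knn = knn
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, feats, spans):
        # feats: (N, D) fused moment features; spans: (N, 2) [start, end] seconds.
        # Temporal edges: pairwise IoU between spans (order-independent).
        s, e = spans[:, 0], spans[:, 1]
        inter = (torch.min(e[:, None], e[None, :])
                 - torch.max(s[:, None], s[None, :])).clamp(min=0)
        union = (torch.max(e[:, None], e[None, :])
                 - torch.min(s[:, None], s[None, :])).clamp(min=1e-6)
        a_temporal = inter / union                                  # (N, N)
        # Semantic edges: k nearest neighbors in feature space.
        f = F.normalize(feats, dim=-1)
        topk = (f @ f.t()).topk(self.knn, dim=-1).indices
        a_semantic = torch.zeros(len(feats), len(feats)).scatter_(1, topk, 1.0)
        adj = a_temporal + a_semantic
        adj = adj / adj.sum(dim=-1, keepdim=True)                   # row-normalize
        msg = adj @ feats                                           # aggregate neighbors
        return F.relu(self.proj(torch.cat([feats, msg], dim=-1)))


if __name__ == "__main__":
    N, K, D = 16, 5, 256
    obj = torch.randn(N, K, D)    # per-moment object features (e.g., a detector)
    motion = torch.randn(N, D)    # per-moment motion features (e.g., C3D)
    query = torch.randn(D)        # sentence embedding of the language query
    fused = motion + QueryGatedIntegration(D)(obj, query)  # simple fusion stand-in
    spans = torch.sort(torch.rand(N, 2) * 30, dim=-1).values
    out = MomentGraphLayer(D)(fused, spans)
    print(out.shape)  # torch.Size([16, 256])
```

Because the temporal adjacency is built from span overlap rather than a sorted proposal sequence, the layer handles arbitrarily ordered, overlapping moment candidates, which is the flexibility the abstract contrasts against RNN- and 2-D-convolution-based context modeling.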
Acknowledgment
This work was supported by National Science and Technology Innovation 2030 - Major Project (No. 2021ZD0114001; No. 2021ZD0114000), National Natural Science Foundation of China (No. 61976057; No. 62172101), and the Science and Technology Commission of Shanghai Municipality (No. 21511101000; No. 22DZ1100101).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Hu, Y., et al. (2023). CAMG: Context-Aware Moment Graph Network for Multimodal Temporal Activity Localization via Language. In: Liu, F., Duan, N., Xu, Q., Hong, Y. (eds.) Natural Language Processing and Chinese Computing. NLPCC 2023. Lecture Notes in Computer Science, vol. 14302. Springer, Cham. https://doi.org/10.1007/978-3-031-44693-1_34
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-44692-4
Online ISBN: 978-3-031-44693-1
eBook Packages: Computer Science, Computer Science (R0)