Abstract
Video relation grounding has attracted growing attention in video understanding and multimodal learning. Although recent years have seen remarkable progress on this task, it remains challenging due to multi-instance confusion and complex temporal reasoning. In this paper, we propose a novel Asymmetric Relation Consistency (ARC) reasoning model for video relation grounding. To overcome the multi-instance confusion problem, we introduce an asymmetric relation reasoning method and a novel relation consistency loss that enforce consistent relationships across multiple instances. To precisely localize relation instances in their temporal context, we further propose a transformer-based relation reasoning module. Our model is trained in a weakly-supervised manner. Evaluated on a challenging video relation dataset, our method outperforms state-of-the-art methods by a large margin, and extensive ablation studies confirm the effectiveness of each proposed component.
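To give intuition for the consistency idea described above, the following is a minimal, hypothetical sketch of a relation consistency penalty: each candidate subject-object instance pair produces a distribution over relation predicates, and disagreement among these distributions is penalized via each instance's KL divergence to their mean. This is an illustrative assumption, not the authors' actual ARC loss; the function names and formulation are invented for exposition.

```python
import math

def kl_div(p, q, eps=1e-8):
    """KL(p || q) between two discrete probability distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def relation_consistency_loss(instance_preds):
    """Hypothetical consistency penalty: average KL divergence of each
    per-instance relation distribution to the mean distribution.
    Zero when all instances agree; grows as predictions diverge."""
    n = len(instance_preds)
    k = len(instance_preds[0])
    mean = [sum(p[j] for p in instance_preds) / n for j in range(k)]
    return sum(kl_div(p, mean) for p in instance_preds) / n
```

Under this toy formulation, two instances that predict the same predicate distribution incur no penalty, while contradictory predictions are pushed back toward agreement.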
Acknowledgement
This research was supported by grants from the Key Research and Development Program of China (No. 2018AAA0102501) and the National Natural Science Foundation of China (No. 61876149 and No. 62088102).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Li, H., Wei, P., Li, J., Ma, Z., Shang, J., Zheng, N. (2022). Asymmetric Relation Consistency Reasoning for Video Relation Grounding. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13695. Springer, Cham. https://doi.org/10.1007/978-3-031-19833-5_8
Print ISBN: 978-3-031-19832-8
Online ISBN: 978-3-031-19833-5