Abstract
Video scene graph generation (VidSGG) aims to predict visual relation triplets in videos and is a key step toward deeper comprehension of video scenes. To improve applicability in real-world scenarios, recent VidSGG frameworks have explored open-vocabulary settings. However, their heavy reliance on the aligned visual and textual representations of pre-trained vision-language models (VLMs) limits in-depth understanding of visual relationships, because such models fail to capture compositional scene relationships. To address this, we propose a novel open-vocabulary VidSGG framework named semantic-unified cross-modal learning (SUCML), which leverages the capabilities of large language models (LLMs) in visual understanding and semantic reasoning to achieve robust visual relation prediction. Specifically, we incorporate rich knowledge and open-vocabulary capabilities into the framework and design a cross-modal adapter that drives semantic-unified cross-modal representation learning. We then exploit the semantic understanding and reasoning abilities of LLMs by feeding predefined textual instructions, together with the learned semantic-unified visual and text tokens, into a pre-trained LLM for robust relation prediction. Extensive experiments on two public datasets demonstrate that SUCML significantly outperforms existing methods, showcasing the promising potential of LLMs for high-level semantic reasoning and scene understanding.
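To make the described pipeline concrete, the sketch below illustrates one plausible way the components in the abstract could fit together: a cross-modal adapter that projects visual and text features into a shared (semantic-unified) token space, and a predictor that prefixes those unified tokens to embedded instruction tokens before passing the sequence to a pre-trained LLM. This is a minimal PyTorch illustration, not the authors' implementation; the module names (CrossModalAdapter, RelationPredictor), dimensions, cross-attention fusion design, and the stand-in Transformer used in place of a real frozen LLM are all assumptions.

# Minimal sketch of the pipeline described in the abstract (not the authors'
# released code). Module names, dimensions, and the stand-in "LLM" are
# illustrative assumptions.
import torch
import torch.nn as nn


class CrossModalAdapter(nn.Module):
    """Projects visual and text features into a shared token space so that
    both modalities can be consumed by a pre-trained language model."""

    def __init__(self, vis_dim: int, txt_dim: int, llm_dim: int, num_heads: int = 8):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, llm_dim)   # visual features -> LLM space
        self.txt_proj = nn.Linear(txt_dim, llm_dim)   # text features   -> LLM space
        # Cross-attention lets text queries gather relevant visual evidence,
        # producing "semantic-unified" tokens in a single representation space.
        self.cross_attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(llm_dim, 4 * llm_dim), nn.GELU(),
                                 nn.Linear(4 * llm_dim, llm_dim))
        self.norm = nn.LayerNorm(llm_dim)

    def forward(self, vis_feats: torch.Tensor, txt_feats: torch.Tensor) -> torch.Tensor:
        v = self.vis_proj(vis_feats)                   # (B, Nv, llm_dim)
        t = self.txt_proj(txt_feats)                   # (B, Nt, llm_dim)
        fused, _ = self.cross_attn(query=t, key=v, value=v)
        fused = self.norm(t + fused)
        return self.norm(fused + self.ffn(fused))      # (B, Nt, llm_dim)


class RelationPredictor(nn.Module):
    """Prefixes the unified tokens to embedded instruction tokens and feeds
    the combined sequence to a (stand-in) pre-trained LLM."""

    def __init__(self, adapter: CrossModalAdapter, llm: nn.Module,
                 instruction_embed: nn.Embedding):
        super().__init__()
        self.adapter = adapter
        self.llm = llm  # stand-in for a frozen pre-trained decoder (e.g., an OPT/LLaMA-style model)
        self.instruction_embed = instruction_embed

    def forward(self, vis_feats, txt_feats, instruction_ids):
        unified = self.adapter(vis_feats, txt_feats)    # (B, Nt, D)
        instr = self.instruction_embed(instruction_ids) # (B, L, D)
        inputs = torch.cat([unified, instr], dim=1)     # prompt = unified tokens + instruction
        return self.llm(inputs)                         # features for relation prediction


if __name__ == "__main__":
    B, Nv, Nt, L, D = 2, 16, 8, 12, 512
    adapter = CrossModalAdapter(vis_dim=768, txt_dim=512, llm_dim=D)
    dummy_llm = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(D, 8, batch_first=True), num_layers=2)
    predictor = RelationPredictor(adapter, dummy_llm, nn.Embedding(1000, D))
    out = predictor(torch.randn(B, Nv, 768), torch.randn(B, Nt, 512),
                    torch.randint(0, 1000, (B, L)))
    print(out.shape)  # torch.Size([2, 20, 512]): 8 unified + 12 instruction tokens

In a setup like this, the LLM would typically be kept frozen and only the lightweight adapter (plus, optionally, LoRA-style parameters) trained, in line with the parameter-efficient tuning common in VLM/LLM-based relation detection.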



Data availability
No datasets were generated or analysed during the current study.
Change history
01 June 2025
Original article has been updated to correct affiliation.
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (62303046), in part by the Beijing Natural Science Foundation (4234088), and in part by the Open Projects Program of the State Key Laboratory of Multimodal Artificial Intelligence Systems.
Author information
Contributions
Y.H.: Designed the major modules and experiments. Wrote the paper, including the introduction, methods, and parts of the related work and experiments. F.Z.: Carried out experimental design and model tuning. Wrote parts of the related work and experiments, and performed result analysis. R.W.: Designed several modules and conducted part of the experiments. Reviewed and revised the introduction and methods. Optimized the overall paper structure and polished the whole paper. J.G.: Conceptualized and designed the framework. Wrote parts of the introduction. Reviewed and revised the whole paper, and oversaw the entire project. All authors approved the final version of the article as accepted for publication, including references.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Additional information
Communicated by An-An Liu.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Hu, Y., Zhang, F., Wei, R. et al. Learning semantic-unified cross-modal representations for open-vocabulary video scene graph generation. Multimedia Systems 31, 188 (2025). https://doi.org/10.1007/s00530-025-01767-9