Abstract
Video scene graph generation (VidSGG) aims to predict visual relation triplets in videos and is a key step toward deeper comprehension of video scenes. To improve applicability in real-world scenarios, recent VidSGG frameworks have explored open-vocabulary settings. However, their heavy reliance on the aligned visual and textual representations of pre-trained vision-language models (VLMs) limits in-depth understanding of visual relationships, because such models fail to capture compositional scene relationships. To address this, we propose a novel open-vocabulary VidSGG framework named semantic-unified cross-modal learning (SUCML), which leverages the capabilities of large language models (LLMs) in visual understanding and semantic reasoning to achieve robust visual relation prediction. Specifically, we incorporate rich knowledge and open-vocabulary capabilities into the framework and design a cross-modal adapter that drives semantic-unified cross-modal representation learning. We then exploit the semantic understanding and reasoning abilities of LLMs by feeding predefined textual instructions, together with the learned semantic-unified visual and text tokens, into a pre-trained LLM for robust relation prediction. Extensive experiments on two public datasets demonstrate that SUCML significantly outperforms existing methods, showcasing the promising potential of LLMs for high-level semantic reasoning and scene understanding.
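To make the described pipeline concrete, the sketch below illustrates one plausible way the components in the abstract could fit together: a cross-modal adapter that projects visual and text features into a shared (semantic-unified) token space, and a predictor that prefixes those unified tokens to embedded instruction tokens before passing the sequence to a pre-trained LLM. This is a minimal PyTorch illustration, not the authors' implementation; the module names (CrossModalAdapter, RelationPredictor), dimensions, cross-attention fusion design, and the stand-in Transformer used in place of a real frozen LLM are all assumptions.

# Minimal sketch of the pipeline described in the abstract (not the authors'
# released code). Module names, dimensions, and the stand-in "LLM" are
# illustrative assumptions.
import torch
import torch.nn as nn


class CrossModalAdapter(nn.Module):
    """Projects visual and text features into a shared token space so that
    both modalities can be consumed by a pre-trained language model."""

    def __init__(self, vis_dim: int, txt_dim: int, llm_dim: int, num_heads: int = 8):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, llm_dim)   # visual features -> LLM space
        self.txt_proj = nn.Linear(txt_dim, llm_dim)   # text features   -> LLM space
        # Cross-attention lets text queries gather relevant visual evidence,
        # producing "semantic-unified" tokens in a single representation space.
        self.cross_attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(llm_dim, 4 * llm_dim), nn.GELU(),
                                 nn.Linear(4 * llm_dim, llm_dim))
        self.norm = nn.LayerNorm(llm_dim)

    def forward(self, vis_feats: torch.Tensor, txt_feats: torch.Tensor) -> torch.Tensor:
        v = self.vis_proj(vis_feats)                   # (B, Nv, llm_dim)
        t = self.txt_proj(txt_feats)                   # (B, Nt, llm_dim)
        fused, _ = self.cross_attn(query=t, key=v, value=v)
        fused = self.norm(t + fused)
        return self.norm(fused + self.ffn(fused))      # (B, Nt, llm_dim)


class RelationPredictor(nn.Module):
    """Prefixes the unified tokens to embedded instruction tokens and feeds
    the combined sequence to a (stand-in) pre-trained LLM."""

    def __init__(self, adapter: CrossModalAdapter, llm: nn.Module,
                 instruction_embed: nn.Embedding):
        super().__init__()
        self.adapter = adapter
        self.llm = llm  # stand-in for a frozen pre-trained decoder (e.g., an OPT/LLaMA-style model)
        self.instruction_embed = instruction_embed

    def forward(self, vis_feats, txt_feats, instruction_ids):
        unified = self.adapter(vis_feats, txt_feats)    # (B, Nt, D)
        instr = self.instruction_embed(instruction_ids) # (B, L, D)
        inputs = torch.cat([unified, instr], dim=1)     # prompt = unified tokens + instruction
        return self.llm(inputs)                         # features for relation prediction


if __name__ == "__main__":
    B, Nv, Nt, L, D = 2, 16, 8, 12, 512
    adapter = CrossModalAdapter(vis_dim=768, txt_dim=512, llm_dim=D)
    dummy_llm = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(D, 8, batch_first=True), num_layers=2)
    predictor = RelationPredictor(adapter, dummy_llm, nn.Embedding(1000, D))
    out = predictor(torch.randn(B, Nv, 768), torch.randn(B, Nt, 512),
                    torch.randint(0, 1000, (B, L)))
    print(out.shape)  # torch.Size([2, 20, 512]): 8 unified + 12 instruction tokens

In a setup like this, the LLM would typically be kept frozen and only the lightweight adapter (plus, optionally, LoRA-style parameters) trained, in line with the parameter-efficient tuning common in VLM/LLM-based relation detection.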



Data availability
No datasets were generated or analysed during the current study.
Change history
01 June 2025
Original article has been updated to correct affiliation.
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (62303046), in part by the Beijing Natural Science Foundation (4234088), and in part by the Open Projects Program of the State Key Laboratory of Multimodal Artificial Intelligence Systems.
Author information
Contributions
Y.H.: Designed the major modules and experiments. Wrote the paper, including the introduction, methods, and parts of the related work and experiments. F.Z.: Carried out experimental design and model tuning. Wrote parts of the related work and experiments, and performed result analysis. R.W.: Designed several modules and conducted part of the experiments. Reviewed and revised the introduction and methods. Optimized the overall paper structure and polished the whole paper. J.G.: Conceptualized and designed the framework. Wrote parts of the introduction. Reviewed and revised the whole paper, and oversaw the entire project. All authors approved the final version of the article as accepted for publication, including references.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Additional information
Communicated by An-An Liu.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Hu, Y., Zhang, F., Wei, R. et al. Learning semantic-unified cross-modal representations for open-vocabulary video scene graph generation. Multimedia Systems 31, 188 (2025). https://doi.org/10.1007/s00530-025-01767-9