Enhancing video temporal grounding with large language model-based data augmentation

The Journal of Supercomputing

Abstract

Given an untrimmed video and a natural language query, the task of video temporal grounding (VTG) is to precisely identify the temporal segment in the video that semantically matches the query. Existing datasets for this task often provide manually annotated natural language queries that are overly simplistic and lack the semantic richness needed to fully capture the video’s content. This limitation hinders a model’s ability to comprehend complex semantic scenarios and degrades its overall performance. To address these challenges, we introduce a novel, low-cost, large language model-based data augmentation method that enriches the original samples and expands the dataset without requiring external data. We propose a fine-grained image captioning module with a noise filter to extract unexploited information from videos. Additionally, we design a hierarchical semantic prompting framework to guide GPT-3.5 in producing semantically rich and contextually coherent natural language queries. Combined with 2D-TAN and VSLNet, our method outperforms the state-of-the-art MRTNet across three public VTG datasets, particularly excelling at complex semantics and long-duration segment localization.
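
The abstract only outlines the augmentation pipeline; the sketch below shows, in broad strokes, how such an LLM-based augmentation step could be wired together. This is a minimal illustration under stated assumptions, not the authors' implementation: a CLIP image-text similarity threshold stands in for the paper's noise filter, a single-turn prompt stands in for the hierarchical semantic prompting framework, and all model names, thresholds, and function names are illustrative assumptions.

```python
# Hedged sketch of an LLM-based query-augmentation step (not the authors' code).
# Assumed data flow: frame captions -> noise filtering -> GPT-3.5 query rewriting.
from typing import List

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
from openai import OpenAI

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
llm = OpenAI()  # reads OPENAI_API_KEY from the environment


def filter_noisy_captions(frames: List[Image.Image], captions: List[str],
                          threshold: float = 0.25) -> List[str]:
    """Noise-filter stand-in: keep captions whose CLIP similarity to their own frame clears a threshold."""
    inputs = clip_proc(text=captions, images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = clip_model(**inputs)
    # CLIPModel returns projected (unnormalized) embeddings; normalize them and take the
    # per-pair cosine similarity on the diagonal (caption i is assumed to belong to frame i).
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    sims = (img * txt).sum(dim=-1)
    return [c for c, s in zip(captions, sims.tolist()) if s >= threshold]


def augment_query(original_query: str, kept_captions: List[str]) -> str:
    """Ask GPT-3.5 for a semantically richer query describing the same segment."""
    prompt = (
        "You rewrite queries for video temporal grounding.\n"
        f"Original query: {original_query}\n"
        f"Fine-grained frame captions: {'; '.join(kept_captions)}\n"
        "Write one semantically richer query that still describes the same temporal "
        "segment. Return only the rewritten query."
    )
    resp = llm.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return resp.choices[0].message.content.strip()
```

The paper's fine-grained captioning module and hierarchical prompts involve more structure than this single prompt; the sketch is only meant to make the overall data flow concrete.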

Data availability

The data presented in this study are available on request from the corresponding author. 

References

  1. Gao J, Sun C, Yang Z, Nevatia R (2017) Tall: Temporal activity localization via language query. In: 2017 IEEE International Conference on Computer Vision (ICCV), pp 5277–5285. https://doi.org/10.1109/ICCV.2017.563

  2. Hendricks LA, Wang O, Shechtman E, Sivic J, Darrell T, Russell B (2017) Localizing moments in video with natural language. In: 2017 IEEE International Conference on Computer Vision (ICCV), pp 5804–5813. https://doi.org/10.1109/ICCV.2017.618

  3. Voigtlaender P, Changpinyo S, Pont-Tuset J, Soricut R, Ferrari V (2023) Connecting vision and language with video localized narratives. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Vancouver, BC, Canada, pp 2461–2471. https://doi.org/10.1109/CVPR52729.2023.00243

  4. Li S, Li B, Sun B, Weng Y (2024) Towards visual-prompt temporal answer grounding in instructional video. IEEE Trans Pattern Anal Mach Intell 46(12):8836–8853. https://doi.org/10.1109/TPAMI.2024.3411045

  5. S D.S, Khan Z, Tapaswi M (2024) FiGCLIP: Fine-Grained CLIP adaptation via densely annotated videos. arXiv. https://doi.org/10.48550/arXiv.2401.07669

  6. Qu M, Chen X, Liu W, Li A, Zhao Y (2024) Chatvtg: Video temporal grounding via chat with video dialogue large language models. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), IEEE, Seattle, WA, USA, pp 1847–1856. https://doi.org/10.1109/CVPRW63382.2024.00191

  7. Pérez-Mayos L, Sukno FM, Wanner L (2018) Improving the quality of video-to-language models by optimizing annotation of the training material. In: Schoeffmann K, Chalidabhongse TH, Ngo CW, Aramvith S, O’Connor NE, Ho Y-S, Gabbouj M, Elgammal A (ed) MultiMedia modeling, Springer, Cham, pp 279–290. https://doi.org/10.1007/978-3-319-73603-7_23

  8. Chen T-S, Siarohin A, Menapace W, Deyneka E, Chao H-W, Jeon BE, Fang Y, Lee H-Y, Ren J, Yang M-H, Tulyakov S (2024) Panda-70m: Captioning 70m videos with multiple cross-modality teachers. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Seattle, WA, USA, pp 13320–13331. https://doi.org/10.1109/CVPR52733.2024.01265

  9. Rafiq G, Rafiq M, Choi GS (2023) Video description: a comprehensive survey of deep learning approaches. Artif Intell Rev 56(11):13293–13372. https://doi.org/10.1007/s10462-023-10414-6

  10. Shi Y, Xu H, Yuan C, Li B, Hu W, Zha Z-J (2023) Learning video-text aligned representations for video captioning. ACM Trans Multimed Comput Commun Appl 19(2):63–16321. https://doi.org/10.1145/3546828

  11. Dong J, Chen X, Zhang M, Yang X, Chen S, Li X, Wang X (2022) Partially relevant video retrieval. In: Proceedings of the 30th ACM International Conference on Multimedia. MM ’22, Association for Computing Machinery, New York, NY, USA, pp 246–257. https://doi.org/10.1145/3503161.3547976

  12. Hou D, Pang L, Shen H, Cheng X (2024) Improving video corpus moment retrieval with partial relevance enhancement. In: Proceedings of the 2024 International Conference on Multimedia Retrieval. ICMR ’24, Association for Computing Machinery, New York, NY, USA, pp 394–403. https://doi.org/10.1145/3652583.3658088

  13. Mahmud T, Liang F, Qing Y, Marculescu D (2023) Clip4videocap: Rethinking clip for video captioning with multiscale temporal fusion and commonsense knowledge. In: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 1–5. https://doi.org/10.1109/ICASSP49357.2023.10097128

  14. Olivastri S, Singh G, Cuzzolin F (2019) End-to-end video captioning. In: 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), IEEE, Seoul, Korea (South), pp 1474–1482. https://doi.org/10.1109/ICCVW.2019.00185

  15. Adewale S, Ige T, Matti BH (2023) Encoder-decoder based long short-term memory (LSTM) model for video captioning. arXiv. https://doi.org/10.48550/arXiv.2401.02052

  16. Liu H, Singh P (2004) Conceptnet – a practical commonsense reasoning tool-kit. BT Tech J 22(4):211–226. https://doi.org/10.1023/B:BTTJ.0000047600.45421.6d

  17. Miech A, Alayrac J-B, Smaira L, Laptev I, Sivic J, Zisserman A (2020) End-to-end learning of visual representations from uncurated instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

  18. Sun C, Myers A, Vondrick C, Murphy K, Schmid C (2019) Videobert: A joint model for video and language representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

  19. Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J, Krueger G, Sutskever I (2021) Learning transferable visual models from natural language supervision. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 139, pp. 8748–8763. PMLR, Vienna, Austria

  20. Bain M, Nagrani A, Varol G, Zisserman A (2021) Frozen in time: A joint video and image encoder for end-to-end retrieval. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp 1728–1738

  21. Li J, Li D, Savarese S, Hoi S (2023) Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In: Proceedings of the 40th International Conference on Machine Learning. ICML’23, vol. 202, pp 19730–19742. JMLR.org, Honolulu, Hawaii, USA

  22. Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A, Agarwal S, Herbert-Voss A, Krueger G, Henighan T, Child R, Ramesh A, Ziegler DM, Wu J, Winter C, Hesse C, Chen M, Sigler E, Litwin M, Gray S, Chess B, Clark J, Berner C, McCandlish S, Radford A, Sutskever I, Amodei D (2020) Language models are few-shot learners. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. NIPS ’20, pp 1877–1901. Curran Associates Inc., Red Hook, NY, USA

  23. Heilbron FC, Escorcia V, Ghanem B, Niebles JC (2015) Activitynet: a large-scale video benchmark for human activity understanding. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 961–970. https://doi.org/10.1109/CVPR.2015.7298698

  24. Regneri M, Rohrbach M, Wetzel D, Thater S, Schiele B, Pinkal M (2013) Grounding action descriptions in videos. Trans Assoc Comput Linguist 1:25–36. https://doi.org/10.1162/tacl_a_00207

  25. Zhang S, Peng H, Fu J, Luo J (2020) Learning 2d temporal adjacent networks for moment localization with natural language. Proceedings of the AAAI Conference on Artificial Intelligence 34(07):12870–12877. https://doi.org/10.1609/aaai.v34i07.6984

  26. Zhang H, Sun A, Jing W, Zhou JT (2020) Span-based localizing network for natural language video localization. In: Jurafsky D, Chai J, Schluter N, Tetreault J (ed) Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online. pp 6543–6554. https://doi.org/10.18653/v1/2020.acl-main.585

  27. Zhang H, Sun A, Jing W, Zhen L, Zhou JT, Goh RSM (2022) Natural language video localization: a revisit in span-based question answering framework. IEEE Trans Pattern Anal Mach Intell 44(8):4252–4266. https://doi.org/10.1109/TPAMI.2021.3060449

  28. Ma J, Ushiku Y, Sagara M (2022) The effect of improving annotation quality on object detection datasets: a preliminary study. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp 4849–4858. https://doi.org/10.1109/CVPRW56347.2022.00532

  29. Li S-Y, Jiang Y (2018) Multi-label crowdsourcing learning with incomplete annotations. In: Geng X, Kang B-H (ed) PRICAI 2018: Trends in Artificial Intelligence, Springer, Cham, pp 232–245. https://doi.org/10.1007/978-3-319-97304-3_18

  30. Mun J, Cho M, Han B (2020) Local-global video-text interactions for temporal grounding. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp 10807–10816. https://doi.org/10.1109/CVPR42600.2020.01082

  31. Jie Z, Xie P, Lu W, Ding R, Li L (2019) Better modeling of incomplete annotations for named entity recognition. In: Burstein J, Doran C, Solorio T (ed) Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota. pp 729–734. https://doi.org/10.18653/v1/N19-1079

  32. Lan X, Yuan Y, Chen H, Wang X, Jie Z, Ma L, Wang Z, Zhu W (2023) Curriculum multi-negative augmentation for debiased video grounding. Proceedings of the AAAI Conference on Artificial Intelligence 37(1):1213–1221. https://doi.org/10.1609/aaai.v37i1.25204

  33. Kim T, Kim J, Shim M, Yun S, Kang M, Wee D, Lee S (2022) Exploring temporally dynamic data augmentation for video recognition. arXiv. https://doi.org/10.48550/arXiv.2206.15015

  34. Gorpincenko A, Mackiewicz M (2023) Extending temporal data augmentation for video action recognition. In: Yan WQ, Nguyen M, Stommel M (ed) Image and Vision Computing, Springer, Cham, pp 104–118. https://doi.org/10.1007/978-3-031-25825-1_8

  35. Kumar V, Choudhary A, Cho E (2020) Data augmentation using pre-trained transformer models. In: Campbell WM, Waibel A, Hakkani-Tur D, Hazen TJ, Kilgour K, Cho E, Kumar V, Glaude H (ed) Proceedings of the 2nd Workshop on Life-Long Learning for Spoken Language Systems, Association for Computational Linguistics, Suzhou, China, pp 18–26. https://doi.org/10.18653/v1/2020.lifelongnlp-1.3

  36. Shorten C, Khoshgoftaar TM, Furht B (2021) Text data augmentation for deep learning. J Big Data 8(1):101. https://doi.org/10.1186/s40537-021-00492-0

  37. Touvron H, Martin L, Stone K, Albert P, Almahairi A, Babaei Y, Bashlykov N, Batra S, Bhargava P, Bhosale S, Bikel D, Blecher L, Ferrer CC, Chen M, Cucurull G, Esiobu D, Fernandes J, Fu J, Fu W, Fuller B, Gao C, Goswami V, Goyal N, Hartshorn A, Hosseini S, Hou R, Inan H, Kardas M, Kerkez V, Khabsa M, Kloumann I, Korenev A, Koura PS, Lachaux M-A, Lavril T, Lee J, Liskovich D, Lu Y, Mao Y, Martinet X, Mihaylov T, Mishra P, Molybog I, Nie Y, Poulton A, Reizenstein J, Rungta R, Saladi K, Schelten A, Silva R, Smith EM, Subramanian R, Tan XE, Tang B, Taylor R, Williams A, Kuan JX, Xu P, Yan Z, Zarov I, Zhang Y, Fan A, Kambadur M, Narang S, Rodriguez A, Stojnic R, Edunov S, Scialom T (2023) Llama 2: open foundation and fine-tuned chat models. arXiv. https://doi.org/10.48550/arXiv.2307.09288

  38. Almazrouei E, Alobeidli H, Alshamsi A, Cappelli A, Cojocaru R, Debbah M, Goffinet É, Hesslow D, Launay J, Malartic Q, Mazzotta D, Noune B, Pannier B, Penedo G (2023) The falcon series of open language models. arXiv. https://doi.org/10.48550/arXiv.2311.16867

  39. Yang A, Yang B, Zhang B, Hui B, Zheng B, Yu B, Li C, Liu D, Huang F, Wei H, Lin H, Yang J, Tu J, Zhang J, Yang J, Yang J, Zhou J, Lin J, Dang K, Lu K, Bao K, Yang K, Yu L, Li M, Xue M, Zhang P, Zhu Q, Men R, Lin R, Li T, Xia T, Ren X, Ren X, Fan Y, Su Y, Zhang Y, Wan Y, Liu Y, Cui Z, Zhang Z, Qiu Z (2024) Qwen2.5 technical report. arXiv. https://doi.org/10.48550/arXiv.2412.15115

  40. Ye J, Chen X, Xu N, Zu C, Shao Z, Liu S, Cui Y, Zhou Z, Gong C, Shen Y, Zhou J, Chen S, Gui T, Zhang Q, Huang X (2023) A comprehensive capability analysis of GPT-3 and GPT-3.5 Series Models. arXiv. https://doi.org/10.48550/arXiv.2303.10420

  41. Wang Y, Li D, Shen J, Xu Y, Xu M, Funakoshi K, Okumura M (2024) LAMBDA: large language model-based data augmentation for multi-modal machine translation. In: Findings of the Association for Computational Linguistics: EMNLP 2024, Association for Computational Linguistics, Miami, Florida, USA, pp 15240–15253. https://doi.org/10.18653/v1/2024.findings-emnlp.893

  42. Sigurdsson GA, Varol G, Wang X, Farhadi A, Laptev I, Gupta A (2016) Hollywood in homes: crowdsourcing data collection for activity understanding. In: Leibe B, Matas J, Sebe N, Welling, M (ed) Computer Vision – ECCV 2016, Springer, Cham, pp 510–526. https://doi.org/10.1007/978-3-319-46448-0_31

  43. Papineni K, Roukos S, Ward T, Zhu W-J (2002) Bleu: a method for automatic evaluation of machine translation. In: Isabelle P, Charniak E, Lin D (ed) Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, pp 311–318. https://doi.org/10.3115/1073083.1073135

  44. Banerjee S, Lavie A (2005) Meteor: an automatic metric for mt evaluation with improved correlation with human judgments. In: Goldstein J, Lavie A, Lin C.-Y., Voss C (ed) Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation And/or Summarization, Association for Computational Linguistics, Ann Arbor, Michigan, pp 65–72

  45. Ji W, Qin Y, Chen L, Wei Y, Wu Y, Zimmermann R (2024) Mrtnet: multi-resolution temporal network for video sentence grounding. In: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 2770–2774. https://doi.org/10.1109/ICASSP48485.2024.10447846

  46. Liu M, Wang X, Nie L, Tian Q, Chen B, Chua T-S (2018) Cross-modal moment localization in videos. In: Proceedings of the 26th ACM International Conference on Multimedia. MM ’18, Association for Computing Machinery, New York, NY, USA, pp 843–851. https://doi.org/10.1145/3240508.3240549

  47. Ge R, Gao J, Chen K, Nevatia R (2019) Mac: Mining activity concepts for language-based temporal localization. In: 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp 245–253. https://doi.org/10.1109/WACV.2019.00032

  48. Chen S, Jiang Y-G (2019) Semantic proposal for activity localization in videos via sentence query. In: Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence. AAAI’19/IAAI’19/EAAI’19, vol. 33, pp. 8199–8206. AAAI Press, Honolulu, Hawaii, USA. https://doi.org/10.1609/aaai.v33i01.33018199

  49. Wang W, Huang Y, Wang L (2019) Language-driven temporal activity localization: a semantic matching reinforcement learning model. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp 334–343. https://doi.org/10.1109/CVPR.2019.00042

  50. Xu H, He K, Plummer BA, Sigal L, Sclaroff S, Saenko K (2019) Multilevel language and vision integration for text-to-clip retrieval. In: Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence. AAAI’19/IAAI’19/EAAI’19, vol 33, pp 9062–9069. AAAI Press, Honolulu, Hawaii, USA. https://doi.org/10.1609/aaai.v33i01.33019062

  51. Lu C, Chen L, Tan C, Li X, Xiao J (2019) Debug: a dense bottom-up grounding approach for natural language video localization. In: Inui K, Jiang J, Ng V, Wan X (ed) Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, pp 5144–5153. https://doi.org/10.18653/v1/D19-1518

  52. Ghosh S, Agarwal A, Parekh Z, Hauptmann A (2019) Excl: Extractive clip localization using natural language descriptions. In: Burstein J, Doran C, Solorio T (ed) Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol 1 (Long and Short Papers), pp 1984–1990. Association for Computational Linguistics, Minneapolis, Minnesota. https://doi.org/10.18653/v1/N19-1198

  53. Zhang D, Dai X, Wang X, Wang Y-F, Davis LS (2019) Man: Moment alignment network for natural language moment retrieval via iterative graph adjustment. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp 1247–1257. https://doi.org/10.1109/CVPR.2019.00134

  54. Chen L, Lu C, Tang S, Xiao J, Zhang D, Tan C, Li X (2020) Rethinking the bottom-up framework for query-based video localization. Proceedings of the AAAI Conference on Artificial Intelligence 34(07):10551–10558. https://doi.org/10.1609/aaai.v34i07.6627

  55. Zeng R, Xu H, Huang W, Chen P, Tan M, Gan C (2020) Dense regression network for video grounding. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp 10284–10293. https://doi.org/10.1109/CVPR42600.2020.01030

  56. Liu D, Qu X, Dong J, Zhou P, Cheng Y, Wei W, Xu Z, Xie Y (2021) Context-aware biaffine localizing network for temporal sentence grounding. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Nashville, TN, USA, pp 11230–11239. https://doi.org/10.1109/CVPR46437.2021.01108

  57. Yu X, Malmir M, He X, Chen J, Wang T, Wu Y, Liu Y, Liu Y (2021) Cross interaction network for natural language guided video moment retrieval. In: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR ’21, Association for Computing Machinery, New York, NY, USA, pp 1860–1864. https://doi.org/10.1145/3404835.3463021

  58. Wang Z, Wang L, Wu T, Li T, Wu G (2022) Negative sample matters: a renaissance of metric learning for temporal grounding. Proceedings of the AAAI Conference on Artificial Intelligence 36(3):2613–2623. https://doi.org/10.1609/aaai.v36i3.20163

  59. Xu Z, Wei K, Yang X, Deng C (2023) Point-supervised video temporal grounding. IEEE Trans Multimed 25:6121–6131. https://doi.org/10.1109/TMM.2022.3205404

  60. Li H, Shu X, He S, Qiao R, Wen W, Guo T, Gan B, Sun X (2023) D3g: Exploring gaussian prior for temporal sentence grounding with glance annotation. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), IEEE, Paris, France, pp 13688–13700. https://doi.org/10.1109/ICCV51070.2023.01263

  61. Chen J, Chen X, Ma L, Jie Z, Chua T-S (2018) Temporally grounding natural sentence in video. In: Riloff E, Chiang D, Hockenmaier J, Tsujii J (ed) Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 162–171. Association for Computational Linguistics, Brussels, Belgium. https://doi.org/10.18653/v1/D18-1015

  62. Zhang Z, Lin Z, Zhao Z, Xiao Z (2019) Cross-modal interaction networks for query-based moment retrieval in videos. In: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR’19, Association for Computing Machinery, New York, NY, USA, pp 655–664. https://doi.org/10.1145/3331184.3331235

  63. Yuan Y, Mei T, Zhu W (2019) To find where you talk: Temporal sentence localization in video with attention based location regression. Proceedings of the AAAI Conference on Artificial Intelligence 33(01):9159–9166. https://doi.org/10.1609/aaai.v33i01.33019159

  64. Zhang H, Sun A, Jing W, Zhen L, Zhou JT, Goh RSM (2021) Parallel attention network with sequence matching for video grounding. In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp 776–790. https://doi.org/10.18653/v1/2021.findings-acl.69

  65. Ju C, Wang H, Liu J, Ma C, Zhang Y, Zhao P, Chang J, Tian Q (2023) Constraint and union for partially-supervised temporal sentence grounding. arXiv. https://doi.org/10.48550/arXiv.2302.09850

  66. Liu M, Wang X, Nie L, He X, Chen B, Chua T-S (2018) Attentive moment retrieval in videos. In: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. SIGIR ’18, Association for Computing Machinery, New York, NY, USA, pp 15–24. https://doi.org/10.1145/3209978.3210003

  67. Mishra S, Seth S, Jain S, Pant V, Parikh J, Jain R, Islam SMN (2024) Image caption generation using vision transformer and gpt architecture. In: 2024 2nd International Conference on Advancement in Computation & Computer Technologies (InCACCT), pp 1–6. https://doi.org/10.1109/InCACCT61598.2024.10551257

Author information

Contributions

Yun Tian: Conceptualization of the research idea; development of the multimodal data augmentation framework; performing experiments and data analysis; drafting the manuscript. Xiaobo Guo: Supervision of the research process; providing technical guidance; reviewing and revising the manuscript critically for important intellectual content. Jinsong Wang: Assisting in the experimental setup and implementation; conducting performance evaluations on benchmark datasets; contributing to the interpretation of results. Bin Li: Supporting the integration of large language models into the framework; helping with the literature review and manuscript refinement; providing domain-specific insights. All authors have read and approved the final manuscript.

Corresponding author

Correspondence to Xiaobo Guo.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Tian, Y., Guo, X., Wang, J. et al. Enhancing video temporal grounding with large language model-based data augmentation. J Supercomput 81, 658 (2025). https://doi.org/10.1007/s11227-025-07159-0

Keywords