Enhancing video temporal grounding with large language model-based data augmentation

The Journal of Supercomputing

Abstract

Given an untrimmed video and a natural language query, the task of video temporal grounding (VTG) is to precisely identify the temporal segment in the video that semantically matches the query. Existing datasets for this task often provide manually annotated natural language queries that are overly simplistic and lack the semantic richness needed to fully capture the video’s content. This limitation hinders a model’s ability to comprehend complex semantic scenarios and degrades its overall performance. To address these challenges, we introduce a novel, low-cost, large language model-based data augmentation method that enriches the original samples and expands the dataset without requiring external data. We propose a fine-grained image captioning module with a noise filter to extract unexploited information from videos. Additionally, we design a hierarchical semantic prompting framework to guide GPT-3.5 in producing semantically rich and contextually coherent natural language queries. Combined with 2D-TAN and VSLNet, our method outperforms the state-of-the-art MRTNet across three public VTG datasets, particularly excelling at complex semantics and long-duration segment localization.
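
The abstract only outlines the augmentation pipeline; the sketch below shows, in broad strokes, how such an LLM-based augmentation step could be wired together. This is a minimal illustration under stated assumptions, not the authors' implementation: a CLIP image-text similarity threshold stands in for the paper's noise filter, a single-turn prompt stands in for the hierarchical semantic prompting framework, and all model names, thresholds, and function names are illustrative assumptions.

```python
# Hedged sketch of an LLM-based query-augmentation step (not the authors' code).
# Assumed data flow: frame captions -> noise filtering -> GPT-3.5 query rewriting.
from typing import List

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
from openai import OpenAI

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
llm = OpenAI()  # reads OPENAI_API_KEY from the environment


def filter_noisy_captions(frames: List[Image.Image], captions: List[str],
                          threshold: float = 0.25) -> List[str]:
    """Noise-filter stand-in: keep captions whose CLIP similarity to their own frame clears a threshold."""
    inputs = clip_proc(text=captions, images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = clip_model(**inputs)
    # CLIPModel returns projected (unnormalized) embeddings; normalize them and take the
    # per-pair cosine similarity on the diagonal (caption i is assumed to belong to frame i).
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    sims = (img * txt).sum(dim=-1)
    return [c for c, s in zip(captions, sims.tolist()) if s >= threshold]


def augment_query(original_query: str, kept_captions: List[str]) -> str:
    """Ask GPT-3.5 for a semantically richer query describing the same segment."""
    prompt = (
        "You rewrite queries for video temporal grounding.\n"
        f"Original query: {original_query}\n"
        f"Fine-grained frame captions: {'; '.join(kept_captions)}\n"
        "Write one semantically richer query that still describes the same temporal "
        "segment. Return only the rewritten query."
    )
    resp = llm.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return resp.choices[0].message.content.strip()
```

The paper's fine-grained captioning module and hierarchical prompts involve more structure than this single prompt; the sketch is only meant to make the overall data flow concrete.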

Data availability

The data presented in this study are available on request from the corresponding author. 

References

  1. Gao J, Sun C, Yang Z, Nevatia R (2017) Tall: Temporal activity localization via language query. In: 2017 IEEE International Conference on Computer Vision (ICCV), pp 5277–5285. https://doi.org/10.1109/ICCV.2017.563

  2. Hendricks LA, Wang O, Shechtman E, Sivic J, Darrell T, Russell B (2017) Localizing moments in video with natural language. In: 2017 IEEE International Conference on Computer Vision (ICCV), pp 5804–5813. https://doi.org/10.1109/ICCV.2017.618

  3. Voigtlaender P, Changpinyo S, Pont-Tuset J, Soricut R, Ferrari V (2023) Connecting vision and language with video localized narratives. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Vancouver, BC, Canada, pp 2461–2471. https://doi.org/10.1109/CVPR52729.2023.00243

  4. Li S, Li B, Sun B, Weng Y (2024) Towards visual-prompt temporal answer grounding in instructional video. IEEE Trans Pattern Anal Mach Intell 46(12):8836–8853. https://doi.org/10.1109/TPAMI.2024.3411045

  5. S D.S, Khan Z, Tapaswi M (2024) FiGCLIP: Fine-Grained CLIP adaptation via densely annotated videos. arXiv. https://doi.org/10.48550/arXiv.2401.07669

  6. Qu M, Chen X, Liu W, Li A, Zhao Y (2024) Chatvtg: Video temporal grounding via chat with video dialogue large language models. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), IEEE, Seattle, WA, USA, pp 1847–1856. https://doi.org/10.1109/CVPRW63382.2024.00191

  7. Pérez-Mayos L, Sukno FM, Wanner L (2018) Improving the quality of video-to-language models by optimizing annotation of the training material. In: Schoeffmann K, Chalidabhongse TH, Ngo CW, Aramvith S, O’Connor NE, Ho Y-S, Gabbouj M, Elgammal A (ed) MultiMedia modeling, Springer, Cham, pp 279–290. https://doi.org/10.1007/978-3-319-73603-7_23

  8. Chen T-S, Siarohin A, Menapace W, Deyneka E, Chao H-W, Jeon BE, Fang Y, Lee H-Y, Ren J, Yang M-H, Tulyakov S (2024) Panda-70m: Captioning 70m videos with multiple cross-modality teachers. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Seattle, WA, USA, pp 13320–13331. https://doi.org/10.1109/CVPR52733.2024.01265

  9. Rafiq G, Rafiq M, Choi GS (2023) Video description: a comprehensive survey of deep learning approaches. Artif Intell Rev 56(11):13293–13372. https://doi.org/10.1007/s10462-023-10414-6

  10. Shi Y, Xu H, Yuan C, Li B, Hu W, Zha Z-J (2023) Learning video-text aligned representations for video captioning. ACM Trans Multimed Comput Commun Appl 19(2):63–16321. https://doi.org/10.1145/3546828

  11. Dong J, Chen X, Zhang M, Yang X, Chen S, Li X, Wang X (2022) Partially relevant video retrieval. In: Proceedings of the 30th ACM International Conference on Multimedia. MM ’22, Association for Computing Machinery, New York, NY, USA, pp 246–257. https://doi.org/10.1145/3503161.3547976

  12. Hou D, Pang L, Shen H, Cheng X (2024) Improving video corpus moment retrieval with partial relevance enhancement. In: Proceedings of the 2024 International Conference on Multimedia Retrieval. ICMR ’24, Association for Computing Machinery, New York, NY, USA, pp 394–403. https://doi.org/10.1145/3652583.3658088

  13. Mahmud T, Liang F, Qing Y, Marculescu D (2023) Clip4videocap: Rethinking clip for video captioning with multiscale temporal fusion and commonsense knowledge. In: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 1–5. https://doi.org/10.1109/ICASSP49357.2023.10097128

  14. Olivastri S, Singh G, Cuzzolin F (2019) End-to-end video captioning. In: 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), IEEE, Seoul, Korea (South), pp 1474–1482. https://doi.org/10.1109/ICCVW.2019.00185

  15. Adewale S, Ige T, Matti BH (2023) Encoder-decoder based long short-term memory (LSTM) model for video captioning. arXiv. https://doi.org/10.48550/arXiv.2401.02052

  16. Liu H, Singh P (2004) Conceptnet – a practical commonsense reasoning tool-kit. BT Tech J 22(4):211–226. https://doi.org/10.1023/B:BTTJ.0000047600.45421.6d

  17. Miech A, Alayrac J-B, Smaira L, Laptev I, Sivic J, Zisserman A (2020) End-to-end learning of visual representations from uncurated instructional videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

  18. Sun C, Myers A, Vondrick C, Murphy K, Schmid C (2019) Videobert: A joint model for video and language representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

  19. Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J, Krueger G, Sutskever I (2021) Learning transferable visual models from natural language supervision. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 139, pp. 8748–8763. PMLR, Vienna, Austria

  20. Bain M, Nagrani A, Varol G, Zisserman A (2021) Frozen in time: A joint video and image encoder for end-to-end retrieval. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp 1728–1738

  21. Li J, Li D, Savarese S, Hoi S (2023) Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In: Proceedings of the 40th International Conference on Machine Learning. ICML’23, vol. 202, pp 19730–19742. JMLR.org, Honolulu, Hawaii, USA

  22. Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A, Agarwal S, Herbert-Voss A, Krueger G, Henighan T, Child R, Ramesh A, Ziegler DM, Wu J, Winter C, Hesse C, Chen M, Sigler E, Litwin M, Gray S, Chess B, Clark J, Berner C, McCandlish S, Radford A, Sutskever I, Amodei D (2020) Language models are few-shot learners. In: Proceedings of the 34th International Conference on Neural Information Processing Systems. NIPS ’20, pp 1877–1901. Curran Associates Inc., Red Hook, NY, USA

  23. Heilbron FC, Escorcia V, Ghanem B, Niebles JC (2015) Activitynet: a large-scale video benchmark for human activity understanding. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 961–970. https://doi.org/10.1109/CVPR.2015.7298698

  24. Regneri M, Rohrbach M, Wetzel D, Thater S, Schiele B, Pinkal M (2013) Grounding action descriptions in videos. Trans Assoc Comput Linguist 1:25–36. https://doi.org/10.1162/tacl_a_00207

  25. Zhang S, Peng H, Fu J, Luo J (2020) Learning 2d temporal adjacent networks for moment localization with natural language. Proceedings of the AAAI Conference on Artificial Intelligence 34(07):12870–12877. https://doi.org/10.1609/aaai.v34i07.6984

  26. Zhang H, Sun A, Jing W, Zhou JT (2020) Span-based localizing network for natural language video localization. In: Jurafsky D, Chai J, Schluter N, Tetreault J (ed) Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online. pp 6543–6554. https://doi.org/10.18653/v1/2020.acl-main.585

  27. Zhang H, Sun A, Jing W, Zhen L, Zhou JT, Goh RSM (2022) Natural language video localization: a revisit in span-based question answering framework. IEEE Trans Pattern Anal Mach Intell 44(8):4252–4266. https://doi.org/10.1109/TPAMI.2021.3060449

  28. Ma J, Ushiku Y, Sagara M (2022) The effect of improving annotation quality on object detection datasets: a preliminary study. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp 4849–4858. https://doi.org/10.1109/CVPRW56347.2022.00532

  29. Li S-Y, Jiang Y (2018) Multi-label crowdsourcing learning with incomplete annotations. In: Geng X, Kang B-H (ed) PRICAI 2018: Trends in Artificial Intelligence, Springer, Cham, pp 232–245. https://doi.org/10.1007/978-3-319-97304-3_18

  30. Mun J, Cho M, Han B (2020) Local-global video-text interactions for temporal grounding. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp 10807–10816. https://doi.org/10.1109/CVPR42600.2020.01082

  31. Jie Z, Xie P, Lu W, Ding R, Li L (2019) Better modeling of incomplete annotations for named entity recognition. In: Burstein J, Doran C, Solorio T (ed) Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota. pp 729–734. https://doi.org/10.18653/v1/N19-1079

  32. Lan X, Yuan Y, Chen H, Wang X, Jie Z, Ma L, Wang Z, Zhu W (2023) Curriculum multi-negative augmentation for debiased video grounding. Proceedings of the AAAI Conference on Artificial Intelligence 37(1):1213–1221. https://doi.org/10.1609/aaai.v37i1.25204

  33. Kim T, Kim J, Shim M, Yun S, Kang M, Wee D, Lee S (2022) Exploring temporally dynamic data augmentation for video recognition. arXiv. https://doi.org/10.48550/arXiv.2206.15015

  34. Gorpincenko A, Mackiewicz M (2023) Extending temporal data augmentation for video action recognition. In: Yan WQ, Nguyen M, Stommel M (ed) Image and Vision Computing, Springer, Cham, pp 104–118. https://doi.org/10.1007/978-3-031-25825-1_8

  35. Kumar V, Choudhary A, Cho E (2020) Data augmentation using pre-trained transformer models. In: Campbell WM, Waibel A, Hakkani-Tur D, Hazen TJ, Kilgour K, Cho E, Kumar V, Glaude H (ed) Proceedings of the 2nd Workshop on Life-Long Learning for Spoken Language Systems, Association for Computational Linguistics, Suzhou, China, pp 18–26. https://doi.org/10.18653/v1/2020.lifelongnlp-1.3

  36. Shorten C, Khoshgoftaar TM, Furht B (2021) Text data augmentation for deep learning. J Big Data 8(1):101. https://doi.org/10.1186/s40537-021-00492-0

  37. Touvron H, Martin L, Stone K, Albert P, Almahairi A, Babaei Y, Bashlykov N, Batra S, Bhargava P, Bhosale S, Bikel D, Blecher L, Ferrer CC, Chen M, Cucurull G, Esiobu D, Fernandes J, Fu J, Fu W, Fuller B, Gao C, Goswami V, Goyal N, Hartshorn A, Hosseini S, Hou R, Inan H, Kardas M, Kerkez V, Khabsa M, Kloumann I, Korenev A, Koura PS, Lachaux M-A, Lavril T, Lee J, Liskovich D, Lu Y, Mao Y, Martinet X, Mihaylov T, Mishra P, Molybog I, Nie Y, Poulton A, Reizenstein J, Rungta R, Saladi K, Schelten A, Silva R, Smith EM, Subramanian R, Tan XE, Tang B, Taylor R, Williams A, Kuan JX, Xu P, Yan Z, Zarov I, Zhang Y, Fan A, Kambadur M, Narang S, Rodriguez A, Stojnic R, Edunov S, Scialom T (2023) Llama 2: open foundation and fine-tuned chat models. arXiv. https://doi.org/10.48550/arXiv.2307.09288

  38. Almazrouei E, Alobeidli H, Alshamsi A, Cappelli A, Cojocaru R, Debbah M, Goffinet É, Hesslow D, Launay J, Malartic Q, Mazzotta D, Noune B, Pannier B, Penedo G (2023) The falcon series of open language models. arXiv. https://doi.org/10.48550/arXiv.2311.16867

  39. Yang A, Yang B, Zhang B, Hui B, Zheng B, Yu B, Li C, Liu D, Huang F, Wei H, Lin H, Yang J, Tu J, Zhang J, Yang J, Yang J, Zhou J, Lin J, Dang K, Lu K, Bao K, Yang K, Yu L, Li M, Xue M, Zhang P, Zhu Q, Men R, Lin R, Li T, Xia T, Ren X, Ren X, Fan Y, Su Y, Zhang Y, Wan Y, Liu Y, Cui Z, Zhang Z, Qiu Z (2024) Qwen2.5 technical report. arXiv. https://doi.org/10.48550/arXiv.2412.15115

  40. Ye J, Chen X, Xu N, Zu C, Shao Z, Liu S, Cui Y, Zhou Z, Gong C, Shen Y, Zhou J, Chen S, Gui T, Zhang Q, Huang X (2023) A comprehensive capability analysis of GPT-3 and GPT-3.5 Series Models. arXiv. https://doi.org/10.48550/arXiv.2303.10420

  41. Wang Y, Li D, Shen J, Xu Y, Xu M, Funakoshi K, Okumura M (2024) LAMBDA: large language model-based data augmentation for multi-modal machine translation. In: Findings of the Association for Computational Linguistics: EMNLP 2024, Association for Computational Linguistics, Miami, Florida, USA, pp 15240–15253. https://doi.org/10.18653/v1/2024.findings-emnlp.893

  42. Sigurdsson GA, Varol G, Wang X, Farhadi A, Laptev I, Gupta A (2016) Hollywood in homes: crowdsourcing data collection for activity understanding. In: Leibe B, Matas J, Sebe N, Welling, M (ed) Computer Vision – ECCV 2016, Springer, Cham, pp 510–526. https://doi.org/10.1007/978-3-319-46448-0_31

  43. Papineni K, Roukos S, Ward T, Zhu W-J (2002) Bleu: a method for automatic evaluation of machine translation. In: Isabelle P, Charniak E, Lin D (ed) Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, pp 311–318. https://doi.org/10.3115/1073083.1073135

  44. Banerjee S, Lavie A (2005) Meteor: an automatic metric for mt evaluation with improved correlation with human judgments. In: Goldstein J, Lavie A, Lin C.-Y., Voss C (ed) Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation And/or Summarization, Association for Computational Linguistics, Ann Arbor, Michigan, pp 65–72

  45. Ji W, Qin Y, Chen L, Wei Y, Wu Y, Zimmermann R (2024) Mrtnet: multi-resolution temporal network for video sentence grounding. In: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 2770–2774. https://doi.org/10.1109/ICASSP48485.2024.10447846

  46. Liu M, Wang X, Nie L, Tian Q, Chen B, Chua T-S (2018) Cross-modal moment localization in videos. In: Proceedings of the 26th ACM International Conference on Multimedia. MM ’18, Association for Computing Machinery, New York, NY, USA, pp 843–851. https://doi.org/10.1145/3240508.3240549

  47. Ge R, Gao J, Chen K, Nevatia R (2019) Mac: Mining activity concepts for language-based temporal localization. In: 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp 245–253. https://doi.org/10.1109/WACV.2019.00032

  48. Chen S, Jiang Y-G (2019) Semantic proposal for activity localization in videos via sentence query. In: Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence. AAAI’19/IAAI’19/EAAI’19, vol. 33, pp. 8199–8206. AAAI Press, Honolulu, Hawaii, USA. https://doi.org/10.1609/aaai.v33i01.33018199

  49. Wang W, Huang Y, Wang L (2019) Language-driven temporal activity localization: a semantic matching reinforcement learning model. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp 334–343. https://doi.org/10.1109/CVPR.2019.00042

  50. Xu H, He K, Plummer BA, Sigal L, Sclaroff S, Saenko K (2019) Multilevel language and vision integration for text-to-clip retrieval. In: Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence. AAAI’19/IAAI’19/EAAI’19, vol 33, pp 9062–9069. AAAI Press, Honolulu, Hawaii, USA. https://doi.org/10.1609/aaai.v33i01.33019062

  51. Lu C, Chen L, Tan C, Li X, Xiao J (2019) Debug: a dense bottom-up grounding approach for natural language video localization. In: Inui K, Jiang J, Ng V, Wan X (ed) Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, pp 5144–5153. https://doi.org/10.18653/v1/D19-1518

  52. Ghosh S, Agarwal A, Parekh Z, Hauptmann A (2019) Excl: Extractive clip localization using natural language descriptions. In: Burstein J, Doran C, Solorio T (ed) Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol 1 (Long and Short Papers), pp 1984–1990. Association for Computational Linguistics, Minneapolis, Minnesota. https://doi.org/10.18653/v1/N19-1198

  53. Zhang D, Dai X, Wang X, Wang Y-F, Davis LS (2019) Man: Moment alignment network for natural language moment retrieval via iterative graph adjustment. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp 1247–1257. https://doi.org/10.1109/CVPR.2019.00134

  54. Chen L, Lu C, Tang S, Xiao J, Zhang D, Tan C, Li X (2020) Rethinking the bottom-up framework for query-based video localization. Proceedings of the AAAI Conference on Artificial Intelligence 34(07):10551–10558. https://doi.org/10.1609/aaai.v34i07.6627

  55. Zeng R, Xu H, Huang W, Chen P, Tan M, Gan C (2020) Dense regression network for video grounding. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp 10284–10293. https://doi.org/10.1109/CVPR42600.2020.01030

  56. Liu D, Qu X, Dong J, Zhou P, Cheng Y, Wei W, Xu Z, Xie Y (2021) Context-aware biaffine localizing network for temporal sentence grounding. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Nashville, TN, USA, pp 11230–11239. https://doi.org/10.1109/CVPR46437.2021.01108

  57. Yu X, Malmir M, He X, Chen J, Wang T, Wu Y, Liu Y, Liu Y (2021) Cross interaction network for natural language guided video moment retrieval. In: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR ’21, Association for Computing Machinery, New York, NY, USA, pp 1860–1864. https://doi.org/10.1145/3404835.3463021

  58. Wang Z, Wang L, Wu T, Li T, Wu G (2022) Negative sample matters: a renaissance of metric learning for temporal grounding. Proceedings of the AAAI Conference on Artificial Intelligence 36(3):2613–2623. https://doi.org/10.1609/aaai.v36i3.20163

  59. Xu Z, Wei K, Yang X, Deng C (2023) Point-supervised video temporal grounding. IEEE Trans Multimed 25:6121–6131. https://doi.org/10.1109/TMM.2022.3205404

  60. Li H, Shu X, He S, Qiao R, Wen W, Guo T, Gan B, Sun X (2023) D3g: Exploring gaussian prior for temporal sentence grounding with glance annotation. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), IEEE, Paris, France, pp 13688–13700. https://doi.org/10.1109/ICCV51070.2023.01263

  61. Chen J, Chen X, Ma L, Jie Z, Chua T-S (2018) Temporally grounding natural sentence in video. In: Riloff E, Chiang D, Hockenmaier J, Tsujii J (ed) Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 162–171. Association for Computational Linguistics, Brussels, Belgium. https://doi.org/10.18653/v1/D18-1015

  62. Zhang Z, Lin Z, Zhao Z, Xiao Z (2019) Cross-modal interaction networks for query-based moment retrieval in videos. In: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR’19, Association for Computing Machinery, New York, NY, USA, pp 655–664. https://doi.org/10.1145/3331184.3331235

  63. Yuan Y, Mei T, Zhu W (2019) To find where you talk: Temporal sentence localization in video with attention based location regression. Proceedings of the AAAI Conference on Artificial Intelligence 33(01):9159–9166. https://doi.org/10.1609/aaai.v33i01.33019159

  64. Zhang H, Sun A, Jing W, Zhen L, Zhou JT, Goh RSM (2021) Parallel attention network with sequence matching for video grounding. In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp 776–790. https://doi.org/10.18653/v1/2021.findings-acl.69

  65. Ju C, Wang H, Liu J, Ma C, Zhang Y, Zhao P, Chang J, Tian Q (2023) Constraint and union for partially-supervised temporal sentence grounding. arXiv. https://doi.org/10.48550/arXiv.2302.09850

  66. Liu M, Wang X, Nie L, He X, Chen B, Chua T-S (2018) Attentive moment retrieval in videos. In: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. SIGIR ’18, Association for Computing Machinery, New York, NY, USA, pp 15–24. https://doi.org/10.1145/3209978.3210003

  67. Mishra S, Seth S, Jain S, Pant V, Parikh J, Jain R, Islam SMN (2024) Image caption generation using vision transformer and gpt architecture. In: 2024 2nd International Conference on Advancement in Computation & Computer Technologies (InCACCT), pp 1–6. https://doi.org/10.1109/InCACCT61598.2024.10551257

Author information

Contributions

Yun Tian: Conceptualization of the research idea; development of the multimodal data augmentation framework; performing experiments and data analysis; drafting the manuscript. Xiaobo Guo: Supervision of the research process; providing technical guidance; reviewing and revising the manuscript critically for important intellectual content. Jinsong Wang: Assisting in the experimental setup and implementation; conducting performance evaluations on benchmark datasets; contributing to the interpretation of results. Bin Li: Supporting the integration of large language models into the framework; helping with the literature review and manuscript refinement; providing domain-specific insights. All authors have read and approved the final manuscript.

Corresponding author

Correspondence to Xiaobo Guo.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Tian, Y., Guo, X., Wang, J. et al. Enhancing video temporal grounding with large language model-based data augmentation. J Supercomput 81, 658 (2025). https://doi.org/10.1007/s11227-025-07159-0

Keywords