Retrieval Augmented Convolutional Encoder-decoder Networks for Video Captioning

Published: 23 January 2023

Abstract

Video captioning is an emerging research topic in computer vision that aims to generate a natural sentence correctly reflecting the visual content of a video. The well-established approach relies on the encoder-decoder paradigm, learning to encode the input video and decode the variable-length output sentence in a sequence-to-sequence manner. Nevertheless, these approaches often fail to produce sentences as complex, descriptive, and natural as those written by humans, since the models cannot memorize all of the visual content and syntactic structures in the human-annotated video-sentence pairs. In this article, we introduce a Retrieval Augmentation Mechanism (RAM) that enables explicit reference to existing video-sentence pairs within any encoder-decoder captioning model. Specifically, for each query video, a video-sentence retrieval model first fetches semantically relevant sentences from the training sentence pool, coupled with their corresponding training videos. RAM then writes the retrieved video-sentence pairs into memory and reads the memorized visual content and syntactic structures from memory to facilitate word prediction at each timestep. Furthermore, we present the Retrieval Augmented Convolutional Encoder-Decoder Network (R-ConvED), which integrates RAM into a convolutional encoder-decoder structure to boost video captioning. Extensive experiments on the MSVD, MSR-VTT, ActivityNet Captions, and VATEX datasets validate the superiority of our proposals and demonstrate quantitatively compelling results.
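To make the write/read behavior described above concrete, the following is a minimal, illustrative sketch in PyTorch. It is an assumption-laden toy rather than the paper's exact formulation: the class name, feature dimensions, and the scaled dot-product attention used for the memory read are hypothetical stand-ins. The sketch only shows the memory itself: features of the retrieved video-sentence pairs are written into a key/value memory once per query video, and at each decoding timestep the decoder's hidden state reads a context vector from that memory to help condition word prediction.

import torch
import torch.nn as nn
import torch.nn.functional as F


class RetrievalAugmentationMemory(nn.Module):
    """Toy key/value memory standing in for RAM (names and dimensions are hypothetical)."""

    def __init__(self, feat_dim: int, hidden_dim: int):
        super().__init__()
        self.key_proj = nn.Linear(feat_dim, hidden_dim)    # project retrieved pair features to memory keys
        self.value_proj = nn.Linear(feat_dim, hidden_dim)  # project retrieved pair features to memory values
        self.query_proj = nn.Linear(hidden_dim, hidden_dim)
        self.keys = None
        self.values = None

    def write(self, retrieved_feats: torch.Tensor) -> None:
        # retrieved_feats: (batch, k, feat_dim) features of the top-k retrieved
        # video-sentence pairs for each query video; written once per video.
        self.keys = self.key_proj(retrieved_feats)
        self.values = self.value_proj(retrieved_feats)

    def read(self, decoder_state: torch.Tensor) -> torch.Tensor:
        # decoder_state: (batch, hidden_dim) decoder hidden state at the current timestep.
        # Returns a (batch, hidden_dim) context vector read from memory via
        # scaled dot-product attention over the memorized pairs.
        query = self.query_proj(decoder_state).unsqueeze(1)      # (batch, 1, hidden)
        scores = torch.matmul(query, self.keys.transpose(1, 2))  # (batch, 1, k)
        attn = F.softmax(scores / self.keys.size(-1) ** 0.5, dim=-1)
        context = torch.matmul(attn, self.values).squeeze(1)     # (batch, hidden)
        return context


if __name__ == "__main__":
    # Toy usage: 2 query videos, 5 retrieved pairs each, 512-d pair features, 256-d decoder state.
    ram = RetrievalAugmentationMemory(feat_dim=512, hidden_dim=256)
    ram.write(torch.randn(2, 5, 512))        # write retrieved pairs into memory
    context = ram.read(torch.randn(2, 256))  # read at one decoding timestep
    print(context.shape)                     # torch.Size([2, 256])

In the article's R-ConvED, the result of such a read is used to facilitate word prediction at each timestep of the convolutional decoder; the retrieval step itself (fetching the top-k semantically relevant pairs) is performed by a separate video-sentence retrieval model and is omitted from this sketch.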

      Published In

      ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 1s
      February 2023
      504 pages
      ISSN: 1551-6857
      EISSN: 1551-6865
      DOI: 10.1145/3572859
      • Editor: Abdulmotaleb El Saddik

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 23 January 2023
      Online AM: 26 May 2022
      Accepted: 16 May 2022
      Revised: 12 April 2022
      Received: 03 December 2021
      Published in TOMM Volume 19, Issue 1s

      Author Tags

      1. Video captioning
      2. deep convolutional neural networks

      Qualifiers

      • Research-article
      • Refereed

      Funding Sources

      • NSF of China

      Cited By

      • (2025) An effective video captioning based on language description using a novel Graylag Deep Kookaburra Reinforcement Learning. EURASIP Journal on Image and Video Processing 2025:1. DOI: 10.1186/s13640-024-00662-z. Online publication date: 9-Jan-2025.
      • (2025) Hierarchical Banzhaf Interaction for General Video-Language Representation Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 47:3, 2125-2139. DOI: 10.1109/TPAMI.2024.3522124. Online publication date: Mar-2025.
      • (2025) SRVC-LA: Sparse regularization of visual context and latent attention based model for video description. Neurocomputing 630, 129639. DOI: 10.1016/j.neucom.2025.129639. Online publication date: May-2025.
      • (2024) Dual Dynamic Threshold Adjustment Strategy. ACM Transactions on Multimedia Computing, Communications, and Applications 20:7, 1-18. DOI: 10.1145/3656047. Online publication date: 15-May-2024.
      • (2024) Sentiment-Oriented Transformer-Based Variational Autoencoder Network for Live Video Commenting. ACM Transactions on Multimedia Computing, Communications, and Applications 20:4, 1-24. DOI: 10.1145/3633334. Online publication date: 11-Jan-2024.
      • (2024) Memory-Based Augmentation Network for Video Captioning. IEEE Transactions on Multimedia 26, 2367-2379. DOI: 10.1109/TMM.2023.3295098. Online publication date: 1-Jan-2024.
      • (2024) Dual-Adversarial Representation Disentanglement for Visible Infrared Person Re-Identification. IEEE Transactions on Information Forensics and Security 19, 2186-2200. DOI: 10.1109/TIFS.2023.3344289. Online publication date: 1-Jan-2024.
      • (2024) Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 13894-13904. DOI: 10.1109/CVPR52733.2024.01318. Online publication date: 16-Jun-2024.
      • (2024) Video captioning based on dual learning via multiple reconstruction blocks. Image and Vision Computing 148, 105119. DOI: 10.1016/j.imavis.2024.105119. Online publication date: Aug-2024.
      • (2024) Twinenet: coupling features for synthesizing volume rendered images via convolutional encoder–decoders and multilayer perceptrons. The Visual Computer: International Journal of Computer Graphics 40:10, 7201-7220. DOI: 10.1007/s00371-024-03368-5. Online publication date: 1-Oct-2024.
