Abstract
Mainstream video captioning models (VCMs) are trained in a fully supervised manner that relies heavily on large-scale, high-quality video-caption pairs. Unfortunately, an examination of the corpora of benchmark datasets reveals many defects in the human-labeled annotations, such as variation in caption length and quality for a single video and word imbalance across captions. Such defects can significantly hinder model training. In this study, we propose to reduce the adverse impact of these annotations and encourage VCMs to learn from high-quality captions and more informative words via a Consensus-Guided Keyword Targeting (CGKT) training strategy. Specifically, CGKT first re-weights each training caption using the consensus-based metric CIDEr. Second, it assigns larger weights to informative, less common words based on their frequency. Extensive experiments on MSVD and MSR-VTT show that the proposed CGKT readily works with three VCMs and achieves significant CIDEr improvements. Moreover, compared with the conventional cross-entropy objective, CGKT facilitates the generation of more comprehensive and higher-quality captions.
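For illustration only, the following is a minimal sketch of how a caption- and word-weighted cross-entropy objective in the spirit of CGKT might look, assuming a precomputed CIDEr score per reference caption and a corpus frequency per vocabulary word. It is not the authors' implementation; all function and variable names here are hypothetical.

```python
# Illustrative sketch (not from the paper): a weighted cross-entropy loss that
# (1) scales each caption's loss by its CIDEr consensus score and
# (2) up-weights rare (more informative) words via inverse frequency.
import torch
import torch.nn.functional as F


def cgkt_loss(logits, targets, caption_cider, word_freq, pad_id=0, eps=1e-8):
    """
    logits:        (B, T, V) raw scores from the captioning model
    targets:       (B, T)    ground-truth word indices
    caption_cider: (B,)      CIDEr score of each reference caption
    word_freq:     (V,)      corpus frequency of each vocabulary word
    """
    B, T, V = logits.shape
    # Per-token negative log-likelihood, with padding positions zeroed out.
    nll = F.cross_entropy(
        logits.reshape(B * T, V), targets.reshape(B * T),
        ignore_index=pad_id, reduction="none",
    ).reshape(B, T)

    # Keyword targeting: rarer words receive larger weights (normalized around 1).
    inv_freq = 1.0 / (word_freq.float() + eps)
    word_w = inv_freq / inv_freq.mean()
    token_w = word_w[targets]                    # (B, T)

    # Consensus guidance: captions with higher CIDEr contribute more to the loss.
    caption_w = caption_cider.unsqueeze(1)       # (B, 1), broadcast over time

    mask = (targets != pad_id).float()
    weighted = nll * token_w * caption_w * mask
    return weighted.sum() / mask.sum().clamp(min=1.0)
```

Such a formulation would leave the underlying captioning architecture untouched and only modify the training objective, which is consistent with the abstract's claim that the strategy can be combined with different VCMs.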
Acknowledgements
This paper was partially supported by NSFC (No. 62176008) and the Shenzhen Science and Technology Research Program (No. GXWD20201231165807007-20200814115301001).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Ji, P., Yang, B., Zhang, T., Zou, Y. (2022). Consensus-Guided Keyword Targeting for Video Captioning. In: Yu, S., et al. Pattern Recognition and Computer Vision. PRCV 2022. Lecture Notes in Computer Science, vol 13536. Springer, Cham. https://doi.org/10.1007/978-3-031-18913-5_21
DOI: https://doi.org/10.1007/978-3-031-18913-5_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-18912-8
Online ISBN: 978-3-031-18913-5