Abstract
Video captioning refers to the automatic generation of natural language sentences for a given video. The task poses two open problems: how to effectively combine multimodal features to better represent video content, and how to extract useful features from complex visual and linguistic information so as to generate more detailed descriptions. Addressing both difficulties together, we propose a video captioning method named ACF-Net (Appearance-guided Content Filter Network), which uses appearance information as a content filter that guides the network to discern discriminative information in both motion and object features. Specifically, we propose a new multimodal fusion method to alleviate the problem of insufficient fusion of video information. Unlike previous feature fusion methods that directly concatenate features, our fusion mechanism passes relevant content through content filters to form unified multimodal features. Moreover, we propose a hierarchical decoder with temporal semantic aggregation, which dynamically aggregates visual and linguistic features while generating each word, focusing on the most relevant temporal and semantic information. Extensive experiments on two benchmark datasets, MSVD and MSR-VTT, demonstrate the effectiveness of the proposed method.
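The appearance-guided filtering idea described above can be sketched schematically: appearance features produce gating values that modulate the motion and object streams before fusion, rather than concatenating raw features. This is an illustrative interpretation only, not the authors' exact architecture; the feature dimension, the random projections `W_mot` and `W_obj`, and the gate-then-concatenate layout are all assumptions made for the sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d = 8  # illustrative feature dimension

# Hypothetical per-clip features for the three modalities
f_app = rng.standard_normal(d)  # appearance (2D CNN)
f_mot = rng.standard_normal(d)  # motion (3D CNN)
f_obj = rng.standard_normal(d)  # object (detector)

# Stand-ins for learnable projections (random here, trained in practice)
W_mot = rng.standard_normal((d, d)) * 0.1
W_obj = rng.standard_normal((d, d)) * 0.1

# Appearance acts as a content filter: per-channel gates in (0, 1)
gate_mot = sigmoid(W_mot @ f_app)
gate_obj = sigmoid(W_obj @ f_app)

# Filtered modalities fused into one unified multimodal feature
fused = np.concatenate([gate_mot * f_mot, gate_obj * f_obj])
print(fused.shape)  # (16,)
```

The design choice this sketch illustrates: because the gates depend on appearance, the network can suppress motion or object channels that are irrelevant to the visible scene content, instead of letting concatenation pass all channels through unweighted.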
Data Availability
All data generated or analysed during this study are included in this published article (and its supplementary information files).
Funding
This work was supported by the National Natural Science Foundation of China (No. U22A2058, 62176138, 62176136), the National Key R&D Program of China (No. 2018YFB1305300), and the Shandong Provincial Key Research and Development Program (Major Scientific and Technological Innovation Project) (No. 2020CXGC010207, 2019JZZY010130).
Ethics declarations
Conflicts of interest
The authors have no competing interests to declare that are relevant to the content of this article.
Cite this article
Li, M., Liu, D., Liu, C. et al. ACF-net: appearance-guided content filter network for video captioning. Multimed Tools Appl 83, 31103–31122 (2024). https://doi.org/10.1007/s11042-023-16580-7