Abstract
Video captioning refers to the automatic generation of natural language sentences for a given video. The task poses two open problems: how to effectively combine multimodal features to better represent video content, and how to extract useful features from complex visual and linguistic information so as to generate more detailed descriptions. Addressing both difficulties together, we propose a video captioning method named ACF-Net (Appearance-guided Content Filter Network), which uses appearance information as a content filter that guides the network to discern discriminative information in both motion and object features. Specifically, we propose a new multimodal fusion method to alleviate the problem of insufficient fusion of video information. Unlike previous feature fusion methods that directly concatenate features, our fusion mechanism passes relevant content through content filters to form unified multimodal features. Moreover, we propose a hierarchical decoder with temporal semantic aggregation, which dynamically aggregates visual and linguistic features while generating each word, focusing on the most relevant temporal and semantic information. Extensive experiments on two benchmark datasets, MSVD and MSR-VTT, demonstrate the effectiveness of the proposed method.
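The appearance-guided filtering idea described above can be sketched schematically: appearance features produce gating values that modulate the motion and object streams before fusion, rather than concatenating raw features. This is an illustrative interpretation only, not the authors' exact architecture; the feature dimension, the random projections `W_mot` and `W_obj`, and the gate-then-concatenate layout are all assumptions made for the sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d = 8  # illustrative feature dimension

# Hypothetical per-clip features for the three modalities
f_app = rng.standard_normal(d)  # appearance (2D CNN)
f_mot = rng.standard_normal(d)  # motion (3D CNN)
f_obj = rng.standard_normal(d)  # object (detector)

# Stand-ins for learnable projections (random here, trained in practice)
W_mot = rng.standard_normal((d, d)) * 0.1
W_obj = rng.standard_normal((d, d)) * 0.1

# Appearance acts as a content filter: per-channel gates in (0, 1)
gate_mot = sigmoid(W_mot @ f_app)
gate_obj = sigmoid(W_obj @ f_app)

# Filtered modalities fused into one unified multimodal feature
fused = np.concatenate([gate_mot * f_mot, gate_obj * f_obj])
print(fused.shape)  # (16,)
```

The design choice this sketch illustrates: because the gates depend on appearance, the network can suppress motion or object channels that are irrelevant to the visible scene content, instead of letting concatenation pass all channels through unweighted.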
Data Availability
All data generated or analysed during this study are included in this published article (and its supplementary information files).
Funding
This work was supported by the National Natural Science Foundation of China (No. U22A2058, 62176138, 62176136), the National Key R&D Program of China (No. 2018YFB1305300), and the Shandong Provincial Key Research and Development Program (Major Scientific and Technological Innovation Project) (No. 2020CXGC010207, 2019JZZY010130).
Ethics declarations
Conflicts of interest
The authors have no competing interests to declare that are relevant to the content of this article.
Cite this article
Li, M., Liu, D., Liu, C. et al. ACF-net: appearance-guided content filter network for video captioning. Multimed Tools Appl 83, 31103–31122 (2024). https://doi.org/10.1007/s11042-023-16580-7