
ACF-net: appearance-guided content filter network for video captioning

Published in: Multimedia Tools and Applications

Abstract

Video captioning refers to the automatic generation of natural language sentences for a given video. Two open problems remain for this task: how to effectively combine multimodal features to better represent video content, and how to extract useful features from complex visual and linguistic information to generate more detailed descriptions. Considering these two difficulties together, we propose a video captioning method named ACF-Net (Appearance-guided Content Filter Network), which uses appearance information as a content filter to guide the network toward discriminative information in both motion and object features. Specifically, we propose a new multimodal fusion method that alleviates the problem of insufficient fusion of video information. Unlike previous feature fusion methods that directly concatenate features, our fusion mechanism selects relevant content through content filters to form unified multimodal features. Moreover, we propose a hierarchical decoder with temporal semantic aggregation, which dynamically aggregates visual and linguistic features while generating the corresponding words, focusing on the most relevant temporal and semantic information. Extensive experiments on two benchmark datasets, MSVD and MSR-VTT, demonstrate the effectiveness of the proposed method.
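
The article does not include code; the following is a minimal PyTorch sketch of the appearance-guided content-filter idea as described in the abstract, not the authors' implementation. The module name, gating formulation, and feature dimensions (e.g. 1536 for appearance, 1024 for motion, 2048 for object features) are illustrative assumptions.

```python
# Sketch only: appearance features act as a "content filter" that gates
# motion and object features before they are fused into one representation.
import torch
import torch.nn as nn

class AppearanceGuidedFusion(nn.Module):
    def __init__(self, d_app, d_motion, d_obj, d_model):
        super().__init__()
        # Project each modality into a common embedding space.
        self.proj_app = nn.Linear(d_app, d_model)
        self.proj_motion = nn.Linear(d_motion, d_model)
        self.proj_obj = nn.Linear(d_obj, d_model)
        # Appearance-conditioned gates for the motion and object streams.
        self.gate_motion = nn.Sequential(nn.Linear(d_model, d_model), nn.Sigmoid())
        self.gate_obj = nn.Sequential(nn.Linear(d_model, d_model), nn.Sigmoid())
        self.fuse = nn.Linear(3 * d_model, d_model)

    def forward(self, app, motion, obj):
        # app, motion, obj: (batch, seq_len, d_*) frame-level features.
        a = self.proj_app(app)
        m = self.proj_motion(motion) * self.gate_motion(a)  # keep motion content relevant to appearance
        o = self.proj_obj(obj) * self.gate_obj(a)            # keep object content relevant to appearance
        # Concatenate gated streams and fuse into a unified multimodal feature.
        return self.fuse(torch.cat([a, m, o], dim=-1))

# Toy usage with assumed feature sizes:
fusion = AppearanceGuidedFusion(d_app=1536, d_motion=1024, d_obj=2048, d_model=512)
v = fusion(torch.randn(2, 26, 1536), torch.randn(2, 26, 1024), torch.randn(2, 26, 2048))
print(v.shape)  # torch.Size([2, 26, 512])
```

The gating replaces direct concatenation: each auxiliary stream is scaled element-wise by an appearance-derived sigmoid mask before fusion, which is one simple way to realize the "content filter" behavior the abstract describes.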


Data Availability

All data generated or analysed during this study are included in this published article (and its supplementary information files).


Funding

This work was supported by the National Natural Science Foundation of China (No. U22A2058, 62176138, 62176136), the National Key R&D Program of China (No. 2018YFB1305300), and the Shandong Provincial Key Research and Development Program (Major Scientific and Technological Innovation Project) (No. 2020CXGC010207, 2019JZZY010130).

Author information

Corresponding authors

Correspondence to Chunsheng Liu or Faliang Chang.

Ethics declarations

Conflicts of interest

The authors have no competing interests to declare that are relevant to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Li, M., Liu, D., Liu, C. et al. ACF-net: appearance-guided content filter network for video captioning. Multimed Tools Appl 83, 31103–31122 (2024). https://doi.org/10.1007/s11042-023-16580-7


