Spatial-aware topic-driven-based image Chinese caption for disaster news

  • Original Article
  • Published in Neural Computing and Applications

Abstract

Automatically generating descriptions for disaster news images can accelerate the spread of disaster information and relieve news editors of tedious work on news materials. Image caption algorithms are remarkable for generating captions directly from image content. However, current image caption algorithms trained on existing image caption datasets fail to describe disaster images with the fundamental news elements. In this paper, we developed a large-scale Chinese caption dataset of disaster news images (DNICC19k), for which we collected and annotated a large number of disaster-related news images. Furthermore, we proposed a spatial-aware topic-driven caption network (STCNet) that encodes the interrelationships between news objects and generates descriptive sentences related to news topics. STCNet first constructs a graph representation based on object feature similarity. Its graph reasoning module then uses spatial information to infer the weights of aggregated adjacent nodes according to a learnable Gaussian kernel function. Finally, the generation of news sentences is driven by the spatial-aware graph representations and the news topic distribution. Experimental results demonstrate that STCNet trained on DNICC19k not only automatically creates descriptive sentences related to news topics for disaster news images, but also outperforms benchmark models such as Bottom-up, NIC, Show, Attend and Tell, and AoANet on multiple evaluation metrics, achieving CIDEr and BLEU-4 scores of 60.26 and 17.01, respectively.
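
The graph reasoning step described in the abstract (feature-similarity edges whose aggregation weights are modulated by a learnable Gaussian kernel over object positions) can be sketched as follows. This is a minimal illustration under those stated assumptions, written in PyTorch; the module name, tensor shapes, and hyperparameters are hypothetical and do not come from the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialGraphReasoning(nn.Module):
    """Illustrative sketch: aggregate object features over a similarity graph,
    with neighbor weights modulated by a learnable Gaussian kernel over
    pairwise spatial distances between box centers (an assumption based on
    the abstract, not the authors' released implementation)."""

    def __init__(self, feat_dim: int):
        super().__init__()
        self.proj = nn.Linear(feat_dim, feat_dim)
        # Learnable bandwidth (sigma) of the Gaussian kernel, a single scalar here.
        self.log_sigma = nn.Parameter(torch.zeros(1))

    def forward(self, feats: torch.Tensor, centers: torch.Tensor) -> torch.Tensor:
        # feats:   (N, D) region features from an object detector
        # centers: (N, 2) normalized (x, y) box centers in [0, 1]
        f = F.normalize(feats, dim=-1)
        sim = f @ f.t()                                      # (N, N) feature similarity
        dist2 = torch.cdist(centers, centers).pow(2)         # (N, N) squared spatial distance
        sigma = self.log_sigma.exp()
        spatial = torch.exp(-dist2 / (2 * sigma ** 2))       # Gaussian kernel weights
        adj = F.softmax(sim * spatial, dim=-1)               # spatially modulated adjacency
        return F.relu(self.proj(adj @ feats)) + feats        # aggregate neighbors, residual link

# Usage on dummy data: 36 detected regions with 2048-d features.
regions = torch.randn(36, 2048)
boxes_xy = torch.rand(36, 2)
out = SpatialGraphReasoning(2048)(regions, boxes_xy)         # (36, 2048) spatial-aware node features
```

Scaling the similarity-based adjacency by a Gaussian of the spatial distance lets nearby objects contribute more to each node's update, which is the intuition behind spatial-aware aggregation.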

Data availability

Because the dataset generated in this study is still being enlarged and organized, it is not publicly available at this time, but it can be obtained from the corresponding author upon reasonable request.

Notes

  1. https://github.com/fi94/News_Dataset.

  2. We regard parades and riots with serious social impacts as types of disasters in this version.

  3. https://github.com/fxsjy/jieba.
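
For readers unfamiliar with it, jieba (note 3) is a widely used toolkit for segmenting Chinese text into words, which is typically required before computing word-level metrics such as BLEU on Chinese captions. A minimal, self-contained example on a made-up caption (the sentence and the printed segmentation are illustrative only):

```python
import jieba  # Chinese word segmentation, https://github.com/fxsjy/jieba

# Hypothetical disaster-news caption used only for illustration.
caption = "消防员在地震废墟中救援被困群众"
tokens = jieba.lcut(caption)  # exact mode, returns a list of words
print(tokens)                 # e.g. ['消防员', '在', '地震', '废墟', '中', '救援', '被困', '群众']
```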


Acknowledgements

This work was supported by the Fundamental Research Funds for the Central Universities [grant number CUC210B018].

Author information

Contributions

Conceptualization: JZ, YZ, YZ; Methodology: JZ; Validation: JZ; Formal analysis: JZ, YZ; Investigation: JZ, YZ, YZ; Data Curation: JZ, YZ, YZ; Writing-Original Draft: JZ; Visualization: JZ; Resources: YZ, YZ, CY; Writing-Review & Editing: YZ, YZ, CY, HP; Supervision: YZ, YZ, CY; Funding acquisition: YZ; Project administration: YZ, YZ.

Corresponding author

Correspondence to Yaping Zhu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest concerning the publication of this manuscript.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Cite this article

Zhou, J., Zhu, Y., Zhang, Y. et al. Spatial-aware topic-driven-based image Chinese caption for disaster news. Neural Comput & Applic 35, 9481–9500 (2023). https://doi.org/10.1007/s00521-022-08072-w

