Abstract
Automatically generating descriptions for disaster news images can accelerate the dissemination of disaster information and relieve news editors of tedious processing of news material. Image captioning algorithms are remarkable for generating captions directly from image content. However, current image captioning algorithms trained on existing caption datasets fail to describe disaster images with the fundamental news elements. In this paper, we develop a large-scale Chinese caption dataset of disaster news images (DNICC19k), for which we collected and annotated a large number of disaster-related news images. Furthermore, we propose a spatial-aware topic-driven caption network (STCNet) to encode the interrelationships between news objects and generate descriptive sentences related to news topics. STCNet first constructs a graph representation based on object feature similarity. Its graph reasoning module then uses spatial information to infer the weights of aggregated adjacent nodes according to a learnable Gaussian kernel function. Finally, news sentence generation is driven by the spatial-aware graph representations and the news topic distribution. Experimental results demonstrate that STCNet trained on DNICC19k not only automatically creates descriptive sentences related to news topics for disaster news images, but also outperforms benchmark models such as Bottom-Up, NIC, Show-Attend-Tell, and AoANet on multiple evaluation metrics, achieving CIDEr and BLEU-4 scores of 60.26 and 17.01, respectively.
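To make the graph reasoning step described above concrete, the sketch below shows one way a similarity-based object graph can be re-weighted by a learnable Gaussian kernel over pairwise spatial distances before feature aggregation. This is a minimal illustrative PyTorch sketch, not the authors' implementation: the class name SpatialAwareGraphLayer, the cosine-similarity adjacency, the single bandwidth parameter, and all dimensions are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialAwareGraphLayer(nn.Module):
    """Toy sketch of spatial-aware graph reasoning over detected regions.

    Edges come from feature similarity; a learnable Gaussian kernel on
    box-center distances modulates how much each neighbor contributes.
    (Illustrative only; the exact formulation in the paper may differ.)
    """

    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        # learnable bandwidth of the Gaussian kernel (assumption)
        self.log_sigma = nn.Parameter(torch.zeros(1))

    def forward(self, feats, centers):
        # feats: (N, dim) region features; centers: (N, 2) normalized box centers
        f = F.normalize(feats, dim=-1)
        sim = f @ f.t()                                            # cosine-similarity adjacency
        dist2 = torch.cdist(centers, centers).pow(2)               # squared spatial distances
        kernel = torch.exp(-dist2 / (2 * self.log_sigma.exp() ** 2))  # Gaussian spatial weights
        weights = F.softmax(sim * kernel, dim=-1)                  # spatial-aware edge weights
        return F.relu(self.proj(weights @ feats))                  # aggregated node features


# Usage: e.g. 36 detected regions with 2048-d features
layer = SpatialAwareGraphLayer(dim=2048)
out = layer(torch.randn(36, 2048), torch.rand(36, 2))
print(out.shape)  # torch.Size([36, 2048])
```

Under this reading, nearby objects with similar features dominate the aggregation, which is the intuition behind conditioning the aggregation weights on spatial layout.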
Data availability
As the dataset of this study is still being enlarged and organized, it is not publicly available at this time, but it can be obtained from the corresponding authors upon reasonable request.
Notes
We regard parades and riots with serious social impacts as types of disasters in this version.
References
Vinyals O, Toshev A, Bengio S, Erhan D (2015) Show and tell: a neural image caption generator. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3156–3164. https://doi.org/10.1109/CVPR.2015.7298935
Xu K, Ba J, Kiros R, Cho K, Courville A, Salakhudinov R, Zemel R, Bengio Y (2015) Show, attend and tell: neural image caption generation with visual attention. In: International conference on machine learning, vol 37, pp 2048–2057. PMLR. https://doi.org/10.5555/3045118.3045336
Lu J, Xiong C, Parikh D, Socher R (2017) Knowing when to look: adaptive attention via a visual sentinel for image captioning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 375–383. https://doi.org/10.1109/CVPR.2017.345
Anderson P, He X, Buehler C, Teney D, Johnson M, Gould S, Zhang L (2018) Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6077–6086. https://doi.org/10.1109/CVPR.2018.00636
Feng Y, Lapata M (2010) How many words is a picture worth? Automatic caption generation for news images. In: Proceedings of the 48th annual meeting of the association for computational linguistics, pp 1239–1249. https://doi.org/10.5555/1858681.1858807
Vijay K, Ramya D (2015) Generation of caption selection for news images using stemming algorithm. In: International conference on computation of power, energy, information and communication, pp 0536–0540. IEEE. https://doi.org/10.1109/ICCPEIC.2015.7259513
Lu D, Whitehead S, Huang L, Ji H, Chang S-F (2018) Entity-aware image caption generation. In: Proceedings of the 2018 conference on empirical methods in natural language processing, pp 4013–4023. https://doi.org/10.18653/v1/D18-1435
Yang Z, Okazaki N (2020) Image caption generation for news articles. In: Proceedings of the 28th international conference on computational linguistics, pp 1941–1951. https://doi.org/10.18653/v1/2020.coling-main.176
Jing Y, Zhiwei X, Guanglai G (2020) Context-driven image caption with global semantic relations of the named entities. IEEE Access 8:143584–143594. https://doi.org/10.1109/ACCESS.2020.3013321
Rashtchian C, Young P, Hodosh M, Hockenmaier J (2010) Collecting image annotations using Amazon’s Mechanical Turk. In: Proceedings of the NAACL HLT 2010 workshop on creating speech and language data with Amazon’s Mechanical Turk, pp 139–147. https://doi.org/10.5555/1866696.1866717
Young P, Lai A, Hodosh M, Hockenmaier J (2014) From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. Trans Assoc Comput Linguist 2:67–78. https://doi.org/10.1162/tacl_a_00166
Lin T-Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL (2014) Microsoft coco: common objects in context. In: European conference on computer vision. Springer, pp 740–755. https://doi.org/10.1007/978-3-319-10602-1_48
Yao T, Pan Y, Li Y, Tao M (2018) Exploring visual relationship for image captioning. In: European conference on computer vision. Springer, pp 711–727. https://doi.org/10.1007/978-3-030-01264-9_42
Yang X, Tang K, Zhang H, Cai J (2019) Auto-encoding scene graphs for image captioning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 10685–10694. https://doi.org/10.1109/TPAMI.2020.3042192
Bai S, An S (2018) A survey on automatic image caption generation. Neurocomputing 311:291–304. https://doi.org/10.1016/j.neucom.2018.05.080
Farhadi A, Hejrati M, Sadeghi MA, Young P, Rashtchian C, Hockenmaier J, Forsyth DA (2010) Every picture tells a story: generating sentences from images. In: European conference on computer vision, pp 15–29. https://doi.org/10.1007/978-3-642-15561-1_2
Ordonez V, Kulkarni G, Berg T (2011) Im2text: describing images using 1 million captioned photographs. Adv Neural Inf Process Syst 24:1143–1151. https://doi.org/10.5555/2986459.2986587
Yang Y, Teo C, Daumé III H, Aloimonos Y (2011) Corpus-guided sentence generation of natural images. In: Proceedings of the 2011 conference on empirical methods in natural language processing, pp 444–454. https://doi.org/10.5555/2145432.2145484
Kulkarni G, Premraj V, Ordonez V, Dhar S, Li S, Choi Y, Berg AC, Berg TL (2013) Babytalk: understanding and generating simple image descriptions. IEEE Trans Pattern Anal Mach Intell 35(12):2891–2903. https://doi.org/10.1109/CVPR.2011.5995466
Bai L, Li K, Pei J, Jiang S (2015) Main objects interaction activity recognition in real images. Neural Comput Appl 27:335–348. https://doi.org/10.1007/s00521-015-1846-7
Rennie SJ, Marcheret E, Mroueh Y, Ross J, Goel V (2017) Self-critical sequence training for image captioning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7008–7024. https://doi.org/10.1109/CVPR.2017.131
Wu Q, Shen C, Liu L, Dick A, Van Den Hengel A (2016) What value do explicit high level concepts have in vision to language problems?. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 203–212. https://doi.org/10.1109/CVPR.2016.29
You Q, Jin H, Wang Z, Fang C, Luo J (2016) Image captioning with semantic attention. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4651–4659. https://doi.org/10.1109/CVPR.2016.503
Gan Z, Gan C, He X, Pu Y, Tran K, Gao J, Carin L, Deng L (2017) Semantic compositional networks for visual captioning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5630–5639. https://doi.org/10.1109/CVPR.2017.127
Lu J, Yang J, Batra D, Parikh D (2018) Neural baby talk. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7219–7228. https://doi.org/10.1109/CVPR.2018.00754
Agrawal H, Desai K, Wang Y, Chen X, Jain R, Johnson M, Batra D, Parikh D, Lee S, Anderson P (2019) Nocaps: novel object captioning at scale. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 8948–8957. https://doi.org/10.1109/ICCV.2019.00904
Li L, Gan Z, Cheng Y, Liu J (2019) Relation-aware graph attention network for visual question answering. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 10313–10322. https://doi.org/10.1109/ICCV.2019.01041
Johnson J, Krishna R, Stark M, Li L-J, Shamma D, Bernstein M, Fei-Fei L (2015) Image retrieval using scene graphs. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3668–3678. https://doi.org/10.1109/CVPR.2015.7298990
Gupta N, Jalal A (2020) Integration of textual cues for fine-grained image captioning using deep CNN and LSTM. Neural Comput Appl 32:17899–17908. https://doi.org/10.1007/s00521-019-04515-z
Lu J, Batra D, Parikh D, Lee S (2019) Vilbert: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In: Advances in neural information processing systems, pp 13–23. https://doi.org/10.5555/3454287.3454289
Li G, Duan N, Fang Y, Jiang D, Zhou M (2020) Unicoder-vl: a universal encoder for vision and language by cross-modal pre-training. In: AAAI, pp 11336–11344. https://doi.org/10.1609/aaai.v34i07.6795
Zhou L, Palangi H, Zhang L, Hu H, Corso JJ, Gao J (2019) Unified vision-language pre-training for image captioning and vqa. In: Proceedings of the AAAI conference on artificial intelligence, pp 13041–13049. https://doi.org/10.1609/AAAI.V34I07.7005
Li X, Yin X, Li C, Zhang P, Hu X, Zhang L, Wang L, Hu H, Dong L, Wei F (2020) Oscar: object-semantics aligned pre-training for vision-language tasks. In: Computer vision–ECCV 2020. Lecture notes in computer science, pp 121–137. https://doi.org/10.1007/978-3-030-58577-8_8
Liu M, Hu H, Li L, Yu Y, Guan W (2020) Chinese image caption generation via visual attention and topic modeling. IEEE Trans Cybern 1:1–11. https://doi.org/10.1109/TCYB.2020.2997034
Liu M, Li L, Hu H, Guan W, Tian J (2020) Image caption generation with dual attention mechanism. Inf Process Manag 57(2):102178. https://doi.org/10.1016/j.ipm.2019.102178
Wang B, Wang C, Zhang Q, Su Y, Wang Y, Xu Y (2020) Cross-lingual image caption generation based on visual attention model. IEEE Access 8:104543–104554. https://doi.org/10.1109/ACCESS.2020.2999568
Wu J, Zheng H, Zhao B, Li Y, Yan B, Liang R, Wang W, Zhou S, Lin G, Fu Y, et al (2017) Ai challenger: a large-scale dataset for going deeper in image understanding. In: IEEE international conference on multimedia and expo, pp 1480–1485. https://doi.org/10.1109/ICME.2019.00256
Tran A, Mathews A, Xie L (2020) Transform and tell: entity-aware news image captioning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 13035–13045
Krishna R, Zhu Y, Groth O, Johnson J, Hata K, Kravitz J, Chen S, Kalantidis Y, Li L-J, Shamma DA et al (2017) Visual genome: connecting language and vision using crowdsourced dense image annotations. Int J Comput Vis 123(1):32–73. https://doi.org/10.1007/s11263-016-0981-7
Xu H, Jiang C, Liang X, Li Z (2019) Spatial-aware graph relation network for large-scale object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 9298–9307. https://doi.org/10.1109/CVPR.2019.00952
Feng Y, Lapata M (2012) Automatic caption generation for news images. IEEE Trans Pattern Anal Mach Intell 35(4):797–812. https://doi.org/10.1109/TPAMI.2012.118
Biten AF, Gomez L, Rusinol M, Karatzas D (2019) Good news, everyone! context driven entity-aware captioning for news images. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 12466–12475. https://doi.org/10.1109/CVPR.2019.01275
Ren S, He K, Girshick R, Sun J (2017) Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell 39(6):1137–1149. https://doi.org/10.1109/TPAMI.2016.2577031
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778. https://doi.org/10.1109/CVPR.2016.90
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. In: Advances in neural information processing systems, pp 6000–6010. https://doi.org/10.5555/3295222.3295349
Monti F, Boscaini D, Masci J, Rodola E, Svoboda J, Bronstein MM (2017) Geometric deep learning on graphs and manifolds using mixture model cnns. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5115–5124. https://doi.org/10.1109/CVPR.2017.576
Lin T-Y, Dollár P, Girshick R, He K, Hariharan B, Belongie S (2017) Feature pyramid networks for object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2117–2125. https://doi.org/10.1109/CVPR.2017.106
Krizhevsky A, Sutskever I, Hinton GE (2017) Imagenet classification with deep convolutional neural networks. Commun ACM 60(6):89–90. https://doi.org/10.1145/3065386
Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: 3rd international conference on learning representations, ICLR, pp 1–15
Papineni K, Roukos S, Ward T, Zhu W-J (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the association for computational linguistics, pp 311–318. https://doi.org/10.3115/1073083.1073135
Banerjee S, Lavie A (2005) Meteor: an automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pp 65–72. https://doi.org/10.3115/1626355.1626389
Lin C-Y (2004) Rouge: a package for automatic evaluation of summaries. In: Text summarization branches out: proceedings of the ACL-04 workshop, pp 74–81
Vedantam R, Zitnick CL, Parikh D (2015) Cider: consensus-based image description evaluation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4566–4575. https://doi.org/10.1109/CVPR.2015.7299087
Anderson P, Fernando B, Johnson M, Gould S (2016) Spice: semantic propositional image caption evaluation. In: European conference on computer vision. Springer, pp 382–398. https://doi.org/10.1007/978-3-319-46454-1_24
Ramos J et al (2003) Using tf-idf to determine word relevance in document queries. In: Proceedings of the 1st instructional conference on machine learning, vol 242. Citeseer, pp 29–48. https://doi.org/10.5120/ijca2018917395.
Huang L, Wang W, Chen J, Wei X-Y (2019) Attention on attention for image captioning. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 4634–4643. https://doi.org/10.1109/ICCV.2019.00473
Acknowledgements
This work was supported by the Fundamental Research Funds for the Central Universities [grant number CUC210B018].
Author information
Authors and Affiliations
Contributions
Conceptualization: JZ, YZ, YZ; Methodology: JZ; Validation: JZ; Formal analysis: JZ, YZ; Investigation: JZ, YZ, YZ; Data Curation: JZ, YZ, YZ; Writing-Original Draft: JZ; Visualization: JZ; Resources: YZ, YZ, CY; Writing-Review & Editing: YZ, YZ, CY, HP; Supervision: YZ, YZ, CY; Funding acquisition: YZ; Project administration: YZ, YZ.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest concerning the publication of this manuscript.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhou, J., Zhu, Y., Zhang, Y. et al. Spatial-aware topic-driven-based image Chinese caption for disaster news. Neural Comput & Applic 35, 9481–9500 (2023). https://doi.org/10.1007/s00521-022-08072-w
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00521-022-08072-w