Abstract
Automatically generating descriptions for disaster news images can accelerate the dissemination of disaster information and relieve news editors of tedious processing of news material. Image captioning algorithms are remarkable for generating captions directly from image content. However, current image captioning algorithms trained on existing caption datasets fail to describe disaster images with the fundamental news elements. In this paper, we develop a large-scale Chinese caption dataset of disaster news images (DNICC19k), for which we collected and annotated a large number of disaster-related news images. Furthermore, we propose a spatial-aware topic-driven caption network (STCNet) to encode the interrelationships between news objects and generate descriptive sentences related to news topics. STCNet first constructs a graph representation based on object feature similarity. Its graph reasoning module then uses spatial information to infer the weights of aggregated adjacent nodes according to a learnable Gaussian kernel function. Finally, news sentence generation is driven by the spatial-aware graph representations and the news topic distribution. Experimental results demonstrate that STCNet trained on DNICC19k not only automatically creates descriptive sentences related to news topics for disaster news images, but also outperforms benchmark models such as Bottom-Up, NIC, Show-Attend-Tell, and AoANet on multiple evaluation metrics, achieving CIDEr and BLEU-4 scores of 60.26 and 17.01, respectively.
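To make the graph reasoning step described above concrete, the sketch below shows one way a similarity-based object graph can be re-weighted by a learnable Gaussian kernel over pairwise spatial distances before feature aggregation. This is a minimal illustrative PyTorch sketch, not the authors' implementation: the class name SpatialAwareGraphLayer, the cosine-similarity adjacency, the single bandwidth parameter, and all dimensions are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialAwareGraphLayer(nn.Module):
    """Toy sketch of spatial-aware graph reasoning over detected regions.

    Edges come from feature similarity; a learnable Gaussian kernel on
    box-center distances modulates how much each neighbor contributes.
    (Illustrative only; the exact formulation in the paper may differ.)
    """

    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        # learnable bandwidth of the Gaussian kernel (assumption)
        self.log_sigma = nn.Parameter(torch.zeros(1))

    def forward(self, feats, centers):
        # feats: (N, dim) region features; centers: (N, 2) normalized box centers
        f = F.normalize(feats, dim=-1)
        sim = f @ f.t()                                            # cosine-similarity adjacency
        dist2 = torch.cdist(centers, centers).pow(2)               # squared spatial distances
        kernel = torch.exp(-dist2 / (2 * self.log_sigma.exp() ** 2))  # Gaussian spatial weights
        weights = F.softmax(sim * kernel, dim=-1)                  # spatial-aware edge weights
        return F.relu(self.proj(weights @ feats))                  # aggregated node features


# Usage: e.g. 36 detected regions with 2048-d features
layer = SpatialAwareGraphLayer(dim=2048)
out = layer(torch.randn(36, 2048), torch.rand(36, 2))
print(out.shape)  # torch.Size([36, 2048])
```

Under this reading, nearby objects with similar features dominate the aggregation, which is the intuition behind conditioning the aggregation weights on spatial layout.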
Data availability
As the dataset of this study is still being enlarged and organized, it is not publicly available at this time, but it can be obtained from the corresponding authors upon reasonable request.
Notes
We regard parades and riots with serious social impacts as types of disasters in this version.
References
Vinyals O, Toshev A, Bengio S, Erhan D (2015) Show and tell: a neural image caption generator. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3156–3164. https://doi.org/10.1109/CVPR.2015.7298935
Xu K, Ba J, Kiros R, Cho K, Courville A, Salakhudinov R, Zemel R, Bengio Y (2015) Show, attend and tell: neural image caption generation with visual attention. In: International conference on machine learning, vol 37, pp 2048–2057. PMLR. https://doi.org/10.5555/3045118.3045336
Lu J, Xiong C, Parikh D, Socher R (2017) Knowing when to look: adaptive attention via a visual sentinel for image captioning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 375–383. https://doi.org/10.1109/CVPR.2017.345
Anderson P, He X, Buehler C, Teney D, Johnson M, Gould S, Zhang L (2018) Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6077–6086. https://doi.org/10.1109/CVPR.2018.00636
Feng Y, Lapata M (2010) How many words is a picture worth? Automatic caption generation for news images. In: Proceedings of the 48th annual meeting of the association for computational linguistics, pp 1239–1249. https://doi.org/10.5555/1858681.1858807
Vijay K, Ramya D (2015) Generation of caption selection for news images using stemming algorithm. In: International conference on computation of power, energy, information and communication, pp 0536–0540. IEEE. https://doi.org/10.1109/ICCPEIC.2015.7259513
Lu D, Whitehead S, Huang L, Ji H, Chang S-F (2018) Entity-aware image caption generation. In: Proceedings of the 2018 conference on empirical methods in natural language processing, pp 4013–4023. https://doi.org/10.18653/v1/D18-1435
Yang Z, Okazaki N (2020) Image caption generation for news articles. In: Proceedings of the 28th international conference on computational linguistics, pp 1941–1951. https://doi.org/10.18653/v1/2020.coling-main.176
Jing Y, Zhiwei X, Guanglai G (2020) Context-driven image caption with global semantic relations of the named entities. IEEE Access 8:143584–143594. https://doi.org/10.1109/ACCESS.2020.3013321
Rashtchian C, Young P, Hodosh M, Hockenmaier J (2010) Collecting image annotations using Amazon’s Mechanical Turk. In: Proceedings of the NAACL HLT 2010 workshop on creating speech and language data with Amazon’s Mechanical Turk, pp 139–147. https://doi.org/10.5555/1866696.1866717
Young P, Lai A, Hodosh M, Hockenmaier J (2014) From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. Trans Assoc Comput Linguist 2:67–78. https://doi.org/10.1162/tacl_a_00166
Lin T-Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL (2014) Microsoft coco: common objects in context. In: European conference on computer vision. Springer, pp 740–755. https://doi.org/10.1007/978-3-319-10602-1_48
Yao T, Pan Y, Li Y, Tao M (2018) Exploring visual relationship for image captioning. In: European conference on computer vision. Springer, pp 711–727. https://doi.org/10.1007/978-3-030-01264-9_42
Yang X, Tang K, Zhang H, Cai J (2019) Auto-encoding scene graphs for image captioning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 10685–10694. https://doi.org/10.1109/TPAMI.2020.3042192
Bai S, An S (2018) A survey on automatic image caption generation. Neurocomputing 311:291–304. https://doi.org/10.1016/j.neucom.2018.05.080
Farhadi A, Hejrati M, Sadeghi MA, Young P, Rashtchian C, Hockenmaier J, Forsyth DA (2010) Every picture tells a story: generating sentences from images. In: European conference on computer vision, pp 15–29. https://doi.org/10.1007/978-3-642-15561-1_2
Ordonez V, Kulkarni G, Berg T (2011) Im2text: describing images using 1 million captioned photographs. Adv Neural Inf Process Syst 24:1143–1151. https://doi.org/10.5555/2986459.2986587
Yang Y, Teo C, Daumé III H, Aloimonos Y (2011) Corpus-guided sentence generation of natural images. In: Proceedings of the 2011 conference on empirical methods in natural language processing, pp 444–454. https://doi.org/10.5555/2145432.2145484
Kulkarni G, Premraj V, Ordonez V, Dhar S, Li S, Choi Y, Berg AC, Berg TL (2013) Babytalk: understanding and generating simple image descriptions. IEEE Trans Pattern Anal Mach Intell 35(12):2891–2903. https://doi.org/10.1109/CVPR.2011.5995466
Bai L, Li K, Pei J, Jiang S (2015) Main objects interaction activity recognition in real images. Neural Comput Appl 27:335–348. https://doi.org/10.1007/s00521-015-1846-7
Rennie SJ, Marcheret E, Mroueh Y, Ross J, Goel V (2017) Self-critical sequence training for image captioning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7008–7024. https://doi.org/10.1109/CVPR.2017.131
Wu Q, Shen C, Liu L, Dick A, Van Den Hengel A (2016) What value do explicit high level concepts have in vision to language problems?. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 203–212. https://doi.org/10.1109/CVPR.2016.29
You Q, Jin H, Wang Z, Fang C, Luo J (2016) Image captioning with semantic attention. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4651–4659. https://doi.org/10.1109/CVPR.2016.503
Gan Z, Gan C, He X, Pu Y, Tran K, Gao J, Carin L, Deng L (2017) Semantic compositional networks for visual captioning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5630–5639. https://doi.org/10.1109/CVPR.2017.127
Lu J, Yang J, Batra D, Parikh D (2018) Neural baby talk. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7219–7228. https://doi.org/10.1109/CVPR.2018.00754
Agrawal H, Desai K, Wang Y, Chen X, Jain R, Johnson M, Batra D, Parikh D, Lee S, Anderson P (2019) Nocaps: novel object captioning at scale. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 8948–8957. https://doi.org/10.1109/ICCV.2019.00904
Li L, Gan Z, Cheng Y, Liu J (2019) Relation-aware graph attention network for visual question answering. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 10313–10322. https://doi.org/10.1109/ICCV.2019.01041
Johnson J, Krishna R, Stark M, Li L-J, Shamma D, Bernstein M, Fei-Fei L (2015) Image retrieval using scene graphs. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3668–3678. https://doi.org/10.1109/CVPR.2015.7298990
Gupta N, Jalal A (2020) Integration of textual cues for fine-grained image captioning using deep CNN and LSTM. Neural Comput Appl 32:17899–17908. https://doi.org/10.1007/s00521-019-04515-z
Lu J, Batra D, Parikh D, Lee S (2019) Vilbert: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In: Advances in neural information processing systems, pp 13–23. https://doi.org/10.5555/3454287.3454289
Li G, Duan N, Fang Y, Jiang D, Zhou M (2020) Unicoder-vl: a universal encoder for vision and language by cross-modal pre-training. In: AAAI, pp 11336–11344. https://doi.org/10.1609/aaai.v34i07.6795
Zhou L, Palangi H, Zhang L, Hu H, Corso JJ, Gao J (2019) Unified vision-language pre-training for image captioning and vqa. In: Proceedings of the AAAI conference on artificial intelligence, pp 13041–13049. https://doi.org/10.1609/AAAI.V34I07.7005
Li X, Yin X, Li C, Zhang P, Hu X, Zhang L, Wang L, Hu H, Dong L, Wei F (2020) Oscar: object-semantics aligned pre-training for vision-language tasks. In: Computer vision–ECCV 2020. Lecture notes in computer science, pp 121–137. https://doi.org/10.1007/978-3-030-58577-8_8
Liu M, Hu H, Li L, Yu Y, Guan W (2020) Chinese image caption generation via visual attention and topic modeling. IEEE Trans Cybern 1:1–11. https://doi.org/10.1109/TCYB.2020.2997034
Liu M, Li L, Hu H, Guan W, Tian J (2020) Image caption generation with dual attention mechanism. Inf Process Manag 57(2):102178. https://doi.org/10.1016/j.ipm.2019.102178
Wang B, Wang C, Zhang Q, Su Y, Wang Y, Xu Y (2020) Cross-lingual image caption generation based on visual attention model. IEEE Access 8:104543–104554. https://doi.org/10.1109/ACCESS.2020.2999568
Wu J, Zheng H, Zhao B, Li Y, Yan B, Liang R, Wang W, Zhou S, Lin G, Fu Y, et al (2017) Ai challenger: a large-scale dataset for going deeper in image understanding. In: IEEE international conference on multimedia and expo, pp 1480–1485. https://doi.org/10.1109/ICME.2019.00256
Tran A, Mathews A, Xie L (2020) Transform and tell: entity-aware news image captioning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 13035–13045
Krishna R, Zhu Y, Groth O, Johnson J, Hata K, Kravitz J, Chen S, Kalantidis Y, Li L-J, Shamma DA et al (2017) Visual genome: connecting language and vision using crowdsourced dense image annotations. Int J Comput Vis 123(1):32–73. https://doi.org/10.1007/s11263-016-0981-7
Xu H, Jiang C, Liang X, Li Z (2019) Spatial-aware graph relation network for large-scale object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 9298–9307. https://doi.org/10.1109/CVPR.2019.00952
Feng Y, Lapata M (2012) Automatic caption generation for news images. IEEE Trans Pattern Anal Mach Intell 35(4):797–812. https://doi.org/10.1109/TPAMI.2012.118
Biten AF, Gomez L, Rusinol M, Karatzas D (2019) Good news, everyone! context driven entity-aware captioning for news images. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 12466–12475. https://doi.org/10.1109/CVPR.2019.01275
Ren S, He K, Girshick R, Sun J (2017) Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell 39(6):1137–1149. https://doi.org/10.1109/TPAMI.2016.2577031
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778. https://doi.org/10.1109/CVPR.2016.90
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. In: Advances in neural information processing systems, pp 6000–6010. https://doi.org/10.5555/3295222.3295349
Monti F, Boscaini D, Masci J, Rodola E, Svoboda J, Bronstein MM (2017) Geometric deep learning on graphs and manifolds using mixture model cnns. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5115–5124. https://doi.org/10.1109/CVPR.2017.576
Lin T-Y, Dollár P, Girshick R, He K, Hariharan B, Belongie S (2017) Feature pyramid networks for object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2117–2125. https://doi.org/10.1109/CVPR.2017.106
Krizhevsky A, Sutskever I, Hinton GE (2017) Imagenet classification with deep convolutional neural networks. Commun ACM 60(6):89–90. https://doi.org/10.1145/3065386
Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: 3rd international conference on learning representations, ICLR, pp 1–15
Papineni K, Roukos S, Ward T, Zhu W-J (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the association for computational linguistics, pp 311–318. https://doi.org/10.3115/1073083.1073135
Banerjee S, Lavie A (2005) Meteor: an automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pp 65–72. https://doi.org/10.3115/1626355.1626389
Lin C-Y (2004) Rouge: a package for automatic evaluation of summaries. In: Text summarization branches out: proceedings of the ACL-04 workshop, pp 74–81
Vedantam R, Zitnick CL, Parikh D (2015) Cider: consensus-based image description evaluation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4566–4575. https://doi.org/10.1109/CVPR.2015.7299087
Anderson P, Fernando B, Johnson M, Gould S (2016) Spice: semantic propositional image caption evaluation. In: European conference on computer vision. Springer, pp 382–398. https://doi.org/10.1007/978-3-319-46454-1_24
Ramos J et al (2003) Using tf-idf to determine word relevance in document queries. In: Proceedings of the 1st instructional conference on machine learning, vol 242. Citeseer, pp 29–48. https://doi.org/10.5120/ijca2018917395.
Huang L, Wang W, Chen J, Wei X-Y (2019) Attention on attention for image captioning. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 4634–4643. https://doi.org/10.1109/ICCV.2019.00473
Acknowledgements
This work was supported by the Fundamental Research Funds for the Central Universities [grant number CUC210B018].
Author information
Authors and Affiliations
Contributions
Conceptualization: JZ, YZ, YZ; Methodology: JZ; Validation: JZ; Formal analysis: JZ, YZ; Investigation: JZ, YZ, YZ; Data Curation: JZ, YZ, YZ; Writing-Original Draft: JZ; Visualization: JZ; Resources: YZ, YZ, CY; Writing-Review & Editing: YZ, YZ, CY, HP; Supervision: YZ, YZ, CY; Funding acquisition: YZ; Project administration: YZ, YZ.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest concerning the publication of this manuscript.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhou, J., Zhu, Y., Zhang, Y. et al. Spatial-aware topic-driven-based image Chinese caption for disaster news. Neural Comput & Applic 35, 9481–9500 (2023). https://doi.org/10.1007/s00521-022-08072-w
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00521-022-08072-w