Abstract
Recent image captioning models have demonstrated remarkable performance in capturing global semantic information from coarse-grained images and achieving high object coverage in generated captions. When applied to fine-grained images containing heterogeneous object attributes, however, these models often fail to maintain the desired granularity because they attend insufficiently to local content. This paper investigates fine-grained caption generation for person-based images and proposes the Advanced Spectrum Parsing (ASP) model. Specifically, we design a novel spectrum branch that unveils the latent contour features of detected objects in the spectrum domain. We also retain the spatial feature branch employed in existing methods, and use a multi-level feature extraction module to extract both spatial and spectrum features. Furthermore, we optimize these features to learn the spatial-spectrum correlation and complete the feature concatenation via a multi-scale feature fusion module. At inference time, the integrated features enable the model to focus more closely on the local semantic regions of the person in the image. Extensive experimental results demonstrate that on person-based datasets the proposed ASP yields promising captions that are both comprehensive and fine-grained.
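The abstract does not specify how the spectrum branch is implemented; as a rough, non-authoritative illustration of the underlying idea (frequency-domain filtering of a feature map to surface high-frequency contour content, then fusing it with the spatial features), the following NumPy sketch may help. The names `spectrum_branch`, `fuse`, and the `cutoff` parameter are hypothetical and not taken from the paper.

```python
import numpy as np

def spectrum_branch(feature_map, cutoff=0.25):
    """High-pass filter a 2-D feature map in the frequency domain so that
    mostly contour (high-frequency) content remains.

    feature_map: (H, W) array.
    cutoff: fractional size of the low-frequency center that is removed.
    """
    H, W = feature_map.shape
    F = np.fft.fftshift(np.fft.fft2(feature_map))  # low frequencies at center
    # Zero out the low-frequency center block; edges/contours survive.
    cy, cx = H // 2, W // 2
    ry, rx = int(H * cutoff / 2), int(W * cutoff / 2)
    keep = np.ones((H, W), dtype=bool)
    keep[cy - ry:cy + ry, cx - rx:cx + rx] = False
    F_hp = np.where(keep, F, 0)
    # Back to the spatial domain; imaginary residue is numerical noise.
    return np.real(np.fft.ifft2(np.fft.ifftshift(F_hp)))

def fuse(spatial_feat, spectrum_feat):
    # Plain channel stacking as a stand-in for the paper's multi-scale
    # feature fusion module.
    return np.stack([spatial_feat, spectrum_feat], axis=0)

x = np.random.rand(32, 32)          # stand-in for a spatial feature map
fused = fuse(x, spectrum_branch(x))
print(fused.shape)                  # (2, 32, 32)
```

In the actual model the two branches are learned and fused with a dedicated module; this sketch only shows why a frequency-domain view emphasizes contours, which the abstract identifies as the cue for local person regions.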
Data Availability
The datasets analysed during the current study are available in the Image-Text-Embedding and RSTPReid-Dataset repositories, https://github.com/layumi/Image-Text-Embedding/tree/master/dataset/CUHK-PEDES-prepare, https://github.com/NjtechCVLab/RSTPReid-Dataset.
Ethics declarations
Competing Interests
The authors declare that they have no competing interests.
Cite this article
Wu, J., Ni, F., Wang, Z. et al. Fine-grained person-based image captioning via advanced spectrum parsing. Multimed Tools Appl 83, 34015–34030 (2024). https://doi.org/10.1007/s11042-023-16893-7