Abstract
Recent image captioning models have demonstrated remarkable performance in capturing global semantic information from coarse-grained images and achieving high object coverage in generated captions. When applied to fine-grained images containing heterogeneous object attributes, however, these models often fail to maintain the desired granularity because they attend insufficiently to local content. This paper investigates fine-grained caption generation for person-based images and proposes the Advanced Spectrum Parsing (ASP) model. Specifically, we design a novel spectrum branch that unveils the latent contour features of detected objects in the spectrum domain. We also retain the spatial feature branch employed in existing methods, and use a multi-level feature extraction module to extract both spatial and spectrum features. Furthermore, we optimize these features to learn the spatial-spectrum correlation and complete the feature concatenation via a multi-scale feature fusion module. At inference time, the integrated features enable the model to focus more closely on the local semantic regions of the person in the image. Extensive experimental results demonstrate that on person-based datasets the proposed ASP yields promising captions that are both comprehensive and fine-grained.
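The abstract does not specify how the spectrum branch is implemented; as a rough, non-authoritative illustration of the underlying idea (frequency-domain filtering of a feature map to surface high-frequency contour content, then fusing it with the spatial features), the following NumPy sketch may help. The names `spectrum_branch`, `fuse`, and the `cutoff` parameter are hypothetical and not taken from the paper.

```python
import numpy as np

def spectrum_branch(feature_map, cutoff=0.25):
    """High-pass filter a 2-D feature map in the frequency domain so that
    mostly contour (high-frequency) content remains.

    feature_map: (H, W) array.
    cutoff: fractional size of the low-frequency center that is removed.
    """
    H, W = feature_map.shape
    F = np.fft.fftshift(np.fft.fft2(feature_map))  # low frequencies at center
    # Zero out the low-frequency center block; edges/contours survive.
    cy, cx = H // 2, W // 2
    ry, rx = int(H * cutoff / 2), int(W * cutoff / 2)
    keep = np.ones((H, W), dtype=bool)
    keep[cy - ry:cy + ry, cx - rx:cx + rx] = False
    F_hp = np.where(keep, F, 0)
    # Back to the spatial domain; imaginary residue is numerical noise.
    return np.real(np.fft.ifft2(np.fft.ifftshift(F_hp)))

def fuse(spatial_feat, spectrum_feat):
    # Plain channel stacking as a stand-in for the paper's multi-scale
    # feature fusion module.
    return np.stack([spatial_feat, spectrum_feat], axis=0)

x = np.random.rand(32, 32)          # stand-in for a spatial feature map
fused = fuse(x, spectrum_branch(x))
print(fused.shape)                  # (2, 32, 32)
```

In the actual model the two branches are learned and fused with a dedicated module; this sketch only shows why a frequency-domain view emphasizes contours, which the abstract identifies as the cue for local person regions.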
Data Availability
The datasets analysed during the current study are available in the Image-Text-Embedding and RSTPReid-Dataset repositories, https://github.com/layumi/Image-Text-Embedding/tree/master/dataset/CUHK-PEDES-prepare, https://github.com/NjtechCVLab/RSTPReid-Dataset.
Ethics declarations
Competing Interests
The authors declare that they have no competing interests.
Cite this article
Wu, J., Ni, F., Wang, Z. et al. Fine-grained person-based image captioning via advanced spectrum parsing. Multimed Tools Appl 83, 34015–34030 (2024). https://doi.org/10.1007/s11042-023-16893-7