ABSTRACT
Transformer-based approaches to image captioning have achieved great success by exploiting long-term dependency for visual embedding. However, their coarse long-term dependency, which uses multi-head self-attention to capture contextual interactions between visual tokens along the time-step and/or embedded dimensions, fails to distinguish the fine-grained features of local partitions. As a result, similar features are captured repeatedly, leading to feature redundancy that degrades performance. To address this issue, this paper proposes a novel image captioner that embeds visual cross-partition dependency, dubbed CropCap. Specifically, the visual sequence produced by a Swin Transformer-based pre-embedding network is fed into the proposed cross-partition dependency module, which finely models the interactions between partial representations along both the time-step and embedded dimensions. Furthermore, we formally derive the proposed cross-partition dependency and theoretically prove its correctness. Extensive comparisons on the benchmark MS-COCO dataset demonstrate the effectiveness of our method in addressing the information-redundancy issue and verify its superior performance.
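The contrast the abstract draws is between global self-attention, where every token attends to all tokens, and attention across local partitions of the visual sequence. The following is a minimal numpy sketch of that idea, not the paper's actual CropCap module: the sequence is split into partitions along the time-step axis, and each partition attends only to tokens from the other partitions. All function names and the partitioning scheme here are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Standard scaled dot-product attention (Vaswani et al., 2017).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def cross_partition_attention(x, num_partitions):
    """Hypothetical cross-partition interaction: split the visual
    sequence along the time-step axis and let each partition's tokens
    attend to the tokens of the *other* partitions, instead of
    attending globally over all tokens at once."""
    parts = np.split(x, num_partitions, axis=0)
    out = []
    for i, p in enumerate(parts):
        # Keys/values are drawn from the complementary partitions.
        others = np.concatenate(
            [parts[j] for j in range(num_partitions) if j != i], axis=0)
        out.append(attention(p, others, others))
    return np.concatenate(out, axis=0)

# A toy visual sequence: 8 tokens with 16-dim embeddings, 4 partitions.
x = np.random.randn(8, 16)
y = cross_partition_attention(x, 4)
print(y.shape)  # (8, 16): output keeps the input sequence shape
```

An analogous split could be applied along the embedded dimension instead of (or in addition to) the time-step axis, which is the "both dimensions" interaction the abstract refers to.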
Index Terms
- CropCap: Embedding Visual Cross-Partition Dependency for Image Captioning