ABSTRACT
Transformer-based approaches to image captioning have achieved great success by exploiting long-term dependency for visual embedding. However, their coarse long-term dependency, which uses multi-head self-attention to capture contextual interactions between visual tokens along the time-step and/or embedded dimensions, fails to distinguish the fine-grained features of local partitions. As a result, similar features are captured repeatedly, leading to feature redundancy that degrades performance. To address this issue, this paper proposes a novel image captioner that embeds visual cross-partition dependency, dubbed CropCap. Specifically, the visual sequence produced by a Swin Transformer-based pre-embedding network is fed into the proposed cross-partition dependency module, which finely models the interactions between partial representations along both the time-step and embedded dimensions. Furthermore, we formally derive the proposed cross-partition dependency and theoretically prove its correctness. Extensive comparisons on the benchmark MS-COCO dataset demonstrate the effectiveness of our method in addressing the information-redundancy issue and verify its superior performance.
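The contrast the abstract draws is between global self-attention, where every token attends to all tokens, and attention across local partitions of the visual sequence. The following is a minimal numpy sketch of that idea, not the paper's actual CropCap module: the sequence is split into partitions along the time-step axis, and each partition attends only to tokens from the other partitions. All function names and the partitioning scheme here are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Standard scaled dot-product attention (Vaswani et al., 2017).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def cross_partition_attention(x, num_partitions):
    """Hypothetical cross-partition interaction: split the visual
    sequence along the time-step axis and let each partition's tokens
    attend to the tokens of the *other* partitions, instead of
    attending globally over all tokens at once."""
    parts = np.split(x, num_partitions, axis=0)
    out = []
    for i, p in enumerate(parts):
        # Keys/values are drawn from the complementary partitions.
        others = np.concatenate(
            [parts[j] for j in range(num_partitions) if j != i], axis=0)
        out.append(attention(p, others, others))
    return np.concatenate(out, axis=0)

# A toy visual sequence: 8 tokens with 16-dim embeddings, 4 partitions.
x = np.random.randn(8, 16)
y = cross_partition_attention(x, 4)
print(y.shape)  # (8, 16): output keeps the input sequence shape
```

An analogous split could be applied along the embedded dimension instead of (or in addition to) the time-step axis, which is the "both dimensions" interaction the abstract refers to.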
Index Terms
- CropCap: Embedding Visual Cross-Partition Dependency for Image Captioning