DOI: 10.1145/3581783.3612245

Research Article

CropCap: Embedding Visual Cross-Partition Dependency for Image Captioning

Published: 27 October 2023

ABSTRACT

Transformer-based approaches to image captioning have achieved great success by exploiting long-term dependencies for visual embedding. However, their coarse long-term dependency modeling, which uses the multi-head self-attention mechanism to capture contextual interactions between visual tokens along the time-step and/or embedded dimensions, fails to distinguish the fine-grained features of local partitions. As a result, similar features are captured repeatedly, leading to feature redundancy that degrades performance. To address this issue, this paper proposes a novel image captioner that embeds visual cross-partition dependency, dubbed CropCap. Specifically, the visual sequence generated by a Swin Transformer-based pre-embedding network is fed into the proposed cross-partition dependency module to finely model the interactions between partial representations along both the time-step and embedded dimensions. Furthermore, we derive the proposed cross-partition dependency formally and theoretically prove its correctness. Extensive comparisons on the benchmark MS-COCO dataset demonstrate the effectiveness of our method in addressing the feature redundancy issue and verify its superior performance.
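To make the idea of cross-partition dependency more concrete, below is a minimal, hypothetical PyTorch sketch of how a module of this kind could partition a Swin-derived visual sequence along both the time-step (token) axis and the embedded (channel) axis and model interactions within each partitioned view. The partition counts, the attention form, and the fusion step are assumptions made for illustration only, not the authors' implementation.

```python
# Hypothetical sketch only: the exact CropCap formulation is not given in this
# abstract, so the partitioning scheme, attention, and fusion below are assumed.
import torch
import torch.nn as nn


class CrossPartitionDependency(nn.Module):
    """Toy module: splits a visual token sequence into partitions along both
    the time-step (token) axis and the embedded (channel) axis, applies
    self-attention within each partitioned view, and fuses the two views."""

    def __init__(self, dim: int, num_heads: int = 8,
                 token_parts: int = 4, channel_parts: int = 4):
        super().__init__()
        assert dim % channel_parts == 0, "dim must split evenly into channel partitions"
        self.token_parts = token_parts
        self.channel_parts = channel_parts
        # Attention within each group of tokens (time-step partitions).
        self.token_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Attention within each channel slice (embedded-dimension partitions).
        self.chan_attn = nn.MultiheadAttention(dim // channel_parts, 1, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim), e.g. features from a Swin-based pre-embedding network.
        b, n, d = x.shape
        assert n % self.token_parts == 0, "token count must split evenly into partitions"

        # Time-step partitions: attend inside each consecutive group of tokens.
        t = x.reshape(b * self.token_parts, n // self.token_parts, d)
        t, _ = self.token_attn(t, t, t)
        t = t.reshape(b, n, d)

        # Embedded-dimension partitions: attend inside each channel slice.
        c = x.reshape(b, n, self.channel_parts, d // self.channel_parts)
        c = c.permute(0, 2, 1, 3).reshape(b * self.channel_parts, n, d // self.channel_parts)
        c, _ = self.chan_attn(c, c, c)
        c = c.reshape(b, self.channel_parts, n, -1).permute(0, 2, 1, 3).reshape(b, n, d)

        # Fuse both partitioned views and keep a residual connection.
        return x + self.fuse(torch.cat([t, c], dim=-1))


# Example: a batch of 48 visual tokens with 768-dim embeddings (shapes chosen
# so that both partition counts divide evenly).
feats = torch.randn(2, 48, 768)
out = CrossPartitionDependency(dim=768)(feats)
print(out.shape)  # torch.Size([2, 48, 768])
```

In this sketch the token-axis attention refines dependencies between neighboring visual tokens, while the channel-axis attention treats each embedding slice as its own partition, which is one plausible way to realize the "both time step and embedded dimension" interaction described in the abstract.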


    • Published in

      MM '23: Proceedings of the 31st ACM International Conference on Multimedia
      October 2023
      9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783

      Copyright © 2023 ACM


Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall acceptance rate: 995 of 4,171 submissions, 24%
