
Performance and Cost Balancing in Vision Transformer-Based Image Captioning

Published: 26 March 2024
DOI: 10.1145/3634814.3634842

ABSTRACT

Image captioning connects computer vision and natural language processing. Many deep learning models have been proposed for this task, including Transformer-based, CLIP-based, and diffusion-based models. However, the primary focus has been on increasing the accuracy of generating human-like descriptions for given images, leading to expensive state-of-the-art (SOTA) models that cannot be deployed on computation-limited devices. To balance performance and cost, we propose a ViT-LSTM model for image captioning that addresses the challenge of long-range dependencies. Our model consists of a ViT encoder pre-trained on the ImageNet-21k dataset, which captures global context, and an LSTM decoder that generates captions reflecting both local and global visual cues. Additionally, we use convolutional layers to reduce feature map dimensionality while preserving spatial relationships and local patterns. This allows the model to understand local details and the relationships between regions, achieving better performance at lower computational cost.
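To make the described pipeline concrete, the sketch below shows one way the ViT-LSTM architecture could be wired together in PyTorch, using the timm library for an ImageNet-21k pre-trained ViT. The specific model name, layer sizes, vocabulary size, and the 1x1 convolution used for dimensionality reduction are illustrative assumptions, not the authors' exact configuration.

    import torch
    import torch.nn as nn
    import timm


    class ViTLSTMCaptioner(nn.Module):
        def __init__(self, vocab_size=10000, embed_dim=256, hidden_dim=512):
            super().__init__()
            # ViT encoder pre-trained on ImageNet-21k (model name assumed; any
            # 224x224 / patch-16 variant with 768-dim tokens works the same way).
            self.vit = timm.create_model(
                "vit_base_patch16_224.augreg_in21k", pretrained=True, num_classes=0
            )
            # 1x1 convolution over the 14x14 patch grid: reduces channel
            # dimensionality while keeping the spatial layout of local patterns.
            self.reduce = nn.Conv2d(768, embed_dim, kernel_size=1)
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.fc = nn.Linear(hidden_dim, vocab_size)

        def forward(self, images, captions):
            # Patch tokens from the ViT; recent timm versions return the full
            # token sequence (B, 197, 768) with the class token first.
            tokens = self.vit.forward_features(images)[:, 1:, :]   # (B, 196, 768)
            b, n, c = tokens.shape
            grid = tokens.transpose(1, 2).reshape(b, c, 14, 14)    # (B, 768, 14, 14)
            feats = self.reduce(grid).flatten(2).mean(dim=2)       # (B, embed_dim)
            # Teacher forcing: prepend the pooled visual feature to the embedded
            # caption tokens so the LSTM is conditioned on the image.
            inputs = torch.cat([feats.unsqueeze(1), self.embed(captions)], dim=1)
            out, _ = self.lstm(inputs)
            return self.fc(out)                                    # (B, T+1, vocab_size)

During training, the logits at each step would be compared against the next ground-truth caption token; at inference, the LSTM would be unrolled greedily or with beam search. The 1x1 convolution stands in for the dimensionality-reduction step described above and could be replaced by any small convolutional stack over the patch grid.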

Published in

ASSE '23: Proceedings of the 2023 4th Asia Service Sciences and Software Engineering Conference
October 2023, 267 pages
ISBN: 9798400708534
DOI: 10.1145/3634814

            Copyright © 2023 ACM

Publisher

Association for Computing Machinery, New York, NY, United States
