
Performance and Cost Balancing in Vision Transformer-Based Image Captioning

Published: 26 March 2024
DOI: 10.1145/3634814.3634842

ABSTRACT

Image captioning connects computer vision and natural language processing. Many deep learning models have been proposed for this task, including Transformer-based, CLIP-based, and diffusion-based models. However, the primary focus has been on increasing the accuracy of generating human-like descriptions for given images, leading to expensive state-of-the-art (SOTA) models that cannot be deployed on computation-limited devices. To balance performance and cost, we propose a ViT-LSTM model for image captioning that addresses the challenge of long-range dependencies. Our model consists of a ViT encoder pre-trained on the ImageNet-21k dataset, which captures global context, and an LSTM decoder that generates captions reflecting both local and global visual cues. Additionally, we use convolutional layers to reduce feature map dimensionality while preserving spatial relationships and local patterns. This allows the model to understand local details and the relationships between regions, achieving better performance at lower computational cost.
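To make the described pipeline concrete, the sketch below shows one way the ViT-LSTM architecture could be wired together in PyTorch, using the timm library for an ImageNet-21k pre-trained ViT. The specific model name, layer sizes, vocabulary size, and the 1x1 convolution used for dimensionality reduction are illustrative assumptions, not the authors' exact configuration.

    import torch
    import torch.nn as nn
    import timm


    class ViTLSTMCaptioner(nn.Module):
        def __init__(self, vocab_size=10000, embed_dim=256, hidden_dim=512):
            super().__init__()
            # ViT encoder pre-trained on ImageNet-21k (model name assumed; any
            # 224x224 / patch-16 variant with 768-dim tokens works the same way).
            self.vit = timm.create_model(
                "vit_base_patch16_224.augreg_in21k", pretrained=True, num_classes=0
            )
            # 1x1 convolution over the 14x14 patch grid: reduces channel
            # dimensionality while keeping the spatial layout of local patterns.
            self.reduce = nn.Conv2d(768, embed_dim, kernel_size=1)
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.fc = nn.Linear(hidden_dim, vocab_size)

        def forward(self, images, captions):
            # Patch tokens from the ViT; recent timm versions return the full
            # token sequence (B, 197, 768) with the class token first.
            tokens = self.vit.forward_features(images)[:, 1:, :]   # (B, 196, 768)
            b, n, c = tokens.shape
            grid = tokens.transpose(1, 2).reshape(b, c, 14, 14)    # (B, 768, 14, 14)
            feats = self.reduce(grid).flatten(2).mean(dim=2)       # (B, embed_dim)
            # Teacher forcing: prepend the pooled visual feature to the embedded
            # caption tokens so the LSTM is conditioned on the image.
            inputs = torch.cat([feats.unsqueeze(1), self.embed(captions)], dim=1)
            out, _ = self.lstm(inputs)
            return self.fc(out)                                    # (B, T+1, vocab_size)

During training, the logits at each step would be compared against the next ground-truth caption token; at inference, the LSTM would be unrolled greedily or with beam search. The 1x1 convolution stands in for the dimensionality-reduction step described above and could be replaced by any small convolutional stack over the patch grid.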

Published in

ASSE '23: Proceedings of the 2023 4th Asia Service Sciences and Software Engineering Conference
October 2023, 267 pages
ISBN: 9798400708534
DOI: 10.1145/3634814

            Copyright © 2023 ACM

Publisher

Association for Computing Machinery, New York, NY, United States
