ABSTRACT
Image captioning lies at the intersection of computer vision and natural language processing. Many deep learning models have been proposed for this task, including Transformer-based, CLIP-based, and diffusion-based approaches. However, the primary focus has been on improving the accuracy of generating human-like descriptions for given images, leading to expensive state-of-the-art models that cannot be deployed on computation-limited devices. To balance performance and cost, we propose a ViT-LSTM model for image captioning that addresses the challenge of long-range dependencies. Our model consists of a ViT encoder pre-trained on the ImageNet-21k dataset, which captures global context, and an LSTM decoder that generates captions reflecting both local and global visual cues. In addition, we use convolutional layers to reduce feature map dimensionality while preserving spatial relationships and local patterns. This allows the model to capture local details and the relationships between regions, achieving competitive accuracy at a lower computational cost.
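The sketch below illustrates one way the described pipeline could be assembled in PyTorch: a pre-trained ViT encoder supplies patch tokens, a 1x1 convolution reduces their channel dimensionality on the patch grid, and an LSTM decodes captions from the reduced visual features. The `timm` checkpoint name, the layer sizes, and the class name `ViTLSTMCaptioner` are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of a ViT-LSTM captioner, assuming PyTorch and the timm library.
import torch
import torch.nn as nn
import timm


class ViTLSTMCaptioner(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 256, hidden_dim: int = 512):
        super().__init__()
        # ViT encoder pre-trained on ImageNet-21k; classifier head removed.
        self.encoder = timm.create_model(
            "vit_base_patch16_224.augreg_in21k", pretrained=True, num_classes=0
        )
        # 1x1 convolution over the 14x14 patch-token grid: reduces channel
        # dimensionality while keeping the spatial layout of the patches.
        self.reduce = nn.Conv2d(self.encoder.embed_dim, embed_dim, kernel_size=1)
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images: torch.Tensor, captions: torch.Tensor) -> torch.Tensor:
        tokens = self.encoder.forward_features(images)     # (B, 1+196, 768)
        patches = tokens[:, 1:, :]                          # drop the CLS token
        b, n, c = patches.shape
        side = int(n ** 0.5)                                 # 14 for 224x224 input
        grid = patches.transpose(1, 2).reshape(b, c, side, side)
        grid = self.reduce(grid)                             # (B, embed_dim, 14, 14)
        # Pool the reduced map into a visual embedding that seeds the LSTM,
        # then feed the shifted caption tokens for teacher forcing.
        visual = grid.flatten(2).mean(dim=2).unsqueeze(1)    # (B, 1, embed_dim)
        words = self.word_embed(captions[:, :-1])            # (B, T-1, embed_dim)
        inputs = torch.cat([visual, words], dim=1)           # (B, T, embed_dim)
        hidden, _ = self.lstm(inputs)
        return self.proj(hidden)                             # (B, T, vocab_size)
```

In such a setup, training would typically minimize token-level cross-entropy between the decoder outputs and the ground-truth captions; the frozen or lightly fine-tuned ViT and the single-layer LSTM keep the trainable parameter count and inference cost modest.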