research-article

End-to-end image captioning based on reduced feature maps of deep learners pre-trained for object detection

Authors:
Yan Lyu

The University of Aizu, Aizu-Wakamatsu, Fukushima, Japan

The University of Aizu, Aizu-Wakamatsu, Fukushima, Japan
View Profile

,
Qiangfu Zhao

The University of Aizu, Aizu-Wakamatsu, Fukushima, Japan

The University of Aizu, Aizu-Wakamatsu, Fukushima, Japan
View Profile

,
Yong Liu

The University of Aizu, Aizu-Wakamatsu, Fukushima, Japan

The University of Aizu, Aizu-Wakamatsu, Fukushima, Japan
View Profile

RACS '22: Proceedings of the Conference on Research in Adaptive and Convergent SystemsOctober 2022Pages 71–76https://doi.org/10.1145/3538641.3561491

Published:20 October 2022Publication History

RACS '22: Proceedings of the Conference on Research in Adaptive and Convergent Systems

Pages 71–76

ABSTRACT

Image caption generation is a task to generate a descriptive sentence automatically for a given image which combines image process and natural language process together. Most existing approaches divided the whole process into two parts, encoding and decoding. Our research proposed an image caption generation method based on the most popular architecture with the YOLO as encoder and the LSTM as decoder. The encoder is used to encode images into feature vectors while the decoder is used to decode the feature maps for natural language sentences generation to describe the images. The proposed method is compared with other existing methods.

References

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. In International conference on machine learning. PMLR, 1597--1607.Google Scholar
Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. 2020. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297 (2020).Google Scholar
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).Google Scholar
MD Zakir Hossain, Ferdous Sohel, Mohd Fairuz Shiratuddin, and Hamid Laga. 2019. A comprehensive survey of deep learning for image captioning. ACM Computing Surveys (CsUR) 51, 6 (2019), 1--36.Google ScholarDigital Library
Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition. 3128--3137.Google ScholarCross Ref
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems 25 (2012).Google Scholar
Xuelong Li, Aihong Yuan, and Xiaoqiang Lu. 2019. Vision-to-language tasks based on attributes and attention mechanism. IEEE transactions on cybernetics 51, 2 (2019), 913--926.Google ScholarCross Ref
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In European conference on computer vision. Springer, 740--755.Google ScholarCross Ref
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019).Google Scholar
Yunpeng Luo, Jiayi Ji, Xiaoshuai Sun, Liujuan Cao, Yongjian Wu, Feiyue Huang, Chia-Wen Lin, and Rongrong Ji. 2021. Dual-level collaborative transformer for image captioning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 2286--2293.Google ScholarCross Ref
Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, and Alan Yuille. 2014. Deep captioning with multimodal recurrent neural networks (m-rnn). arXiv preprint arXiv:1412.6632 (2014).Google Scholar
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics. 311--318.Google ScholarDigital Library
Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. 2018. Improving language understanding by generative pre-training. (2018).Google Scholar
Cyrus Rashtchian, Peter Young, Micah Hodosh, and Julia Hockenmaier. 2010. Collecting image annotations using amazon's mechanical turk. In Proceedings of the NAACL HLT 2010 workshop on creating speech and language data with Amazon's Mechanical Turk. 139--147.Google ScholarDigital Library
Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. 2016. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition. 779--788.Google ScholarCross Ref
Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).Google Scholar
Raimonda Staniūtė and Dmitrij Šešok. 2019. A systematic literature review on image captioning. Applied Sciences 9, 10 (2019), 2024.Google ScholarCross Ref
Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. 2019. Ernie: Enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223 (2019).Google Scholar
Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Hao Tian, Hua Wu, and Haifeng Wang. 2020. Ernie 2.0: A continual pre-training framework for language understanding. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 8968--8975.Google ScholarCross Ref
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition. 1--9.Google ScholarCross Ref
Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition. 3156--3164.Google ScholarCross Ref
Qi Wu, Chunhua Shen, Lingqiao Liu, Anthony Dick, and Anton Van Den Hengel. 2016. What value do explicit high level concepts have in vision to language problems?. In Proceedings of the IEEE conference on computer vision and pattern recognition. 203--212.Google ScholarCross Ref
Qi Wu, Chunhua Shen, Peng Wang, Anthony Dick, and Anton Van Den Hengel. 2017. Image captioning and visual question answering based on attributes and external knowledge. IEEE transactions on pattern analysis and machine intelligence 40, 6 (2017), 1367--1381.Google Scholar
Miao Xin, Hong Zhang, Ding Yuan, and Mingui Sun. 2017. Learning discriminative action and context representations for action recognition in still images. In 2017 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 757--762.Google ScholarCross Ref
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. Advances in neural information processing systems 32 (2019).Google Scholar
Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics 2 (2014), 67--78.Google ScholarCross Ref
Matthew D Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In European conference on computer vision. Springer, 818--833.Google ScholarCross Ref

Index Terms

End-to-end image captioning based on reduced feature maps of deep learners pre-trained for object detection
1. Computer systems organization
  1. Dependable and fault-tolerant systems and networks
    1. Redundancy
  2. Embedded and cyber-physical systems
    1. Embedded systems
    2. Robotics
2. Networks
  1. Network properties
    1. Network reliability

Recommendations

A Comprehensive Survey of Deep Learning for Image Captioning

Generating a description of an image is called image captioning. Image captioning requires recognizing the important objects, their attributes, and their relationships in an image. It also needs to generate syntactically and semantically correct ...
Read More
Image Captioning with Deep Bidirectional LSTMs
MM '16: Proceedings of the 24th ACM international conference on Multimedia

This work presents an end-to-end trainable deep bidirectional LSTM (Long-Short Term Memory) model for image captioning. Our model builds on a deep convolutional neural network (CNN) and two separate LSTM networks. It is capable of learning long term ...
Read More
Deep image captioning using an ensemble of CNN and LSTM based deep neural networks
Ethical Computational Intelligence for Cyber Market

The paper is concerned with the problem of Image Caption Generation. The purpose of this paper is to create a deep learning model to generate captions for a given image by decoding the information available in the image. For this purpose, a custom ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
RACS '22: Proceedings of the Conference on Research in Adaptive and Convergent Systems
October 2022
208 pages
ISBN:9781450393980
DOI:10.1145/3538641
Conference Chair:
Peng Li
The University of Aizu, Japan
,
Program Chairs:
Junyoung Heo
Hansung University, Korea
,
Tomas Cerny
Baylor University
Copyright © 2022 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 20 October 2022
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
LSTM
darknet
end-to-end
image captioning
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate393of1,581submissions,25%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 0
  Total Citations
  View Citations
- 35
  Total Downloads
- Downloads (Last 12 months)23
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
This publication has not been cited yet

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

End-to-end image captioning based on reduced feature maps of deep learners pre-trained for object detection

RACS '22: Proceedings of the Conference on Research in Adaptive and Convergent Systems

ABSTRACT

References

Cited By

Index Terms

Recommendations

A Comprehensive Survey of Deep Learning for Image Captioning

Image Captioning with Deep Bidirectional LSTMs

Deep image captioning using an ensemble of CNN and LSTM based deep neural networks

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media