DOI: 10.1145/3538641.3561491

End-to-end image captioning based on reduced feature maps of deep learners pre-trained for object detection

Published: 20 October 2022

ABSTRACT

Image caption generation is the task of automatically generating a descriptive sentence for a given image; it combines image processing and natural language processing. Most existing approaches divide the process into two stages: encoding and decoding. We propose an image caption generation method based on this widely used encoder-decoder architecture, with YOLO as the encoder and an LSTM as the decoder. The encoder encodes images into feature maps, and the decoder decodes those feature maps to generate natural language sentences describing the images. The proposed method is compared with other existing methods.
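
To make the encoder-decoder pipeline concrete, the following is a minimal sketch in Python/PyTorch of how such a captioner can be wired together: a frozen, pre-trained convolutional encoder produces feature maps, the maps are reduced by pooling and a linear projection, and the resulting vector initializes an LSTM decoder that emits the caption word by word. This is an illustrative assumption of the setup, not the authors' implementation: the paper's encoder is YOLO, but a torchvision ResNet-50 backbone stands in for it here, and the layer sizes, pooling choice, and vocabulary size are placeholder values.

import torch
import torch.nn as nn
import torchvision


class CaptionModel(nn.Module):
    """Sketch of a caption generator: frozen pre-trained encoder + LSTM decoder."""

    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # Stand-in for a detection-pretrained encoder (the paper uses YOLO);
        # a ResNet-50 backbone is used here purely for illustration.
        backbone = torchvision.models.resnet50(weights="IMAGENET1K_V2")
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])  # keep spatial feature maps
        for p in self.encoder.parameters():
            p.requires_grad = False  # the pre-trained encoder stays frozen

        # "Reduce" the feature maps: global average pooling + linear projection.
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.project = nn.Linear(2048, hidden_dim)

        # LSTM decoder over word embeddings, conditioned on the image vector.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feats = self.encoder(images)            # (B, 2048, H', W') feature maps
        feats = self.pool(feats).flatten(1)     # (B, 2048) reduced representation
        h0 = self.project(feats).unsqueeze(0)   # (1, B, hidden) initial LSTM state
        c0 = torch.zeros_like(h0)
        emb = self.embed(captions)              # (B, T, embed) word embeddings
        hidden, _ = self.lstm(emb, (h0, c0))
        return self.out(hidden)                 # (B, T, vocab) next-word logits


# Example forward pass on dummy data (backbone weights are downloaded on first use).
model = CaptionModel(vocab_size=10000)
images = torch.randn(2, 3, 224, 224)
captions = torch.randint(0, 10000, (2, 12))
print(model(images, captions).shape)  # torch.Size([2, 12, 10000])

In practice the decoder would be trained with teacher forcing against reference captions (e.g., cross-entropy loss over the predicted word distributions), and at inference time the caption would be generated by feeding each predicted word back into the LSTM; those steps are omitted here for brevity.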

Published in

RACS '22: Proceedings of the Conference on Research in Adaptive and Convergent Systems
October 2022
208 pages
ISBN: 9781450393980
DOI: 10.1145/3538641

            Copyright © 2022 ACM

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Qualifiers

            • research-article

            Acceptance Rates

Overall Acceptance Rate: 393 of 1,581 submissions, 25%