
Underwater Image Captioning Based on Feature Fusion

Published: 03 May 2024
DOI: 10.1145/3647649.3647700

ABSTRACT

Image captioning employs artificial intelligence to translate visual content into natural language text descriptions. Underwater image captioning offers specialized interpretation for scenarios such as underwater environmental monitoring, underwater archaeology, and offshore platforms, and it compresses information effectively for the real-time transmission of large volumes of underwater imagery over underwater acoustic communication links. In this article, we annotate an underwater image caption dataset for this task and create a baseline using the encoder-decoder neural image caption model, which outputs complete sentences related to image content. Descriptions of underwater images focus mainly on the underwater scene and its objects. An object detection model based on Faster R-CNN is applied to extract full-image features and regional features corresponding to the targets in the image. For the caption model, we enhance the input features of the language generator by fusing global information, regional details, contextual cues, and the preceding text. This enables the generator to output precise semantic expressions related to salient objects. Applied to the annotated underwater image caption dataset, the method produces more accurate descriptions of underwater targets than sentences generated by a basic neural network model, and higher evaluation-metric scores affirm the effectiveness of our approach.
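To make the fusion step concrete, the following is a minimal PyTorch sketch of a decoder step in the spirit described above. This is not the authors' published code: the module names, tensor shapes, and the use of soft attention over Faster R-CNN region features to form the contextual cue are our assumptions.

```python
import torch
import torch.nn as nn

class FusionCaptioner(nn.Module):
    """Sketch of a feature-fusion caption decoder (hypothetical).

    Assumed inputs per decoding step:
      global_feat:  (B, D)    full-image feature from the detector backbone
      region_feats: (B, R, D) features of R detected regions
      prev_word:    (B,)      token ids of the previously generated word
    """
    def __init__(self, feat_dim=2048, embed_dim=512, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # attention over regions yields the "contextual cues" vector
        self.att = nn.Linear(feat_dim + hidden_dim, 1)
        # fused input: global feature + attended region context + previous-word embedding
        self.lstm = nn.LSTMCell(2 * feat_dim + embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, global_feat, region_feats, prev_word, state):
        h, c = state
        B, R, _ = region_feats.shape
        # score each region feature against the current hidden state
        h_exp = h.unsqueeze(1).expand(B, R, h.size(-1))
        scores = self.att(torch.cat([region_feats, h_exp], dim=-1))   # (B, R, 1)
        weights = torch.softmax(scores, dim=1)
        context = (weights * region_feats).sum(dim=1)                 # (B, D)
        # feature fusion: global info + regional context + pre-ordered text
        x = torch.cat([global_feat, context, self.embed(prev_word)], dim=-1)
        h, c = self.lstm(x, (h, c))
        return self.out(h), (h, c)

# Usage with dummy tensors (shapes assumed, start token id assumed to be 0):
B, R, D, H = 2, 36, 2048, 512
model = FusionCaptioner()
state = (torch.zeros(B, H), torch.zeros(B, H))
logits, state = model(torch.randn(B, D), torch.randn(B, R, D),
                      torch.zeros(B, dtype=torch.long), state)       # logits: (B, vocab)
```

At each step the previous word is embedded, the region features are attended against the current hidden state, and the concatenated vector drives the LSTM: the full-image feature preserves scene-level context while the attended region features ground the caption in the detected underwater targets.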


Published in
ICIGP '24: Proceedings of the 2024 7th International Conference on Image and Graphics Processing, January 2024, 480 pages
ISBN: 9798400716720
DOI: 10.1145/3647649
Copyright © 2024 ACM. Publication rights licensed to ACM.
Publisher: Association for Computing Machinery, New York, NY, United States

