
BiTMulV: Bidirectional-Decoding Based Transformer with Multi-view Visual Representation

  • Conference paper
Pattern Recognition and Computer Vision (PRCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13534)


Abstract

Transformer-based image captioning models have made significant progress in generalization performance. However, most methods still suffer from two limitations in practice: 1) they rely heavily on a single region-based visual feature representation, and 2) they do not effectively exploit future semantic information during inference. To address these issues, we introduce a novel bidirectional-decoding based Transformer with multi-view visual representation (BiTMulV) for image captioning. In the encoding stage, we adopt a modular cross-attention block to fuse grid features and region features by virtue of multi-view visual feature representation, which fully exploits both image context information and fine-grained information. In the decoding stage, we design a bidirectional decoding structure, consisting of two parallel and consistent forward and backward decoders, which encourages the model to effectively combine history and future semantics during inference. Experimental results on the MSCOCO dataset demonstrate that our proposal significantly outperforms competitive models, improving the CIDEr metric by 1.5 points.
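The encoder-side fusion described in the abstract can be illustrated with a minimal sketch. Assuming region features from an object detector (e.g. Faster R-CNN) and grid features from a CNN backbone, each view attends to the other through cross-attention before both are passed on to the decoder. All module names, dimensions, and wiring below are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of multi-view cross-attention fusion (illustrative only;
# names, dimensions, and wiring are assumptions, not the paper's code).
import torch
import torch.nn as nn

class CrossViewFusion(nn.Module):
    """Fuses region features and grid features via cross-attention."""
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # Each view queries the other: regions attend to grids and vice versa.
        self.region_to_grid = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.grid_to_region = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_r = nn.LayerNorm(d_model)
        self.norm_g = nn.LayerNorm(d_model)

    def forward(self, region_feats: torch.Tensor, grid_feats: torch.Tensor) -> torch.Tensor:
        # region_feats: (B, N_regions, d_model); grid_feats: (B, N_grid, d_model)
        r, _ = self.region_to_grid(region_feats, grid_feats, grid_feats)
        g, _ = self.grid_to_region(grid_feats, region_feats, region_feats)
        region_out = self.norm_r(region_feats + r)  # residual connection + layer norm
        grid_out = self.norm_g(grid_feats + g)
        # Concatenate the two refined views along the token axis for the decoder.
        return torch.cat([region_out, grid_out], dim=1)

# Usage with random tensors standing in for detector / backbone outputs.
fusion = CrossViewFusion()
regions = torch.randn(2, 36, 512)  # e.g. 36 detected regions per image
grids = torch.randn(2, 49, 512)    # e.g. 7x7 grid of backbone features
fused = fusion(regions, grids)     # shape: (2, 85, 512)
```

On the decoder side, one plausible realization of the bidirectional structure is to run a forward (left-to-right) and a backward (right-to-left) decoder in parallel over the fused visual features and to combine their predictions at inference time; the exact agreement mechanism used in BiTMulV is described in the paper itself.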



Acknowledgement

This work was supported by the National Natural Science Foundation of China (Grant No. 61971421), the Open Fund for Research and Development of Key Technologies of Smart Mines (H7AC200057), and the Postgraduate Research & Practice Innovation Program of Jiangsu Province (KYCX21_2248).

Author information


Corresponding authors

Correspondence to Bing Liu or Peng Liu.


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Yu, Q., Wang, X., Wang, D., Chu, X., Liu, B., Liu, P. (2022). BiTMulV: Bidirectional-Decoding Based Transformer with Multi-view Visual Representation. In: Yu, S., et al. Pattern Recognition and Computer Vision. PRCV 2022. Lecture Notes in Computer Science, vol 13534. Springer, Cham. https://doi.org/10.1007/978-3-031-18907-4_57


  • DOI: https://doi.org/10.1007/978-3-031-18907-4_57

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-18906-7

  • Online ISBN: 978-3-031-18907-4

  • eBook Packages: Computer Science, Computer Science (R0)
