Abstract
Transformer-based image captioning models have made significant progress in generalization performance. However, most methods still suffer from two limitations in practice: 1) they rely heavily on a single region-based visual feature representation, and 2) they do not effectively exploit future semantic information during inference. To address these issues, we introduce a novel bidirectional-decoding based Transformer with multi-view visual representation (BiTMulV) for image captioning. In the encoding stage, we adopt a modular cross-attention block to fuse grid features and region features into a multi-view visual representation, fully exploiting both global image context and fine-grained detail. In the decoding stage, we design a bidirectional decoding structure, consisting of two parallel and structurally consistent forward and backward decoders, which enables the model to combine historical and future semantics during inference. Experimental results on the MSCOCO dataset demonstrate that our model significantly outperforms competitive baselines, improving the CIDEr metric by 1.5 points.
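To make the two components described above concrete, the following is a minimal, hypothetical PyTorch sketch, not the authors' released implementation: the module names (CrossAttentionFusion, BidirectionalDecoder), the choice of regions as queries over grid keys/values, and all dimensions and hyperparameters are our assumptions.

```python
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    """Fuse region features (queries) with grid features (keys/values)."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, region_feats, grid_feats):
        # region_feats: (B, N_regions, d); grid_feats: (B, N_grid, d)
        attended, _ = self.cross_attn(region_feats, grid_feats, grid_feats)
        # Residual connection + LayerNorm, as in a standard Transformer block.
        return self.norm(region_feats + attended)


class BidirectionalDecoder(nn.Module):
    """Two structurally identical decoders: one reads the caption
    left-to-right, the other right-to-left (positional encodings
    omitted for brevity)."""

    def __init__(self, vocab_size, d_model=512, n_heads=8, n_layers=3):
        super().__init__()
        self.fwd = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True), n_layers)
        self.bwd = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True), n_layers)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, memory):
        # tokens: (B, T) caption ids; memory: (B, N, d) fused visual features.
        T = tokens.size(1)
        causal = torch.triu(
            torch.full((T, T), float("-inf"), device=tokens.device), diagonal=1)
        fwd_h = self.fwd(self.embed(tokens), memory, tgt_mask=causal)
        # The backward decoder sees the reversed caption; its logits are
        # flipped back so both heads score the same target positions.
        bwd_h = self.bwd(self.embed(tokens.flip(1)), memory, tgt_mask=causal)
        return self.proj(fwd_h), self.proj(bwd_h).flip(1)
```

Under these assumptions, inference could, for example, average the forward and backward log-probabilities per position; the abstract does not specify how BiTMulV combines the two decoders' outputs.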
Acknowledgement
This work was supported by the National Natural Science Foundation of China (Grant No. 61971421), the Open Fund for Research and Development of Key Technologies of Smart Mines (H7AC200057), and the Postgraduate Research & Practice Innovation Program of Jiangsu Province (KYCX21_2248).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Yu, Q., Wang, X., Wang, D., Chu, X., Liu, B., Liu, P. (2022). BiTMulV: Bidirectional-Decoding Based Transformer with Multi-view Visual Representation. In: Yu, S., et al. Pattern Recognition and Computer Vision. PRCV 2022. Lecture Notes in Computer Science, vol 13534. Springer, Cham. https://doi.org/10.1007/978-3-031-18907-4_57
DOI: https://doi.org/10.1007/978-3-031-18907-4_57
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-18906-7
Online ISBN: 978-3-031-18907-4
eBook Packages: Computer Science, Computer Science (R0)