Vision Enhanced Generative Pre-trained Language Model for Multimodal Sentence Summarization

  • Research Article
  • Published in Machine Intelligence Research

Abstract

Multimodal sentence summarization (MMSS) is a new yet challenging task that aims to generate a concise summary of a long sentence and its corresponding image. Although existing methods have achieved promising results on MMSS, they overlook the powerful generation ability of generative pre-trained language models (GPLMs), which have been shown to be effective in many text generation tasks. To fill this research gap, we propose using GPLMs to promote the performance of MMSS. Notably, adopting GPLMs for MMSS inevitably faces two challenges: 1) What fusion strategy should we use to properly inject visual information into GPLMs? 2) How can we keep the GPLM's generation ability intact to the utmost extent when the visual feature is injected into the GPLM? To address these two challenges, we propose a vision enhanced generative pre-trained language model for MMSS, dubbed Vision-GPLM. In Vision-GPLM, we obtain features of the visual and textual modalities with two separate encoders and utilize a text decoder to produce the summary. In particular, we utilize multi-head attention to fuse the features extracted from the visual and textual modalities, thereby injecting the visual feature into the GPLM. Meanwhile, we train Vision-GPLM in two stages: a vision-oriented pre-training stage and a fine-tuning stage. In the vision-oriented pre-training stage, we train only the visual encoder via the masked language model task while the other components are frozen, aiming to obtain homogeneous representations of text and image. In the fine-tuning stage, we train all the components of Vision-GPLM on the MMSS task. Extensive experiments on a public MMSS dataset verify the superiority of our model over existing baselines.
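The abstract describes two concrete mechanisms: multi-head attention that fuses visual features into the GPLM's textual stream, and a two-stage schedule that first trains only the visual encoder while everything else is frozen. The PyTorch sketch below illustrates both ideas under stated assumptions; the module names, dimensions, residual design, and the set_stage helper are hypothetical illustrations, not the authors' released implementation.

```python
# Minimal sketch of the two mechanisms described in the abstract:
# (1) multi-head cross-attention that injects visual features into the
#     textual representation, and
# (2) a two-stage schedule that trains only the visual encoder first.
# All names, dimensions, and the residual design are assumptions.
import torch
import torch.nn as nn


class VisionTextFusion(nn.Module):
    """Fuses visual features into text features via cross-attention."""

    def __init__(self, d_model: int = 768, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_feats: torch.Tensor, vis_feats: torch.Tensor) -> torch.Tensor:
        # text_feats: (batch, seq_len, d_model), from the GPLM text encoder.
        # vis_feats:  (batch, n_regions, d_model), from the visual encoder.
        fused, _ = self.cross_attn(query=text_feats, key=vis_feats, value=vis_feats)
        # A residual connection keeps the original textual representation
        # dominant, in line with preserving the GPLM's generation ability.
        return self.norm(text_feats + fused)


def set_stage(model: nn.Module, visual_encoder: nn.Module, stage: str) -> None:
    """Stage 1 trains only the visual encoder; stage 2 unfreezes everything."""
    train_all = stage == "fine-tune"
    for p in model.parameters():
        p.requires_grad = train_all
    if not train_all:  # vision-oriented pre-training stage
        for p in visual_encoder.parameters():
            p.requires_grad = True
```

Under this reading, the masked language model loss in stage 1 would back-propagate only into the visual encoder, pushing image representations toward the GPLM's textual embedding space before full fine-tuning on the MMSS task.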



Author information


Corresponding authors

Correspondence to Yiren Li or Xuemeng Song.

Additional information

Liqiang Jing received the B. Eng. degree in computer science and technology from the School of Computer Science and Technology, Hefei University of Technology, China in 2020. He is currently a master's student in computer technology at the Department of Computer Science and Technology, Shandong University, China.

His research interests include multimodal learning and natural language processing.

Yiren Li received the B. Eng. degree in finance from Hebei University of Economics and Business, China in 2004, and a degree in industry and business administration from Tianjin University, China in 2007. He is currently the deputy general manager of HBIS Group and the chairman of HBIS Digital Technology Co., Ltd., China. Previously, he successively served as the deputy director of the Integrated Management Department of HBIS Group, director of the Management Innovation Department of HBIS Group, and strategy director of HBIS Group, China. He has published more than 20 papers.

His research interests include intelligent applications in the iron and steel industry.

Junhao Xu is an undergraduate student in data science and big data technology at the Department of Computer Science and Technology, Shandong University, China.

His research interests include information retrieval and natural language processing.

Yongcan Yu is an undergraduate student in data science and big data technology at the Department of Computer Science and Technology, Shandong University, China.

His research interests include computer vision and recommender systems.

Pei Shen received the B. Eng. degree in computer and application from Hebei University of Science and Technology, China in 2010. He is currently the general manager of HBIS Digital Technology Co., Ltd., China. He is a member of the technical committee on steel of the Standardization Administration of China, vice chairman of the Smart Enterprise Promotion Committee of the China Enterprise Federation, and a director of the Intelligent Manufacturing Alliance of the Iron and Steel Industry, China.

His research interests include intelligent applications in the iron and steel industry.

Xuemeng Song received the B. Eng. degree in electronic information engineering from the University of Science and Technology of China, China in 2012, and the Ph. D. degree in computer science from the School of Computing, National University of Singapore, Singapore in 2016. She is currently an associate professor at Shandong University, China. She has published several papers in top venues, such as ACM SIGIR, MM, and TOIS. In addition, she has served as a reviewer for many top conferences and journals.

Her research interests include information retrieval and social network analysis.


About this article


Cite this article

Jing, L., Li, Y., Xu, J. et al. Vision Enhanced Generative Pre-trained Language Model for Multimodal Sentence Summarization. Mach. Intell. Res. 20, 289–298 (2023). https://doi.org/10.1007/s11633-022-1372-x
