Abstract
Multimodal sentence summarization (MMSS) is a new yet challenging task that aims to generate a concise summary of a long sentence and its corresponding image. Although existing methods have achieved promising results in MMSS, they overlook the powerful generation ability of generative pre-trained language models (GPLMs), which have proven effective in many text generation tasks. To fill this research gap, we propose using GPLMs to promote the performance of MMSS. Notably, adopting GPLMs to solve MMSS inevitably faces two challenges: 1) What fusion strategy should we use to inject visual information into GPLMs properly? 2) How do we keep the GPLM's generation ability intact to the utmost extent when the visual feature is injected into the GPLM? To address these two challenges, we propose a vision-enhanced generative pre-trained language model for MMSS, dubbed Vision-GPLM. In Vision-GPLM, we obtain features of the visual and textual modalities with two separate encoders and utilize a text decoder to produce a summary. In particular, we utilize multi-head attention to fuse the features extracted from the visual and textual modalities, thereby injecting the visual feature into the GPLM. Meanwhile, we train Vision-GPLM in two stages: a vision-oriented pre-training stage and a fine-tuning stage. In the vision-oriented pre-training stage, we train only the visual encoder with the masked language modeling task while the other components are frozen, aiming to obtain homogeneous representations of text and image. In the fine-tuning stage, we train all the components of Vision-GPLM on the MMSS task. Extensive experiments on a public MMSS dataset verify the superiority of our model over existing baselines.
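To make the abstract's two ideas concrete, below is a minimal PyTorch-style sketch of (a) the multi-head attention fusion that injects visual features into the textual representation and (b) the vision-oriented pre-training stage, in which only the visual encoder is updated. All module names, dimensions, and the residual design are illustrative assumptions, not the paper's exact implementation.

    import torch
    import torch.nn as nn

    class VisionTextFusion(nn.Module):
        # Text features attend to visual features via multi-head attention;
        # the attended result is added back to the textual representation.
        def __init__(self, d_model=768, n_heads=12):
            super().__init__()
            self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.norm = nn.LayerNorm(d_model)

        def forward(self, text_feats, vis_feats):
            # Queries come from the text encoder; keys/values from the visual encoder.
            attended, _ = self.cross_attn(text_feats, vis_feats, vis_feats)
            # The residual connection leaves the original textual pathway intact,
            # in the spirit of keeping the GPLM's generation ability.
            return self.norm(text_feats + attended)

    fusion = VisionTextFusion()
    text = torch.randn(2, 20, 768)    # batch of 2 sentences, 20 tokens each
    image = torch.randn(2, 49, 768)   # 2 images, 49 patch features each
    fused = fusion(text, image)       # (2, 20, 768), fed to the text decoder

    # Stage 1 (vision-oriented pre-training): update only the visual encoder;
    # all components would then be unfrozen for stage 2 (fine-tuning).
    # "visual_encoder" / "text_encoder" are hypothetical names; the linear
    # layers stand in for the real vision backbone and GPLM.
    model = nn.ModuleDict({
        "visual_encoder": nn.Linear(2048, 768),
        "text_encoder": nn.Linear(768, 768),
    })
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith("visual_encoder")

Optimizing only the visual encoder against a masked language modeling loss, while everything textual stays frozen, is what pushes the image features toward the text representation space before full fine-tuning.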
Author information
Liqiang Jing received the B. Eng. degree in computer science and technology from the School of Computer Science and Technology, Hefei University of Technology, China in 2020. He is now a master's student in computer technology at the Department of Computer Science and Technology, Shandong University, China.
His research interests include multimodal learning and natural language processing.
Yiren Li received the B. Eng. degree in finance from Hebei University of Economics and Business, China in 2004, and a degree in industry and business administration from Tianjin University, China in 2007. He is currently the deputy general manager of HBIS Group and the chairman of HBIS Digital Technology Co., Ltd., China. Previously, he successively served as the deputy director of the Integrated Management Department of HBIS Group, the director of the Management Innovation Department of HBIS Group, and the strategy director of HBIS Group, China. He has published more than 20 papers.
His research interests include intelligent applications in the iron and steel industry.
Junhao Xu is an undergraduate student in data science and big data technology at the Department of Computer Science and Technology, Shandong University, China.
His research interests include information retrieval and natural language processing.
Yongcan Yu is an undergraduate student in data science and big data technology at the Department of Computer Science and Technology, Shandong University, China.
His research interests include computer vision and recommender systems.
Pei Shen received the B. Eng. degree in computer and application from Hebei University of Science and Technology, China in 2010. He is currently the general manager of HBIS Digital Technology Co., Ltd., China. He is a member of the Steel Standardization Administration of China, the vice chairman of the Smart Enterprise Promotion Committee of the China Enterprise Federation, and a director of the Intelligent Manufacturing Alliance of the Iron and Steel Industry, China.
His research interests include intelligent applications in the iron and steel industry.
Xuemeng Song received the B. Eng. degree in electronic information engineering from University of Science and Technology of China, China in 2012, and the Ph. D. degree in computer science from the School of Computing, National University of Singapore, Singapore in 2016. She is currently an associate professor at Shandong University, China. She has published several papers in top venues, such as ACM SIGIR, MM and TOIS. In addition, she has served as a reviewer for many top conferences and journals.
Her research interests include information retrieval and social network analysis.
Cite this article
Jing, L., Li, Y., Xu, J. et al. Vision Enhanced Generative Pre-trained Language Model for Multimodal Sentence Summarization. Mach. Intell. Res. 20, 289–298 (2023). https://doi.org/10.1007/s11633-022-1372-x