
Enhancing N-Gram Based Metrics with Semantics for Better Evaluation of Abstractive Text Summarization

  • Regular Paper
  • Published in: Journal of Computer Science and Technology

Abstract

Text summarization is an important task in natural language processing with many applications. Recently, abstractive summarization has attracted much attention. However, traditional evaluation metrics, which consider little semantic information, are unsuitable for evaluating the quality of deep learning based abstractive summarization models, since these models may generate new words that do not exist in the original text. Moreover, the out-of-vocabulary (OOV) problem, which affects the evaluation results, has not been well solved yet. To address these issues, we propose a novel model, called ENMS, to enhance existing N-gram based evaluation metrics with semantics. Specifically, we present two types of methods, N-gram based Semantic Matching (NSM for short) and N-gram based Semantic Similarity (NSS for short), to improve several widely used evaluation metrics, including ROUGE (Recall-Oriented Understudy for Gisting Evaluation) and BLEU (Bilingual Evaluation Understudy). NSM and NSS work in different ways: the former calculates the matching degree directly, while the latter mainly improves the similarity measurement. Moreover, we propose an N-gram representation mechanism to explore the vector representation of N-grams (including skip-grams). It serves as the basis of our ENMS model, in which we exploit some simple but effective integration methods to solve the OOV problem efficiently. Experimental results on the TAC AESOP dataset show that the metrics improved by our methods correlate well with human judgments and can better evaluate abstractive summarization methods.
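For readers who want a concrete picture of the idea sketched in the abstract, here is a minimal soft-matching sketch in Python, assuming pre-trained word embeddings are available as a dict of NumPy arrays. Everything in it, the function names (ngram_vector, soft_rouge_n), the mean-vector composition of N-grams, the zero-vector OOV fallback, and the similarity threshold, is an illustrative assumption rather than the paper's actual ENMS implementation.

```python
# A minimal, illustrative sketch (not the paper's ENMS code): it composes an
# N-gram vector as the mean of its word vectors and counts a reference N-gram
# as matched when some candidate N-gram is cosine-similar enough. The zero-
# vector OOV fallback and the 0.8 threshold are assumptions for this example.
import numpy as np

def ngram_vector(ngram, embeddings, dim=300):
    """Mean of the word vectors of an n-gram (a tuple of tokens).
    Unknown words fall back to a zero vector to soften the OOV problem."""
    return np.mean([embeddings.get(w, np.zeros(dim)) for w in ngram], axis=0)

def cosine(u, v):
    """Cosine similarity; zero vectors (all-OOV n-grams) score 0."""
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    return 0.0 if nu == 0.0 or nv == 0.0 else float(u @ v) / (nu * nv)

def soft_rouge_n(candidate, reference, embeddings, n=2, threshold=0.8):
    """Recall-oriented score over token lists: the fraction of reference
    n-grams whose best cosine similarity to any candidate n-gram reaches
    the threshold (identical in-vocabulary n-grams match with 1.0)."""
    cand = [tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1)]
    ref = [tuple(reference[i:i + n]) for i in range(len(reference) - n + 1)]
    if not cand or not ref:
        return 0.0
    cand_vecs = [ngram_vector(g, embeddings) for g in cand]
    matched = sum(
        1 for g in ref
        if max(cosine(ngram_vector(g, embeddings), cv) for cv in cand_vecs)
        >= threshold
    )
    return matched / len(ref)
```

With exact string matching replaced by thresholded cosine similarity, paraphrased n-grams such as ("heavy", "rain") and ("strong", "rainfall") can contribute to recall, which is the kind of semantic matching the abstract describes for NSM; a real implementation would also need to handle skip-grams and tune the threshold against human judgments.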



Author information


Corresponding author

Correspondence to Wen-Jun Jiang.

Supplementary Information

ESM 1 (PDF 395 kb)


About this article


Cite this article

He, JW., Jiang, WJ., Chen, GB. et al. Enhancing N-Gram Based Metrics with Semantics for Better Evaluation of Abstractive Text Summarization. J. Comput. Sci. Technol. 37, 1118–1133 (2022). https://doi.org/10.1007/s11390-022-2125-6
