
Enhancing N-Gram Based Metrics with Semantics for Better Evaluation of Abstractive Text Summarization

  • Regular Paper
  • Published in: Journal of Computer Science and Technology

Abstract

Text summarization is an important task in natural language processing with many applications. Recently, abstractive summarization has attracted much attention. However, traditional evaluation metrics, which consider little semantic information, are unsuitable for evaluating the quality of deep learning based abstractive summarization models, since these models may generate new words that do not exist in the original text. Moreover, the out-of-vocabulary (OOV) problem, which affects the evaluation results, has not been well solved yet. To address these issues, we propose a novel model, called ENMS, to enhance existing N-gram based evaluation metrics with semantics. Specifically, we present two types of methods, N-gram based Semantic Matching (NSM for short) and N-gram based Semantic Similarity (NSS for short), to improve several widely used evaluation metrics, including ROUGE (Recall-Oriented Understudy for Gisting Evaluation) and BLEU (Bilingual Evaluation Understudy). NSM and NSS work in different ways: the former calculates the matching degree directly, while the latter mainly improves the similarity measurement. Moreover, we propose an N-gram representation mechanism to explore the vector representation of N-grams (including skip-grams). It serves as the basis of our ENMS model, in which we exploit some simple but effective integration methods to solve the OOV problem efficiently. Experimental results on the TAC AESOP dataset show that the metrics improved by our methods correlate well with human judgments and can better evaluate abstractive summarization methods.
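For readers who want a concrete picture of the idea sketched in the abstract, here is a minimal soft-matching sketch in Python, assuming pre-trained word embeddings are available as a dict of NumPy arrays. Everything in it, the function names (ngram_vector, soft_rouge_n), the mean-vector composition of N-grams, the zero-vector OOV fallback, and the similarity threshold, is an illustrative assumption rather than the paper's actual ENMS implementation.

```python
# A minimal, illustrative sketch (not the paper's ENMS code): it composes an
# N-gram vector as the mean of its word vectors and counts a reference N-gram
# as matched when some candidate N-gram is cosine-similar enough. The zero-
# vector OOV fallback and the 0.8 threshold are assumptions for this example.
import numpy as np

def ngram_vector(ngram, embeddings, dim=300):
    """Mean of the word vectors of an n-gram (a tuple of tokens).
    Unknown words fall back to a zero vector to soften the OOV problem."""
    return np.mean([embeddings.get(w, np.zeros(dim)) for w in ngram], axis=0)

def cosine(u, v):
    """Cosine similarity; zero vectors (all-OOV n-grams) score 0."""
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    return 0.0 if nu == 0.0 or nv == 0.0 else float(u @ v) / (nu * nv)

def soft_rouge_n(candidate, reference, embeddings, n=2, threshold=0.8):
    """Recall-oriented score over token lists: the fraction of reference
    n-grams whose best cosine similarity to any candidate n-gram reaches
    the threshold (identical in-vocabulary n-grams match with 1.0)."""
    cand = [tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1)]
    ref = [tuple(reference[i:i + n]) for i in range(len(reference) - n + 1)]
    if not cand or not ref:
        return 0.0
    cand_vecs = [ngram_vector(g, embeddings) for g in cand]
    matched = sum(
        1 for g in ref
        if max(cosine(ngram_vector(g, embeddings), cv) for cv in cand_vecs)
        >= threshold
    )
    return matched / len(ref)
```

With exact string matching replaced by thresholded cosine similarity, paraphrased n-grams such as ("heavy", "rain") and ("strong", "rainfall") can contribute to recall, which is the kind of semantic matching the abstract describes for NSM; a real implementation would also need to handle skip-grams and tune the threshold against human judgments.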



Author information


Corresponding author

Correspondence to Wen-Jun Jiang.

Supplementary Information

ESM 1 (PDF 395 kb)


About this article


Cite this article

He, JW., Jiang, WJ., Chen, GB. et al. Enhancing N-Gram Based Metrics with Semantics for Better Evaluation of Abstractive Text Summarization. J. Comput. Sci. Technol. 37, 1118–1133 (2022). https://doi.org/10.1007/s11390-022-2125-6
