ABSTRACT
ROUGE has long been a popular metric for evaluating text summarization tasks, as it eliminates time-consuming and costly human evaluation. However, ROUGE is not a fair evaluation metric for the extractive summarization task because it is based entirely on lexical overlap. Additionally, ROUGE ignores the quality of the ranker, which performs the actual sentence/phrase extraction in extractive summarization. The main focus of this thesis is to design an nCG (normalized cumulative gain)-based evaluation metric for extractive summarization, called Sem-nCG, that is both rank-aware and semantic-aware. One fundamental contribution of the work is demonstrating how more reliable semantic-aware ground truths can be generated for evaluating extractive summarization without any additional human intervention. To the best of our knowledge, this work is the first of its kind. Preliminary experimental results demonstrate that the new Sem-nCG metric is indeed semantic-aware and also exhibits higher correlation with human judgment for single-document summarization when a single reference is considered.
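The abstract does not spell out the metric's formula, but the general shape of a normalized-cumulative-gain score for extractive summarization can be sketched as follows. This is a minimal illustration, not the paper's exact definition: it assumes each source sentence has already been assigned a ground-truth semantic gain (e.g., derived from its similarity to the reference summary), and it normalizes the gain accumulated by the system's top-k selected sentences against the best achievable gain from any k sentences. The function name `ncg_at_k` and the toy gain values are hypothetical.

```python
def ncg_at_k(system_ranking, gains, k):
    """Sketch of a normalized cumulative gain (nCG) score at cutoff k.

    system_ranking: indices of source sentences in the order the
        extractor ranked them (best first).
    gains: ground-truth semantic gain of each source sentence, e.g.
        from sentence-level similarity to the reference summary.
    """
    # Cumulative gain of the k sentences the system actually selected.
    cg = sum(gains[i] for i in system_ranking[:k])
    # Ideal cumulative gain: the k largest gains available anywhere
    # in the document (unlike nDCG, plain CG has no position discount).
    ideal_cg = sum(sorted(gains, reverse=True)[:k])
    return cg / ideal_cg if ideal_cg else 0.0

# Toy example: 4 source sentences, extract a 2-sentence summary.
gains = [0.9, 0.1, 0.7, 0.3]          # hypothetical semantic gains
system_ranking = [0, 3, 2, 1]         # extractor's ranking of sentences
print(ncg_at_k(system_ranking, gains, k=2))  # (0.9 + 0.3) / (0.9 + 0.7) = 0.75
```

Because the score depends on which sentences sit in the top k of the system's ranking, it rewards a good ranker directly, rather than only measuring lexical overlap of the final summary text as ROUGE does.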
Index Terms
- Rank-Aware Gain-Based Evaluation of Extractive Summarization