Abstract
Scientific articles tend to follow a standardised discourse that enables a reader to quickly identify and extract useful or important information. We hypothesise that such structural conventions are strongly influenced by the scientific domain (e.g., Computer Science, Chemistry, etc.) and explore this through a novel extractive algorithm that utilises domain-specific discourse information for the task of abstract generation. In addition to being both simple and lightweight, the proposed algorithm constructs summaries in a structured and interpretable manner. In spite of these factors, we show that our approach outperforms strong baselines on the arXiv scientific summarisation dataset in both automatic and human evaluations, confirming that a scientific article’s domain strongly influences its discourse structure and can be leveraged to effectively improve its summarisation. Our code can be found at: https://github.com/TGoldsack1/DodoRank.
T. Goldsack and Z. Zhang contributed equally.
Notes
- 1.
- 2. Note that we also experimented with both rounding up and rounding to the nearest integer for Eq. (1), but found that rounding down gave the best performance.
- 3. The retrieved domain names correspond to the highest-level categories defined in the arXiv category taxonomy: https://arxiv.org/category_taxonomy.
- 4. Increasing or decreasing K (which directly determines the number of sentences in the summaries produced by DodoRank) invariably led to worse average performance, as measured by the metrics described in this section.
- 5. All ROUGE calculations are performed using the rouge-score Python package.
- 6. Judges are native English speakers holding a bachelor's degree in a scientific discipline.
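Note 2 mentions rounding down in Eq. (1). Assuming Eq. (1) yields a real-valued per-section sentence budget (the exact formula is not reproduced here, and the value below is purely illustrative), the three rounding strategies the authors compared can be sketched as:

```python
import math

# Hypothetical per-section budget from Eq. (1), e.g. K times the
# proportion of abstract sentences drawn from this section.
budget = 3.7

assert math.floor(budget) == 3  # rounding down (reported best-performing)
assert math.ceil(budget) == 4   # rounding up
assert round(budget) == 4       # rounding to the nearest integer
```

Because the budget is later used as a sentence count, rounding down is the conservative choice: it never allocates more sentences to a section than its fractional share supports.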
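Note 5 states that ROUGE is computed with the rouge-score package. For readers unfamiliar with the metric, here is a minimal ROUGE-N sketch: an F-measure over overlapping n-grams between a candidate summary and a reference. This is an illustrative reimplementation, not the package's code; the package additionally handles stemming, tokenisation, and ROUGE-L.

```python
from collections import Counter

def rouge_n(reference, candidate, n=1):
    """ROUGE-N precision, recall, and F1 via clipped n-gram overlap."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    ref, cand = ngrams(reference.split()), ngrams(candidate.split())
    overlap = sum((ref & cand).values())  # per-n-gram counts clipped by min
    p = overlap / max(sum(cand.values()), 1)
    r = overlap / max(sum(ref.values()), 1)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

p, r, f = rouge_n("the cat sat on the mat", "the cat lay on the mat")
# 5 of 6 unigrams overlap in each direction, so p = r = f = 5/6
```

With the rouge-score package itself, the equivalent call is `rouge_scorer.RougeScorer(["rouge1"]).score(reference, candidate)`.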
Acknowledgements
This work was supported by the Centre for Doctoral Training in Speech and Language Technologies (SLT) and their Applications funded by UK Research and Innovation [grant number EP/S023062/1].
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Goldsack, T., Zhang, Z., Lin, C., Scarton, C. (2023). Domain-Driven and Discourse-Guided Scientific Summarisation. In: Kamps, J., et al. Advances in Information Retrieval. ECIR 2023. Lecture Notes in Computer Science, vol 13980. Springer, Cham. https://doi.org/10.1007/978-3-031-28244-7_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-28243-0
Online ISBN: 978-3-031-28244-7
eBook Packages: Computer Science (R0)