
Domain-Driven and Discourse-Guided Scientific Summarisation

  • Conference paper
  • In: Advances in Information Retrieval (ECIR 2023)

Abstract

Scientific articles tend to follow a standardised discourse that enables a reader to quickly identify and extract useful or important information. We hypothesise that such structural conventions are strongly influenced by the scientific domain (e.g., Computer Science, Chemistry, etc.) and explore this through a novel extractive algorithm that utilises domain-specific discourse information for the task of abstract generation. In addition to being both simple and lightweight, the proposed algorithm constructs summaries in a structured and interpretable manner. In spite of these factors, we show that our approach outperforms strong baselines on the arXiv scientific summarisation dataset in both automatic and human evaluations, confirming that a scientific article’s domain strongly influences its discourse structure and can be leveraged to effectively improve its summarisation. Our code can be found at: https://github.com/TGoldsack1/DodoRank.

T. Goldsack and Z. Zhang—Equal contribution
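
The abstract describes the approach only at a high level. As a purely illustrative sketch (not the published DodoRank algorithm), domain-guided extractive selection can be pictured as filling a per-domain quota of discourse categories with the highest-scoring sentences from each category; the discourse labels, quotas, and salience scores below are hypothetical stand-ins.

from typing import Dict, List

def summarise(sentences: List[str], labels: List[str],
              domain_quota: Dict[str, int], scores: List[float]) -> List[str]:
    # Toy domain-guided extractive selection (illustrative only):
    # for each discourse category, keep the top-scoring sentences up to
    # the quota that a given domain assigns to that category.
    chosen = []
    for category, k in domain_quota.items():
        candidates = [i for i, lab in enumerate(labels) if lab == category]
        candidates.sort(key=lambda i: scores[i], reverse=True)
        chosen.extend(candidates[:k])
    # Restore original article order so the extract reads coherently.
    return [sentences[i] for i in sorted(set(chosen))]

# Hypothetical inputs: discourse labels and salience scores would come from upstream models.
sents = ["Prior work studies X.", "We propose method Y.", "Y improves over X."]
labs = ["background", "method", "result"]
quota = {"method": 1, "result": 1}  # e.g. a domain profile that favours methods and results
sal = [0.3, 0.9, 0.7]
print(summarise(sents, labs, quota, sal))

Nothing here beyond the general idea stated in the abstract is taken from the paper; the actual method is available at the repository linked above.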


Notes

  1. This sample set is also used in §3.2, as indicated in Fig. 2.

  2. Note that we also experimented with both rounding up and rounding to the nearest integer value for Eq. (1), but found that rounding down gave the best performance.

  3. The domain names retrieved correspond to the highest-level categories defined in the arXiv category taxonomy: https://arxiv.org/category_taxonomy.

  4. Increasing or decreasing K (which directly influences the number of sentences in the summaries produced by DodoRank) invariably led to worse average performance, as measured by the metrics described in this section.

  5. All ROUGE calculations are performed using the rouge-score Python package (a minimal usage sketch is given after these notes).

  6. Judges are native English speakers holding bachelor's degrees in scientific disciplines.
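
For reference, a minimal sketch of computing ROUGE with the rouge-score package follows; the example texts are invented and the exact metric variants reported in the paper are not restated here.

from rouge_score import rouge_scorer

# ROUGE-1/2/Lsum with stemming; rougeLsum expects newline-separated sentences.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeLsum"], use_stemmer=True)
reference = "We propose a domain-driven summariser.\nIt outperforms strong baselines."
candidate = "A domain-driven summariser is proposed.\nIt beats strong baselines."
scores = scorer.score(reference, candidate)  # target first, then prediction
for name, result in scores.items():
    print(name, round(result.fmeasure, 4))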


Acknowledgements

This work was supported by the Centre for Doctoral Training in Speech and Language Technologies (SLT) and their Applications funded by UK Research and Innovation [grant number EP/S023062/1].

Author information


Corresponding author

Correspondence to Chenghua Lin.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Goldsack, T., Zhang, Z., Lin, C., Scarton, C. (2023). Domain-Driven and Discourse-Guided Scientific Summarisation. In: Kamps, J., et al. Advances in Information Retrieval. ECIR 2023. Lecture Notes in Computer Science, vol 13980. Springer, Cham. https://doi.org/10.1007/978-3-031-28244-7_23


  • DOI: https://doi.org/10.1007/978-3-031-28244-7_23

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-28243-0

  • Online ISBN: 978-3-031-28244-7

  • eBook Packages: Computer Science, Computer Science (R0)
