MORTY: Structured Summarization for Targeted Information Extraction from Scholarly Articles

  • Conference paper
From Born-Physical to Born-Virtual: Augmenting Intelligence in Digital Libraries (ICADL 2022)

Abstract

Information extraction from scholarly articles is challenging because of the sizable document length and the information implicitly conveyed in text, figures, and citations. Scholarly information extraction has applications in exploration, archival, and curation services for digital libraries and knowledge management systems. We present MORTY, an information extraction technique that creates structured summaries of text from scholarly articles. Our approach condenses an article's full text into property-value pairs, rendered as a segmented text snippet called a structured summary. We also present a sizable scholarly dataset that combines structured summaries retrieved from a scholarly knowledge graph with the corresponding publicly available scientific articles, which we openly publish as a resource for the research community. Our results show that structured summarization is a suitable approach for targeted information extraction, complementing commonly used methods such as question answering and named entity recognition.
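The abstract describes serializing an article's property-value pairs into a single segmented text snippet that a sequence-to-sequence model can be trained to generate. The sketch below illustrates one plausible serialization of that kind; the delimiter tokens and example properties are illustrative assumptions, not the paper's actual scheme.

```python
# Hypothetical serialization of a "structured summary": property-value pairs
# flattened into one segmented text snippet, plus the inverse parse.
# PAIR_SEP and FIELD_SEP are assumed delimiters, not taken from the paper.

PAIR_SEP = " <sep> "   # separates property-value pairs in the snippet
FIELD_SEP = " : "      # separates a property from its value

def serialize(pairs):
    """Flatten property-value pairs into a single segmented text snippet."""
    return PAIR_SEP.join(f"{prop}{FIELD_SEP}{value}" for prop, value in pairs)

def deserialize(snippet):
    """Recover the property-value pairs from a segmented text snippet."""
    pairs = []
    for segment in snippet.split(PAIR_SEP):
        prop, _, value = segment.partition(FIELD_SEP)
        pairs.append((prop.strip(), value.strip()))
    return pairs

# Example properties are made up for illustration.
summary = serialize([
    ("research problem", "scholarly information extraction"),
    ("method", "structured summarization"),
])
print(summary)
```

A model trained on such targets can be decoded back into structured form with a simple parse, which is what makes the summary "structured" rather than free text.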


Notes

  1. Code & data (with stats): https://github.com/YaserJaradeh/MORTY.

  2. https://www.nlm.nih.gov/databases/download/pubmed_medline.html.

  3. Data snapshot was taken on 02.02.2022.

  4. Properties that are used solely for information organization and have no semantic value.

  5. https://paperswithcode.com/sota/text-summarization-on-pubmed-1.


Author information

Corresponding author: Mohamad Yaser Jaradeh.


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Jaradeh, M.Y., Stocker, M., Auer, S. (2022). MORTY: Structured Summarization for Targeted Information Extraction from Scholarly Articles. In: Tseng, YH., Katsurai, M., Nguyen, H.N. (eds) From Born-Physical to Born-Virtual: Augmenting Intelligence in Digital Libraries. ICADL 2022. Lecture Notes in Computer Science, vol 13636. Springer, Cham. https://doi.org/10.1007/978-3-031-21756-2_23

  • DOI: https://doi.org/10.1007/978-3-031-21756-2_23

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-21755-5

  • Online ISBN: 978-3-031-21756-2

  • eBook Packages: Computer Science, Computer Science (R0)
