MORTY: Structured Summarization for Targeted Information Extraction from Scholarly Articles

Jaradeh, Mohamad Yaser; Stocker, Markus; Auer, Sören

doi:10.1007/978-3-031-21756-2_23

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13636)

Included in the following conference series:

International Conference on Asian Digital Libraries

Abstract

Information extraction from scholarly articles is a challenging task due to the sizable document length and the implicit information hidden in text, figures, and citations. Scholarly information extraction has various applications in exploration, archival, and curation services for digital libraries and knowledge management systems. We present MORTY, an information extraction technique that creates structured summaries of text from scholarly articles. Our approach condenses the article’s full text into property-value pairs, expressed as a segmented text snippet called a structured summary. We also present a sizable scholarly dataset that combines structured summaries retrieved from a scholarly knowledge graph with the corresponding publicly available scientific articles, which we openly publish as a resource for the research community. Our results show that structured summarization is a suitable approach for targeted information extraction that complements other commonly used methods such as question answering and named entity recognition.
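
To make the notion of a structured summary more concrete, the following minimal Python sketch shows one way property-value pairs could be written out as a single segmented text snippet and parsed back. The abstract does not specify the serialization, so the delimiters (`PAIR_SEP`, `KV_SEP`), the function names, and the example properties are all assumptions chosen for illustration, not the format used by MORTY.

```python
from typing import Dict

# Assumed delimiters; the actual segment markers are not given in the abstract.
PAIR_SEP = " | "  # separates one property-value segment from the next
KV_SEP = ": "     # separates a property from its value


def to_structured_summary(pairs: Dict[str, str]) -> str:
    """Serialize property-value pairs into one segmented text snippet."""
    return PAIR_SEP.join(f"{prop}{KV_SEP}{value}" for prop, value in pairs.items())


def from_structured_summary(summary: str) -> Dict[str, str]:
    """Parse a segmented snippet back into property-value pairs."""
    pairs: Dict[str, str] = {}
    for segment in summary.split(PAIR_SEP):
        prop, sep, value = segment.partition(KV_SEP)
        if sep:  # skip malformed segments that lack a property-value separator
            pairs[prop.strip()] = value.strip()
    return pairs


# Hypothetical example pairs, not taken from the paper's dataset.
example = {
    "research problem": "information extraction",
    "method": "structured summarization",
}
snippet = to_structured_summary(example)
recovered = from_structured_summary(snippet)
```

In this sketch, a text-generation model would be asked to emit a string like `snippet`, and a parser such as `from_structured_summary` would turn it back into pairs. Values that themselves contain the delimiters would need escaping, which is omitted here.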

Notes

1. Code & data (with stats): https://github.com/YaserJaradeh/MORTY.
2. https://www.nlm.nih.gov/databases/download/pubmed_medline.html.
3. Data snapshot was taken on 02.02.2022.
4. Properties that are used solely for information organization and have no semantic value.
5. https://paperswithcode.com/sota/text-summarization-on-pubmed-1.

Author information

Authors and Affiliations

L3S Research Center, Leibniz University, Hannover, Germany
Mohamad Yaser Jaradeh
Leibniz Information Centre for Science and Technology, Hanover, Germany
Markus Stocker & Sören Auer

Authors

Mohamad Yaser Jaradeh
Markus Stocker
Sören Auer

Corresponding author

Correspondence to Mohamad Yaser Jaradeh.

Editor information

Editors and Affiliations

National Taiwan Normal University, Taipei, Taiwan
Yuen-Hsien Tseng
Doshisha University, Kyoto, Japan
Marie Katsurai
VNU University of Engineering and Technology, Hanoi, Vietnam
Hoa N. Nguyen

About this paper

Cite this paper

Jaradeh, M.Y., Stocker, M., Auer, S. (2022). MORTY: Structured Summarization for Targeted Information Extraction from Scholarly Articles. In: Tseng, YH., Katsurai, M., Nguyen, H.N. (eds) From Born-Physical to Born-Virtual: Augmenting Intelligence in Digital Libraries. ICADL 2022. Lecture Notes in Computer Science, vol 13636. Springer, Cham. https://doi.org/10.1007/978-3-031-21756-2_23

DOI: https://doi.org/10.1007/978-3-031-21756-2_23
Published: 07 December 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-21755-5
Online ISBN: 978-3-031-21756-2
eBook Packages: Computer Science, Computer Science (R0)
