Abstract
Information extraction from scholarly articles is a challenging task due to the sizable document length and implicit information hidden in text, figures, and citations. Scholarly information extraction has various applications in exploration, archival, and curation services for digital libraries and knowledge management systems. We present MORTY, an information extraction technique that creates structured summaries of text from scholarly articles. Our approach condenses the article’s full-text to property-value pairs as a segmented text snippet called structured summary. We also present a sizable scholarly dataset combining structured summaries retrieved from a scholarly knowledge graph and corresponding publicly available scientific articles, which we openly publish as a resource for the research community. Our results show that structured summarization is a suitable approach for targeted information extraction that complements other commonly used methods such as question answering and named entity recognition.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
Notes
- 1.
Code & data (with stats): https://github.com/YaserJaradeh/MORTY.
- 2.
- 3.
Data snapshot was taken on 02.02.2022.
- 4.
Properties that are used solely for information organization and have no semantic value.
- 5.
References
Beltagy, I., Lo, K., Cohan, A.: SciBERT: a pretrained language model for scientific text. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3615–3620. Association for Computational Linguistics, Hong Kong, China (2019). https://doi.org/10.18653/v1/D19-1371
Beltagy, I., Peters, M.E., Cohan, A.: Longformer: the long-document transformer. arXiv preprint arXiv:2004.05150 (2020)
Chang, C.H., Kayed, M., Girgis, M.R., Shaalan, K.F.: A survey of web information extraction systems. IEEE Trans. Knowl. Data Eng. 18(10), 1411–1428 (2006)
Chua, F.C., Duffy, N.P.: DeepCPCFG: deep learning and context free grammars for end-to-end information extraction. In: Lladós, J., Lopresti, D., Uchida, S. (eds.) Document Analysis and Recognition - ICDAR 2021, pp. 838–853. Springer International Publishing, Cham (2021)
Clement, C.B., Bierbaum, M., O’Keeffe, K.P., Alemi, A.A.: On the use of arxiv as a dataset (2019)
Dasigi, P., Lo, K., Beltagy, I., Cohan, A., Smith, N.A., Gardner, M.: A dataset of information-seeking questions and answers anchored in research papers. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4599–4610. Association for Computational Linguistics (2021). https://doi.org/10.18653/v1/2021.naacl-main.365
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
Etzioni, O., Banko, M., Soderland, S., Weld, D.S.: Open information extraction from the web. Commun. ACM 51(12), 68–74 (2008)
Jaradeh, M.Y., et al.: Open research knowledge graph: next generation infrastructure for semantic scholarly knowledge. In: Proceedings of the 10th International Conference on Knowledge Capture, pp. 243–246 (2019)
Jaradeh, M.Y., Singh, K., Stocker, M., Auer, S.: Triple classification for scholarly knowledge graph completion. In: Proceedings of the 11th on Knowledge Capture Conference, pp. 225–232. K-CAP 2021, Association for Computing Machinery, New York, NY, USA (2021). https://doi.org/10.1145/3460210.3493582
Jeschke, J., et al.: Hi-knowledge, version 2.0. https://hi-knowledge.org/ (2020). Accessed 23 May 2022
Ji, D., Tao, P., Fei, H., Ren, Y.: An end-to-end joint model for evidence information extraction from court record document. Inf. Process. Manage. 57(6), 102305 (2020). https://doi.org/10.1016/j.ipm.2020.102305
Pinheiro, V., Pequeno, T., Furtado, V., Nogueira, D.: Information extraction from text based on semantic inferentialism. In: Andreasen, T., Yager, R.R., Bulskov, H., Christiansen, H., Larsen, H.L. (eds.) FQAS 2009. LNCS (LNAI), vol. 5822, pp. 333–344. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-04957-6_29
Lewis, M., et al.: BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461 (2019)
Liu, Y., Bai, K., Mitra, P., Giles, C.L.: TableSeer: automatic table metadata extraction and searching in digital libraries. In: Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 91–100 JCDL 2007, Association for Computing Machinery, New York, NY, USA (2007). https://doi.org/10.1145/1255175.1255193
Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
Lopez, P.: GROBID: combining automatic bibliographic data recognition and term extraction for scholarship publications. In: Agosti, M., Borbinha, J., Kapidakis, S., Papatheodorou, C., Tsakonas, G. (eds.) ECDL 2009. LNCS, vol. 5714, pp. 473–474. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-04346-8_62
Nakayama, T., Hirai, N., Yamazaki, S., Naito, M.: Adoption of structured abstracts by general medical journals and format for a structured abstract. J. Med. Libr. Assoc. 93(2), 237–242 (2005)
Nasar, Z., Jaffry, S.W., Malik, M.K.: Information extraction from scientific articles: a survey. Scientometrics 117(3), 1931–1990 (2018). https://doi.org/10.1007/s11192-018-2921-5
Palmatier, R.W., Houston, M.B., Hulland, J.: Review articles: purpose, process, and structure. J. Acad. Mark. Sci. 46(1), 1–5 (2018)
Pang, B., Nijkamp, E., Kryściński, W., Savarese, S., Zhou, Y., Xiong, C.: Long document summarization with top-down and bottom-up inference. arXiv preprint arXiv:2203.07586 (2022)
Piskorski, J., Yangarber, R.: Information extraction: past, present and future. In: Poibeau, T., Saggion, H., Piskorski, J., Yangarber, R. (eds) Multi-source, Multilingual Information Extraction and Summarization. Theory and Applications of Natural Language Processing. Springer, Heidelberg(2013). https://doi.org/10.1007/978-3-642-28569-1_2
Radford, A., et al.: Language models are unsupervised multitask learners. OpenAI blog 1(8), 9 (2019)
Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683 (2019)
Rajpurkar, P., Jia, R., Liang, P.: Know what you don’t know: unanswerable questions for squad. arXiv preprint arXiv:1806.03822 (2018)
Ray Choudhury, S., Mitra, P., Giles, C.L.: Automatic extraction of figures from scholarly documents. In: Proceedings of the 2015 ACM Symposium on Document Engineering, pp. 47–50 (2015)
Sang, E.F., De Meulder, F.: Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. arXiv preprint arXiv:cs/0306050 (2003)
Sarawagi, S.: Information extraction. Now Publishers Inc (2008)
Singh, M., et al.: OCR++: a robust framework for information extraction from scholarly articles. CoRR arXiv preprint arXiv:abs/1609.06423 (2016)
Sollaci, L.B., Pereira, M.G.: The introduction, methods, results, and discussion (imrad) structure: a fifty-year survey. J. Med. Libr. Assoc. 92(3), 364–367 (2004). https://pubmed.ncbi.nlm.nih.gov/15243643
Spadaro, G., Tiddi, I., Columbus, S., Jin, S., Teije, A.t., Balliet, D.: The cooperation databank: machine-readable science accelerates research synthesis (2020). https://doi.org/10.31234/osf.io/rveh3
Tahir, N., et al.: FNG-IE: an improved graph-based method for keyword extraction from scholarly big-data. PeerJ Comput. Sci. 7, e389 (2021)
Tas, O., Kiyani, F.: A survey automatic text summarization. PressAcademia Procedia 5(1), 205–213 (2007)
Vaswani, A., et al.: Attention is all you need. CoRR arXiv preprint arXiv:abs/1706.03762 (2017)
Williams, K., Wu, J., Wu, Z., Giles, C.L.: Information extraction for scholarly digital libraries. In: Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries, pp. 287–288 (2016)
Wolf, T., et al.: Transformers: state-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45. Association for Computational Linguistics (2020). www.aclweb.org/anthology/2020.emnlp-demos.6
Xia, F., Wang, W., Bekele, T.M., Liu, H.: Big scholarly data: a survey. IEEE Trans. Big Data 3(1), 18–35 (2017). https://doi.org/10.1109/TBDATA.2016.2641460
Yan, Y., et al.: ProphetNet: predicting future N-Gram for sequence-to-sequence pre-training. arXiv preprint arXiv:2001.04063 (2020)
Yao, X., Van Durme, B.: Information extraction over structured data: question answering with freebase. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (vol. 1: Long Papers), pp. 956–966 (2014)
Zaheer, M., et al.: Big bird: transformers for longer sequences. Adv. Neural. Inf. Process. Syst. 33, 17283–17297 (2020)
Zhang, J., Zhao, Y., Saleh, M., Liu, P.: Pegasus: pre-training with extracted gap-sentences for abstractive summarization. In: International Conference on Machine Learning, pp. 11328–11339. PMLR (2020)
Zhang, P., et al.: TRIE: end-to-end text reading and information extraction for document understanding, pp. 1413–1422. Association for Computing Machinery, New York, NY, USA (2020). https://doi.org/10.1145/3394171.3413900
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Jaradeh, M.Y., Stocker, M., Auer, S. (2022). MORTY: Structured Summarization for Targeted Information Extraction from Scholarly Articles. In: Tseng, YH., Katsurai, M., Nguyen, H.N. (eds) From Born-Physical to Born-Virtual: Augmenting Intelligence in Digital Libraries. ICADL 2022. Lecture Notes in Computer Science, vol 13636. Springer, Cham. https://doi.org/10.1007/978-3-031-21756-2_23
Download citation
DOI: https://doi.org/10.1007/978-3-031-21756-2_23
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-21755-5
Online ISBN: 978-3-031-21756-2
eBook Packages: Computer ScienceComputer Science (R0)