DOI: 10.1145/3529372.3533295
short-paper

Vision and natural language for metadata extraction from scientific PDF documents: a multimodal approach

Published: 20 June 2022

ABSTRACT

The difficulty of automatically extracting metadata from scientific PDF documents depends on the diversity of layouts within the PDF collection. In some disciplines, such as the German social sciences, authors are not required to follow a specific template and often create their own, which yields highly diverse layouts across publications. Overcoming this diversity using only Natural Language Processing (NLP) approaches is not always effective, as reflected in the lack of metadata for a large portion of German social science publications. We therefore propose a multimodal neural network model that combines NLP with Computer Vision (CV) for metadata extraction from scientific PDF documents. The aim is to benefit from both modalities to increase the overall accuracy of metadata extraction. Extensive experiments on around 8800 documents demonstrate the effectiveness of the proposed model over unimodal models, with an overall F1 score of 92.3%.
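For readers unfamiliar with the reported metric, the following is a minimal sketch of how an F1 score can be computed from raw counts. The abstract does not state which averaging scheme was used, so the macro average shown here is an assumption, and the field names and counts are hypothetical values chosen only for illustration.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return the harmonic mean of precision and recall from raw counts."""
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical per-field counts: (true positives, false positives, false negatives).
counts = {
    "title": (90, 10, 10),
    "author": (80, 20, 0),
}

# Per-field F1, then an unweighted (macro) average across fields.
per_field = {field: f1_score(*c) for field, c in counts.items()}
macro_f1 = sum(per_field.values()) / len(per_field)

for field, score in per_field.items():
    print(f"{field}: {score:.3f}")
print(f"macro F1: {macro_f1:.3f}")
```

A macro average weights every metadata field equally, whereas a micro average would pool the counts first and so favour frequent fields; which of the two the 92.3% figure corresponds to is not specified here.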

Published in

JCDL '22: Proceedings of the 22nd ACM/IEEE Joint Conference on Digital Libraries
June 2022
392 pages
ISBN: 9781450393454
DOI: 10.1145/3529372

General Chairs: Akiko Aizawa, Thomas Mandl, Zeljko Carevic
Program Chairs: Annika Hinze, Philipp Mayr, Philipp Schaer

Copyright © 2022 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

JCDL '22 Paper Acceptance Rate: 35 of 132 submissions, 27%
Overall Acceptance Rate: 415 of 1,482 submissions, 28%
