short-paper

Vision and natural language for metadata extraction from scientific PDF documents: a multimodal approach

Authors:
Zeyd Boukhers, University of Koblenz-Landau, Germany
Azeddine Bouabdallah, University of Koblenz-Landau, Germany

JCDL '22: Proceedings of the 22nd ACM/IEEE Joint Conference on Digital Libraries, June 2022, Article No. 6, Pages 1–5. https://doi.org/10.1145/3529372.3533295

Published: 20 June 2022

ABSTRACT

The difficulty of automatically extracting metadata from scientific PDF documents depends on the diversity of layouts within the PDF collection. In some disciplines, such as the German social sciences, authors are not required to follow a specific template and often create their own, which yields high visual diversity across publications. Overcoming this diversity using only Natural Language Processing (NLP) approaches is not always effective, as reflected in the unavailability of metadata for a large portion of German social science publications. Therefore, we propose a multimodal neural network model that combines NLP with Computer Vision (CV) for metadata extraction from scientific PDF documents. The aim is to benefit from both modalities to increase the overall accuracy of metadata extraction. Extensive experiments with the proposed model on around 8800 documents demonstrated its effectiveness over unimodal models, with an overall F1 score of 92.3%.
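
The abstract does not say how the two modalities are combined, so the following is only an illustrative sketch and not the authors' architecture. It assumes a simple late-fusion scheme in which a text model and a vision model each output per-class probabilities for a document region and the two are averaged with a weight. The label set, the function names `late_fusion` and `f1_score`, and the `text_weight` parameter are all hypothetical. The second function shows how an F1 score such as the one reported is computed from counts.

```python
from typing import Dict

# Hypothetical metadata classes, chosen for illustration only.
LABELS = ["title", "author", "abstract", "other"]


def late_fusion(text_probs: Dict[str, float],
                vision_probs: Dict[str, float],
                text_weight: float = 0.5) -> Dict[str, float]:
    """Weighted average of two per-class probability distributions.

    Assumption: both inputs cover every label in LABELS. This is a
    generic late-fusion scheme, not the method described in the paper.
    """
    if not 0.0 <= text_weight <= 1.0:
        raise ValueError("text_weight must be in [0, 1]")
    fused = {
        label: text_weight * text_probs[label]
        + (1.0 - text_weight) * vision_probs[label]
        for label in LABELS
    }
    total = sum(fused.values())
    # Renormalise so the fused scores sum to 1.
    return {label: score / total for label, score in fused.items()}


def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when undefined."""
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_pos / predicted
    recall = true_pos / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up scores for one text block on a first page.
    text_scores = {"title": 0.4, "author": 0.4, "abstract": 0.1, "other": 0.1}
    vision_scores = {"title": 0.8, "author": 0.1, "abstract": 0.05, "other": 0.05}
    fused_scores = late_fusion(text_scores, vision_scores)
    print(max(fused_scores, key=fused_scores.get))
```

In this made-up example the text model alone cannot decide between "title" and "author", while the visual cue tips the fused prediction towards "title". That is the kind of complementarity a multimodal model aims to exploit.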

Index Terms

1. Applied computing
  1. Document management and text processing
2. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Supervised learning
        1. Supervised learning by regression
    2. Machine learning approaches
      1. Neural networks

Published in
JCDL '22: Proceedings of the 22nd ACM/IEEE Joint Conference on Digital Libraries
June 2022
392 pages
ISBN: 9781450393454
DOI: 10.1145/3529372
General Chairs: Akiko Aizawa, National Institute of Informatics, Japan; Thomas Mandl, University of Hildesheim, Germany; Zeljko Carevic, GESIS - Leibniz Institute for the Social Sciences, Germany
Program Chairs: Annika Hinze, University of Waikato, New Zealand; Philipp Mayr, GESIS - Leibniz Institute for the Social Sciences, Germany; Philipp Schaer, TH Köln (University of Applied Sciences), Germany
Copyright © 2022 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
CV
NLP
metadata extraction
multimodal ML
Qualifiers
- short-paper
Acceptance Rates
JCDL '22 Paper Acceptance Rate: 35 of 132 submissions, 27%
Overall Acceptance Rate: 415 of 1,482 submissions, 28%