ABSTRACT
The difficulty of automatically extracting metadata from scientific PDF documents depends on the diversity of layouts within the PDF collection. In some disciplines, such as German social sciences, authors are not required to follow a specific template and often create their own, which yields high diversity in appearance across publications. Overcoming this diversity using only Natural Language Processing (NLP) approaches is not always effective, as reflected in the unavailability of metadata for a large portion of German social science publications. We therefore propose a multimodal neural network model that combines NLP with Computer Vision (CV) for metadata extraction from scientific PDF documents. The aim is to benefit from both modalities to increase the overall accuracy of metadata extraction. Extensive experiments with the proposed model on around 8800 documents demonstrated its effectiveness over unimodal models, with an overall F1 score of 92.3%.
Index Terms
- Vision and natural language for metadata extraction from scientific PDF documents: a multimodal approach