
PARDA: A Dataset for Scholarly PDF Document Metadata Extraction Evaluation

  • Conference paper
Collaborative Computing: Networking, Applications and Worksharing (CollaborateCom 2018)

Abstract

Metadata extraction from scholarly PDF documents is fundamental to publishing, archiving, digital library construction, bibliometrics, and the analysis and evaluation of scientific competitiveness. However, scholarly PDF documents differ in layout and document elements, which makes it hard to compare extraction approaches fairly: testers use different sets of source documents, even when the documents come from the same journal or conference. Performance evaluation of extraction approaches on a standard dataset therefore enables a fair and reproducible comparison. In this paper we present such a dataset, PARDA (Pdf Analysis and Recognition DAtaset), for the performance evaluation and analysis of scholarly document processing, especially metadata extraction of fields such as title, authors, affiliations, author-affiliation-email matching, year, and date. The dataset covers computer science, physics, life science, management, mathematics, and the humanities, drawn from publishers including ACM, IEEE, Springer, Elsevier, and arXiv, and each document has a distinct layout and appearance in terms of metadata formatting. We also construct ground-truth metadata for the dataset in Dublin Core XML and BibTeX formats.
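
To make the ground-truth format concrete, the sketch below shows one way an evaluator might load a Dublin Core XML ground-truth record and score a tool's extracted metadata field by field. It is a minimal illustration under stated assumptions, not the authors' released tooling: the file path, the extracted values, and the exact-match scoring rule are all hypothetical.

```python
# Minimal sketch (assumed, not the authors' tooling) of reading a PARDA-style
# Dublin Core XML ground-truth record and scoring a tool's extracted metadata
# by exact match per field. Path, extracted values, and metric are illustrative.
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element namespace


def load_ground_truth(path):
    """Return {field: [values]} from a Dublin Core XML file, e.g. {"title": [...]}."""
    root = ET.parse(path).getroot()
    truth = {}
    for elem in root.iter():
        if elem.tag.startswith("{" + DC_NS + "}"):
            field = elem.tag.split("}")[-1]  # "title", "creator", "date", ...
            truth.setdefault(field, []).append((elem.text or "").strip())
    return truth


def field_accuracy(truth, extracted):
    """Exact-match accuracy per field: fraction of ground-truth values recovered."""
    scores = {}
    for field, gold in truth.items():
        predicted = set(extracted.get(field, []))
        scores[field] = sum(v in predicted for v in gold) / len(gold) if gold else 0.0
    return scores


if __name__ == "__main__":
    truth = load_ground_truth("parda/ground_truth/doc_0001.xml")  # hypothetical path
    extracted = {  # hypothetical output of some extraction tool
        "title": ["A Hypothetical Paper Title"],
        "creator": ["Alice Example", "Bob Example"],
        "date": ["2018"],
    }
    for field, score in sorted(field_accuracy(truth, extracted).items()):
        print(f"{field}: {score:.2f}")
```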


Acknowledgment

The funding support of this work by the Natural Science Fund of China (No. 61472109, No. 61572163, No. 61672200, and No. 61772165) is gratefully acknowledged.

Author information


Corresponding author

Correspondence to Congfeng Jiang.



Copyright information

© 2019 ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering

About this paper

Cite this paper

Fan, T. et al. (2019). PARDA: A Dataset for Scholarly PDF Document Metadata Extraction Evaluation. In: Gao, H., Wang, X., Yin, Y., Iqbal, M. (eds) Collaborative Computing: Networking, Applications and Worksharing. CollaborateCom 2018. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 268. Springer, Cham. https://doi.org/10.1007/978-3-030-12981-1_29

  • DOI: https://doi.org/10.1007/978-3-030-12981-1_29

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-12980-4

  • Online ISBN: 978-3-030-12981-1

  • eBook Packages: Computer Science, Computer Science (R0)
