ABSTRACT
Most scientific articles are distributed in PDF format. The PDF standard allows metadata to be embedded in the document itself. However, many authors do not fill in this information, so the embedded metadata is often unreliable or incomplete. This has motivated research on extracting metadata automatically, a task that has been identified as one of the most challenging in document engineering. This work proposes Artic, a method for metadata extraction from scientific papers that employs a two-layer probabilistic framework based on Conditional Random Fields. The first layer identifies the main sections that contain metadata, and the second layer finds, for each section, the corresponding metadata fields. Given a PDF file containing a scientific paper, Artic extracts the title, author names, emails, affiliations, and venue information. We report on experiments using 100 real papers from a variety of publishers. Our results outperform the state-of-the-art system used as the baseline, achieving a precision of over 99%.
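The two-layer idea can be illustrated with a small sketch. This is not the authors' implementation: it assumes text lines have already been extracted from the PDF, and it replaces trained Conditional Random Fields with hand-set scores, keeping only the Viterbi decoding step that a linear-chain CRF uses at inference time. The label sets, feature rules, weights, and sample lines below are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Transitions = Dict[Tuple[str, str], float]


def viterbi(observations: Sequence[str], labels: Sequence[str],
            emission: Callable[[str, str], float],
            transition: Transitions) -> List[str]:
    """Return the highest-scoring label sequence of a linear-chain model."""
    if not observations:
        return []
    # score[i][y]: best score of any labelling of observations[:i+1] ending in y
    score = [{y: emission(observations[0], y) for y in labels}]
    back = []
    for obs in observations[1:]:
        row, ptr = {}, {}
        for y in labels:
            prev = max(labels,
                       key=lambda p: score[-1][p] + transition.get((p, y), 0.0))
            row[y] = score[-1][prev] + transition.get((prev, y), 0.0) + emission(obs, y)
            ptr[y] = prev
        score.append(row)
        back.append(ptr)
    path = [max(labels, key=lambda y: score[-1][y])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


# Layer 1 (hypothetical labels and weights): assign each line to a section.
SECTION_LABELS = ("title", "author_block", "body")
SECTION_TRANSITIONS: Transitions = {
    ("title", "author_block"): 1.0, ("author_block", "author_block"): 1.0,
    ("author_block", "body"): 1.0, ("body", "body"): 1.0,
    ("author_block", "title"): -5.0, ("body", "title"): -5.0,
    ("body", "author_block"): -5.0,
}


def line_score(line: str, label: str) -> float:
    short = len(line.split()) <= 3
    sentence = line.endswith(".")
    if label == "body":
        return 2.0 if sentence else 0.0
    if label == "author_block":
        return 1.5 if short or "@" in line else 0.0
    return 1.0 if not sentence and not short else 0.0  # title


# Layer 2 (hypothetical labels and weights): label lines inside the author block.
FIELD_LABELS = ("author", "affiliation", "email")
FIELD_TRANSITIONS: Transitions = {
    ("author", "affiliation"): 0.5, ("affiliation", "email"): 0.5,
}
AFFILIATION_WORDS = ("University", "Institute", "Department")


def field_score(line: str, label: str) -> float:
    if label == "email":
        return 3.0 if "@" in line else 0.0
    if label == "affiliation":
        return 2.0 if any(word in line for word in AFFILIATION_WORDS) else 0.0
    return 1.0 if "@" not in line and line.istitle() else 0.0  # author


def extract(lines: Sequence[str]) -> Dict[str, object]:
    """Run layer 1 over all lines, then layer 2 over the author block only."""
    sections = viterbi(lines, SECTION_LABELS, line_score, SECTION_TRANSITIONS)
    block = [ln for ln, sec in zip(lines, sections) if sec == "author_block"]
    fields = viterbi(block, FIELD_LABELS, field_score, FIELD_TRANSITIONS)
    result: Dict[str, object] = {
        "title": " ".join(ln for ln, sec in zip(lines, sections) if sec == "title")
    }
    for ln, field in zip(block, fields):
        result.setdefault(field, []).append(ln)
    return result


# Invented sample input, standing in for lines extracted from a PDF.
SAMPLE_LINES = [
    "Two-Layer Metadata Extraction Example",
    "Jane Doe",
    "Example University",
    "jane.doe@example.org",
    "Most scientific articles are available in PDF format.",
]
print(extract(SAMPLE_LINES))
```

On the sample input, the first pass labels the opening line as the title, the next three lines as the author block, and the last line as body text; the second pass then splits the author block into author, affiliation, and email. Restricting the second pass to one section keeps its label set small, which is the main appeal of a layered design.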
Index Terms
- ARCTIC: metadata extraction from scientific papers in pdf using two-layer CRF