research-article

ARCTIC: metadata extraction from scientific papers in pdf using two-layer CRF

Authors:
Alan Souza, Universidade Federal do Rio Grande do Sul, Porto Alegre, Brazil
Viviane Moreira, Universidade Federal do Rio Grande do Sul, Porto Alegre, Brazil
Carlos Heuser, Universidade Federal do Rio Grande do Sul, Porto Alegre, Brazil

DocEng '14: Proceedings of the 2014 ACM symposium on Document engineering, September 2014, Pages 121–130
https://doi.org/10.1145/2644866.2644872

Published: 16 September 2014

ABSTRACT

Most scientific articles are available in PDF format. The PDF standard allows metadata to be embedded in the document. However, many authors do not fill in this information, which makes the embedded metadata unreliable or incomplete. This has motivated research on extracting metadata automatically, a task that has been identified as one of the most challenging in document engineering. This work proposes Artic, a method for metadata extraction from scientific papers that employs a two-layer probabilistic framework based on Conditional Random Fields. The first layer identifies the main sections that contain metadata, and the second layer finds, for each section, the corresponding metadata. Given a PDF file containing a scientific paper, Artic extracts the title, author names, emails, affiliations, and venue information. We report on experiments using 100 real papers from a variety of publishers. Our results outperformed the state-of-the-art system used as the baseline, achieving a precision of over 99%.
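
The two-layer idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the section labels, field labels, features, and weights below are hypothetical and hand-set, whereas a real CRF would learn its weights from annotated papers. The sketch only shows the control flow, in which a first linear-chain model labels the lines of a page with coarse sections and a second model labels the lines inside one section with metadata fields. Both layers share the same Viterbi decoder, which is the standard way to find the best label sequence in a linear-chain CRF.

```python
import re

def viterbi(obs, labels, emit, trans):
    """Best label sequence for a linear-chain model with additive scores."""
    if not obs:
        return []
    # best[i][y] = (score of the best path ending in label y at position i, backpointer)
    best = [{y: (emit(obs[0], y), None) for y in labels}]
    for o in obs[1:]:
        prev, row = best[-1], {}
        for y in labels:
            p = max(labels, key=lambda q: prev[q][0] + trans(q, y))
            row[y] = (prev[p][0] + trans(p, y) + emit(o, y), p)
        best.append(row)
    y = max(labels, key=lambda q: best[-1][q][0])
    path = [y]
    for row in reversed(best[1:]):  # follow backpointers to the first position
        y = row[y][1]
        path.append(y)
    return path[::-1]

# Layer 1 (hypothetical labels): coarse sections of the first page.
SECTIONS = ["TITLE", "AUTHORS", "VENUE"]

def emit1(o, y):
    i, text = o  # observation = (line index, line text)
    score = 0.0
    if y == "TITLE" and i == 0:
        score += 2.0
    if y == "AUTHORS" and ("@" in text or "University" in text):
        score += 2.0
    if y == "AUTHORS" and len(text.split()) <= 3:
        score += 0.5
    if y == "VENUE" and re.search(r"Proceedings|\b(19|20)\d\d\b", text):
        score += 2.0
    return score

def trans1(p, y):
    # Toy assumption: sections are contiguous and appear in a fixed order.
    if y == p:
        return 1.0
    if SECTIONS.index(y) == SECTIONS.index(p) + 1:
        return 0.5
    return -2.0

# Layer 2 (hypothetical labels): fields inside the AUTHORS section.
FIELDS = ["NAME", "AFFILIATION", "EMAIL"]

def emit2(text, y):
    if "@" in text:
        return 2.0 if y == "EMAIL" else 0.0
    if re.search(r"University|Institute", text):
        return 2.0 if y == "AFFILIATION" else 0.0
    return 1.0 if y == "NAME" else 0.0

def trans2(p, y):
    # Mild preference for the order name, affiliation, email.
    return 0.5 if (p, y) in {("NAME", "AFFILIATION"), ("AFFILIATION", "EMAIL")} else 0.0

def extract(lines):
    """Run layer 1 over all lines, then layer 2 over the AUTHORS lines only."""
    sections = viterbi(list(enumerate(lines)), SECTIONS, emit1, trans1)
    meta = {}
    for text, s in zip(lines, sections):
        if s != "AUTHORS":
            meta.setdefault(s.lower(), []).append(text)
    author_lines = [t for t, s in zip(lines, sections) if s == "AUTHORS"]
    for text, f in zip(author_lines, viterbi(author_lines, FIELDS, emit2, trans2)):
        meta.setdefault(f.lower(), []).append(text)
    return meta

# Made-up first page, already converted from PDF to text lines.
page = [
    "A Toy Paper About Metadata",
    "Ada Example",
    "Example University",
    "ada@example.org",
    "Proceedings of the Example Workshop 2014",
]
print(extract(page))
```

Splitting the task this way lets the second layer use features that only make sense inside a given section, such as an email pattern within the author block, instead of asking a single model to separate every field across the whole page.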

Published in
DocEng '14: Proceedings of the 2014 ACM symposium on Document engineering
September 2014
226 pages
ISBN: 9781450329491
DOI: 10.1145/2644866
General Chair: Steven Simske, Hewlett-Packard, Fort Collins, USA
Program Chair: Sebastian Rönnau, Zalando AG, Berlin, Germany
Copyright © 2014 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
crf
machine learning
metadata extraction
pdf
Acceptance Rates
DocEng '14 Paper Acceptance Rate: 15 of 41 submissions, 37%
Overall Acceptance Rate: 178 of 537 submissions, 33%