DOI: 10.1145/2644866.2644872
research-article

ARCTIC: metadata extraction from scientific papers in pdf using two-layer CRF

Published: 16 September 2014

ABSTRACT

Most scientific articles are available in PDF format. The PDF standard allows metadata to be embedded in the document, but many authors do not define this information, which makes the embedded metadata unreliable or incomplete. This has motivated research on extracting metadata automatically, a task identified as one of the most challenging in document engineering. This work proposes Artic, a method for metadata extraction from scientific papers that employs a two-layer probabilistic framework based on Conditional Random Fields (CRFs). The first layer identifies the main sections that contain metadata, and the second layer finds, for each section, the corresponding metadata. Given a PDF file containing a scientific paper, Artic extracts the title, author names, emails, affiliations, and venue information. We report on experiments using 100 real papers from a variety of publishers. Our results outperformed the state-of-the-art system used as the baseline, achieving a precision of over 99%.
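
The two-layer idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it only shows Viterbi decoding over linear-chain scores applied twice, first to label lines with a section and then to label the tokens inside one section. The label sets, feature rules, weights, and sample text below are all hypothetical, and the hand-set scores stand in for parameters that a real CRF would learn from annotated papers.

```python
# Toy two-layer sequence labelling. All labels, features, weights and
# sample lines are hypothetical stand-ins for trained CRF parameters.

def viterbi(obs, labels, emit, trans):
    """Return the highest-scoring label sequence for a non-empty obs.

    emit(label, item) and trans(prev, cur) return additive scores.
    """
    # best[i][y]: best score of any labelling of obs[:i + 1] ending in y
    best = [{y: emit(y, obs[0]) for y in labels}]
    back = []
    for item in obs[1:]:
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + trans(p, y))
            scores[y] = best[-1][prev] + trans(prev, y) + emit(y, item)
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Layer 1: label each (index, text) line with a section.
LINE_LABELS = ["TITLE", "AUTHOR_INFO", "BODY"]
LINE_TRANS = {("TITLE", "AUTHOR_INFO"): 1.0, ("AUTHOR_INFO", "AUTHOR_INFO"): 1.0,
              ("AUTHOR_INFO", "BODY"): 1.0, ("BODY", "BODY"): 1.0}

def line_emit(label, line):
    idx, text = line
    if label == "TITLE":
        return 2.0 if idx == 0 else -2.0
    if label == "AUTHOR_INFO":
        return 2.0 if "@" in text else 0.0
    return 2.0 if text.lower().startswith("abstract") else -0.5

def line_trans(prev, cur):
    return LINE_TRANS.get((prev, cur), -1.0)

# Layer 2: label each token inside the author section.
TOKEN_LABELS = ["NAME", "EMAIL", "AFFILIATION"]
AFFILIATION_WORDS = {"University", "Institute", "Department"}

def token_emit(label, tok):
    if label == "EMAIL":
        return 3.0 if "@" in tok else -3.0
    if label == "AFFILIATION":
        return 2.0 if tok in AFFILIATION_WORDS else 0.0
    return 0.5 if tok[:1].isupper() else -1.0

def token_trans(prev, cur):
    return 1.0 if prev == cur else 0.0

def extract(lines):
    """Run layer 1 over lines, then layer 2 over the author-section tokens."""
    sections = viterbi(list(enumerate(lines)), LINE_LABELS, line_emit, line_trans)
    fields = {"title": "", "NAME": [], "EMAIL": [], "AFFILIATION": []}
    fields["title"] = " ".join(t for t, s in zip(lines, sections) if s == "TITLE")
    tokens = [tok for t, s in zip(lines, sections)
              if s == "AUTHOR_INFO" for tok in t.split()]
    if tokens:
        tags = viterbi(tokens, TOKEN_LABELS, token_emit, token_trans)
        for tok, tag in zip(tokens, tags):
            fields[tag].append(tok)
    return fields

SAMPLE_LINES = [
    "Metadata Extraction with Conditional Random Fields",
    "Ana Silva ana@example.org",
    "Example University",
    "Abstract Most scientific articles are available in PDF.",
]

if __name__ == "__main__":
    print(extract(SAMPLE_LINES))
```

Splitting the work this way keeps each label set small: the line-level pass only has to separate sections, and the token-level pass only has to distinguish the fields that can occur inside one section.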

Published in

DocEng '14: Proceedings of the 2014 ACM symposium on Document engineering
September 2014
226 pages
ISBN: 9781450329491
DOI: 10.1145/2644866

Copyright © 2014 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

DocEng '14 paper acceptance rate: 15 of 41 submissions, 37%. Overall acceptance rate: 178 of 537 submissions, 33%.
