Error Detection of CRF-Based Bibliography Extraction from Reference Strings

Ohta, Manabu; Arauchi, Daiki; Takasu, Atsuhiro; Adachi, Jun

doi:10.1007/978-3-642-34752-8_29


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7634)

Included in the following conference series:

International Conference on Asian Digital Libraries


Abstract

We have proposed a method for parsing reference strings, which are usually listed at the end of research papers, to extract bibliographic elements such as the title. The method uses a conditional random field (CRF) to estimate the correct bibliographic label for each token in the token sequence generated from a reference string. Although we achieved reasonable parsing accuracies for a Japanese academic journal, errors are inevitable. Therefore, this paper proposes measures of confidence in CRF-based bibliography parsing to detect such parsing errors. It also reports an empirical evaluation of the proposed parsing based not only on its accuracy but also on how easily its errors can be detected. The experiments showed that the proposed measures reasonably indicated parsing errors and could be used to improve the quality of extracted bibliographies at a moderate manual post-editing cost.
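The idea of labeling each token with a CRF and then flagging low-confidence parses can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not define the proposed measures, so the sketch assumes one common choice, the probability of the best label sequence under a linear-chain model. The label set, the score tables, the function names, and the 0.5 threshold are all hypothetical.

```python
import math

# Hypothetical label set; the paper's actual bibliographic labels are not given here.
LABELS = ["AUTHOR", "TITLE", "YEAR"]


def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def viterbi(emit, trans):
    """Best label path and its score for a linear-chain model.

    emit[t][y] is the score of label y at token t;
    trans[a][b] is the score of moving from label a to label b.
    """
    k = len(emit[0])
    score = list(emit[0])
    back = []
    for t in range(1, len(emit)):
        new_score, ptr = [], []
        for y in range(k):
            prev = max(range(k), key=lambda a: score[a] + trans[a][y])
            ptr.append(prev)
            new_score.append(score[prev] + trans[prev][y] + emit[t][y])
        score = new_score
        back.append(ptr)
    last = max(range(k), key=lambda y: score[y])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, score[last]


def log_partition(emit, trans):
    """Log of the sum of exp(score) over all label paths (forward algorithm)."""
    k = len(emit[0])
    alpha = list(emit[0])
    for t in range(1, len(emit)):
        alpha = [
            logsumexp([alpha[a] + trans[a][y] for a in range(k)]) + emit[t][y]
            for y in range(k)
        ]
    return logsumexp(alpha)


def parse_with_confidence(emit, trans, threshold=0.5):
    """Return (labels, confidence, flagged); flagged marks a suspected error."""
    path, best = viterbi(emit, trans)
    confidence = math.exp(best - log_partition(emit, trans))
    return [LABELS[y] for y in path], confidence, confidence < threshold


if __name__ == "__main__":
    # Toy scores for a two-token reference string (assumed values).
    no_transition_preference = [[0.0] * 3 for _ in range(3)]
    clear = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
    ambiguous = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    print(parse_with_confidence(clear, no_transition_preference))
    print(parse_with_confidence(ambiguous, no_transition_preference))
```

Under this assumption, a parse whose best path holds most of the probability mass is accepted automatically, while a parse with a low value is routed to manual post-editing. The threshold controls the trade-off between residual errors and editing cost.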




Author information

Authors and Affiliations

Okayama University, Okayama, 700-8530, Japan
Manabu Ohta & Daiki Arauchi
National Institute of Informatics, Tokyo, 101-8430, Japan
Atsuhiro Takasu & Jun Adachi


Editor information

Editors and Affiliations

National Taiwan University, No. 1, Sec.4, Roosevelt Road, 10617, Taipei, Taiwan
Hsin-Hsi Chen
University of Technology Sydney, Broadway, PO Box 123, 2007, Sydney, NSW, Australia
Gobinda Chowdhury

About this paper

Cite this paper

Ohta, M., Arauchi, D., Takasu, A., Adachi, J. (2012). Error Detection of CRF-Based Bibliography Extraction from Reference Strings. In: Chen, HH., Chowdhury, G. (eds) The Outreach of Digital Libraries: A Globalized Resource Network. ICADL 2012. Lecture Notes in Computer Science, vol 7634. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-34752-8_29


DOI: https://doi.org/10.1007/978-3-642-34752-8_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-34751-1
Online ISBN: 978-3-642-34752-8
eBook Packages: Computer Science, Computer Science (R0)
