Abstract
We previously proposed a method for parsing the reference strings usually listed at the end of research papers in order to extract important bibliographic elements, such as titles. The method uses a conditional random field (CRF) to estimate the correct bibliographic label for each token in the token sequence generated from a reference string. Although the method achieved reasonable parsing accuracy on a Japanese academic journal, errors are inevitable. Therefore, this paper proposes confidence measures for CRF-based bibliography parsing that can be used to detect such parsing errors. It also reports an empirical evaluation of the parsing method in terms of both its accuracy and how easily its errors can be detected. The experiments showed that the proposed measures reasonably indicated parsing errors and could be used to improve the quality of extracted bibliographies at a moderate manual post-editing cost.
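As an illustration only, the following minimal sketch shows one way a confidence score could be used to flag parsed reference strings for manual post-editing. It assumes that per-token marginal probabilities are already available from a trained CRF. The function names, the minimum-marginal score, and the threshold value are hypothetical choices made for this example and are not taken from the paper, whose actual measures are not specified in the abstract.

```python
from typing import Dict, List

# Hypothetical input format: one dict per token, mapping each
# bibliographic label to its marginal probability under the CRF.
Marginals = List[Dict[str, float]]


def best_labels(marginals: Marginals) -> List[str]:
    """Pick the most probable label for each token."""
    return [max(dist, key=dist.get) for dist in marginals]


def min_marginal_confidence(marginals: Marginals) -> float:
    """Score a reference string by its least certain token.

    This is one possible confidence measure (an assumption of this
    sketch): the smallest of the per-token top-label probabilities.
    """
    return min(max(dist.values()) for dist in marginals)


def needs_review(marginals: Marginals, threshold: float = 0.9) -> bool:
    """Flag the parse for manual checking if confidence is below threshold."""
    return min_marginal_confidence(marginals) < threshold


# Toy marginals for two two-token reference strings (made-up numbers).
confident = [
    {"author": 0.98, "title": 0.02},
    {"title": 0.97, "author": 0.03},
]
uncertain = [
    {"author": 0.98, "title": 0.02},
    {"title": 0.55, "journal": 0.45},
]

for name, parse in (("confident", confident), ("uncertain", uncertain)):
    print(name, best_labels(parse), needs_review(parse))
```

Under a scheme like this, only the flagged parses would be passed to a human editor, which is where the trade-off between extraction quality and post-editing cost arises: a higher threshold catches more errors but sends more strings for review.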
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ohta, M., Arauchi, D., Takasu, A., Adachi, J. (2012). Error Detection of CRF-Based Bibliography Extraction from Reference Strings. In: Chen, HH., Chowdhury, G. (eds) The Outreach of Digital Libraries: A Globalized Resource Network. ICADL 2012. Lecture Notes in Computer Science, vol 7634. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-34752-8_29
DOI: https://doi.org/10.1007/978-3-642-34752-8_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-34751-1
Online ISBN: 978-3-642-34752-8
eBook Packages: Computer Science, Computer Science (R0)