Abstract
The effective use of digital libraries demands well-maintained bibliographic databases. The reference sections of academic papers, in particular, are rich in useful bibliographic information such as authors' names and paper titles. We therefore propose a method for automatically extracting bibliographic information from reference strings using a conditional random field (CRF). However, at least a few hundred reference strings are needed to train the CRF to a high extraction accuracy. In this paper, we propose using active sampling and pseudo-training data to reduce the amount of training data required, and we evaluate the associated training costs experimentally.
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Kawakami, N., Ohta, M., Takasu, A., Adachi, J. (2014). Cost Evaluation of CRF-Based Bibliography Extraction from Reference Strings. In: Tuamsuk, K., Jatowt, A., Rasmussen, E. (eds) The Emergence of Digital Libraries – Research and Practices. ICADL 2014. Lecture Notes in Computer Science, vol 8839. Springer, Cham. https://doi.org/10.1007/978-3-319-12823-8_28
DOI: https://doi.org/10.1007/978-3-319-12823-8_28
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-12822-1
Online ISBN: 978-3-319-12823-8
eBook Packages: Computer Science, Computer Science (R0)