Abstract
Retrospective documents are an important resource when constructing a large digital library. This paper proposes a method for analyzing bibliographic strings produced by Optical Character Recognition (OCR), using an extended hidden Markov model. The method can analyze bibliographic strings that contain recognition errors, which makes it possible to integrate the many documents that exist only as printed articles into a citation index. It provides a robust bibliographic matching function by combining a statistical description of the syntax of bibliographic strings, a language model, and an OCR error model. It also reduces the cost of preparing training data for parameter estimation by using records already held in a bibliographic database.
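To make the idea of a statistical syntax for bibliographic strings more concrete, the following is a minimal sketch of field labeling with a plain first-order hidden Markov model and Viterbi decoding. It is not the extended model proposed in the paper: the state set, the three observation classes, every probability value, and the helper names (`token_class`, `viterbi`, `label_fields`) are assumptions chosen for illustration, and the digit repair table is only a crude stand-in for a real OCR error model.

```python
import math

# Hypothetical field states for a reference of the form
# "<initial> <surname> <title words...> <year>".
STATES = ["initial", "surname", "title", "year"]

# Hand-set illustrative parameters (assumptions, not estimated from data).
START = {"initial": 0.8, "surname": 0.2}
TRANS = {
    "initial": {"initial": 0.2, "surname": 0.8},
    "surname": {"initial": 0.4, "title": 0.6},
    "title": {"title": 0.8, "year": 0.2},
    "year": {"year": 1.0},
}
EMIT = {
    "initial": {"letter": 0.90, "word": 0.05, "number": 0.05},
    "surname": {"letter": 0.05, "word": 0.90, "number": 0.05},
    "title": {"letter": 0.05, "word": 0.90, "number": 0.05},
    "year": {"letter": 0.05, "word": 0.05, "number": 0.90},
}

# Crude stand-in for an OCR error model: common letter-for-digit confusions.
OCR_DIGIT_FIXES = str.maketrans("OoIl", "0011")


def log(p):
    """Logarithm that maps probability zero to minus infinity."""
    return math.log(p) if p > 0 else float("-inf")


def token_class(token):
    """Map a token to a coarse observation symbol."""
    core = token.strip(".,;:()")
    repaired = core.translate(OCR_DIGIT_FIXES)
    # Accept a token as a number if it contains a digit and becomes
    # all digits after undoing the assumed OCR confusions.
    if any(c.isdigit() for c in core) and repaired.isdigit():
        return "number"
    if len(core) == 1 and core.isalpha():
        return "letter"
    return "word"


def viterbi(observations):
    """Return the most probable state sequence for the observations."""
    best = {
        s: log(START.get(s, 0.0)) + log(EMIT[s][observations[0]])
        for s in STATES
    }
    back = []
    for obs in observations[1:]:
        prev_best, best, pointers = best, {}, {}
        for s in STATES:
            # Best predecessor of state s at this position.
            prev = max(
                STATES,
                key=lambda p: prev_best[p] + log(TRANS[p].get(s, 0.0)),
            )
            best[s] = (
                prev_best[prev]
                + log(TRANS[prev].get(s, 0.0))
                + log(EMIT[s][obs])
            )
            pointers[s] = prev
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    state = max(STATES, key=best.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


def label_fields(reference):
    """Pair each whitespace-separated token with its decoded field."""
    tokens = reference.split()
    return list(zip(tokens, viterbi([token_class(t) for t in tokens])))


if __name__ == "__main__":
    sample = "A. Takasu Statistical Analysis of Bibliographic Strings 2O02"
    for token, field in label_fields(sample):
        print(f"{field:8s} {token}")
```

With these hand-set values, the misrecognized token `2O02` (a letter O in place of a zero) is still decoded as the year field, because the observation mapping tolerates the confusion and the transition structure expects a year after the title. In a realistic setting the parameters would be estimated from data, for example from records in a bibliographic database as the abstract suggests, rather than set by hand.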
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Takasu, A. (2002). Statistical Analysis of Bibliographic Strings for Constructing an Integrated Document Space. In: Agosti, M., Thanos, C. (eds) Research and Advanced Technology for Digital Libraries. ECDL 2002. Lecture Notes in Computer Science, vol 2458. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45747-X_6
DOI: https://doi.org/10.1007/3-540-45747-X_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44178-6
Online ISBN: 978-3-540-45747-3
eBook Packages: Springer Book Archive