Abstract
Record linkage is an important application area of text pattern analysis. In this paper we propose a new sequence labeling method for extracting entities from a string for record linkage. The method combines a classifier with a Hidden Markov Model (HMM) so that it can use both the syntactical and the textual information in the string: the classifier handles the textual information, and it is integrated with the HMM statistically by estimating the classifier's error probability. We first describe the model and then discuss its parameter estimation. We applied the method to bibliographic sequence labeling, in which bibliographic components are extracted from reference strings, and compared it with methods that use textual or syntactical information alone; the proposed method outperformed them.
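The abstract does not give the model itself, so the following is only an illustrative sketch of one way a classifier could be integrated with an HMM through its error probability. It assumes that the classifier assigns a label to each token, that the HMM treats this label as its observation, and that the emission probability is the classifier's confusion matrix P(predicted label | true label). The label set, all probability values, and the names `STATES`, `START`, `TRANS`, `CONFUSION`, and `viterbi` are hypothetical and are not taken from the paper.

```python
import math

# Hypothetical label set for tokens of a reference string (assumption).
STATES = ["author", "title", "year"]

# Assumed HMM parameters modelling syntactical structure (field order).
START = {"author": 0.8, "title": 0.15, "year": 0.05}
TRANS = {
    "author": {"author": 0.6, "title": 0.35, "year": 0.05},
    "title": {"author": 0.05, "title": 0.7, "year": 0.25},
    "year": {"author": 0.1, "title": 0.1, "year": 0.8},
}

# Assumed classifier error model: CONFUSION[true][predicted] = P(predicted | true).
# In practice this could be estimated from the classifier's errors on held-out data.
CONFUSION = {
    "author": {"author": 0.8, "title": 0.15, "year": 0.05},
    "title": {"author": 0.2, "title": 0.75, "year": 0.05},
    "year": {"author": 0.02, "title": 0.03, "year": 0.95},
}


def viterbi(predicted):
    """Return the most probable true label sequence given classifier outputs."""
    if not predicted:
        return []
    # Log-probability of the best path ending in each state at the first token.
    score = {
        s: math.log(START[s]) + math.log(CONFUSION[s][predicted[0]])
        for s in STATES
    }
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for obs in predicted[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (
                score[prev]
                + math.log(TRANS[prev][s])
                + math.log(CONFUSION[s][obs])
            )
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    path = [max(STATES, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# The classifier mislabels the third token of a title as "author".
noisy = ["title", "title", "author", "title", "title"]
print(viterbi(noisy))
# ['title', 'title', 'title', 'title', 'title']
```

With these assumed parameters, the transition model makes a title-to-author jump unlikely, and the confusion matrix says that a true title token is fairly often misclassified as an author, so decoding overrides the isolated classifier error.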
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Takasu, A. (2005). A Sequence Labeling Method Using Syntactical and Textual Patterns for Record Linkage. In: Singh, S., Singh, M., Apte, C., Perner, P. (eds) Pattern Recognition and Data Mining. ICAPR 2005. Lecture Notes in Computer Science, vol 3686. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11551188_22
DOI: https://doi.org/10.1007/11551188_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-28757-5
Online ISBN: 978-3-540-28758-2
eBook Packages: Computer Science, Computer Science (R0)