A Sequence Labeling Method Using Syntactical and Textual Patterns for Record Linkage

Takasu, Atsuhiro

doi:10.1007/11551188_22

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 3686)

Included in the following conference series:

International Conference on Pattern Recognition and Image Analysis

Abstract

Record linkage is an important application area of text pattern analysis. In this paper we propose a new sequence labeling method that can be used to extract entities from a string for record linkage. The proposed method combines a classifier and a Hidden Markov Model (HMM) to utilize both syntactical and textual information from the string. We first describe the model used in the proposed method and then discuss the parameter estimation for this model. The proposed method incorporates a classifier for handling textual information and integrates the classifier with the HMM statistically by estimating the error probability of the classifier. We applied the proposed method to the bibliographic sequence labeling problem, in which bibliographic components are extracted from reference strings. We compared the proposed method with other methods that use textual or syntactical information alone and showed that the proposed method outperforms them.
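
The abstract does not give the model's details, so the following is only a minimal sketch of one way such an integration could work, not the paper's actual formulation. It assumes that a per-token classifier has already assigned a label to each token from its text, that the HMM's hidden states are the true bibliographic components, and that the emission probability P(predicted label | true label) comes from the classifier's error rates. All labels and probability values are invented for illustration.

```python
import math


def viterbi(observed, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-label sequence for `observed`.

    observed: classifier-predicted labels, one per token
    trans_p[r][s]: P(next true label is s | current true label is r)
    emit_p[s][o]: P(classifier outputs o | true label is s), i.e. the
                  classifier's error model (assumed to be estimated elsewhere)
    """
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s]: best log-probability of any label path ending in s at step t
    best = [{s: log(start_p[s]) + log(emit_p[s][observed[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(observed)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((r, best[t - 1][r] + log(trans_p[r][s])) for r in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + log(emit_p[s][observed[t]])
            back[t][s] = prev

    # Trace the best path backwards from the most probable final label.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observed) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Hypothetical toy parameters (not taken from the paper).
states = ["author", "title", "year"]
start_p = {"author": 0.8, "title": 0.15, "year": 0.05}
trans_p = {  # syntactical information: typical ordering of components
    "author": {"author": 0.6, "title": 0.35, "year": 0.05},
    "title": {"author": 0.05, "title": 0.7, "year": 0.25},
    "year": {"author": 0.05, "title": 0.05, "year": 0.9},
}
emit_p = {  # textual information: how often the classifier confuses labels
    "author": {"author": 0.8, "title": 0.15, "year": 0.05},
    "title": {"author": 0.2, "title": 0.75, "year": 0.05},
    "year": {"author": 0.02, "title": 0.03, "year": 0.95},
}

# Classifier output for six tokens; the fourth label is a plausible mistake.
predicted = ["author", "author", "title", "author", "title", "year"]
decoded = viterbi(predicted, states, start_p, trans_p, emit_p)
print(decoded)
```

With these toy values, the isolated "author" prediction inside the title span is relabeled as "title": the transition model makes a title-to-author jump unlikely, and the error model says the classifier mistakes title tokens for author tokens fairly often.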

Author information

Authors and Affiliations

National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo, 101-8430, Japan
Atsuhiro Takasu

Editor information

Editors and Affiliations

Research School of Informatics, Loughborough, UK
Sameer Singh
ATR Lab, Research School of Informatics, University of Loughborough, Loughborough, UK
Maneesha Singh
IBM Corporation, 1133 Westchester Avenue, White Plains, 10604, New York, United States
Chid Apte
Institute of Computer Vision and applied Computer Sciences, IBaI, Germany
Petra Perner

About this paper

Cite this paper

Takasu, A. (2005). A Sequence Labeling Method Using Syntactical and Textual Patterns for Record Linkage. In: Singh, S., Singh, M., Apte, C., Perner, P. (eds) Pattern Recognition and Data Mining. ICAPR 2005. Lecture Notes in Computer Science, vol 3686. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11551188_22

DOI: https://doi.org/10.1007/11551188_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-28757-5
Online ISBN: 978-3-540-28758-2
eBook Packages: Computer Science, Computer Science (R0)
