Abstract
We propose a latent feature extraction method for record linkage. We first introduce a probabilistic model that generates records together with their latent topics. The generative model is designed to exploit co-occurrence among the attributes of a record. We then derive a topic estimation algorithm based on Gibbs sampling. The estimated topics are used to match records. The algorithm is unsupervised; i.e., it does not require labor-intensive preparation of training data. We evaluated the model on bibliographic records and found that the method tended to perform better on records with more attributes, because it exploits their co-occurrence.
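The abstract does not give the model's equations, so the following is only a minimal sketch of the general idea, not the authors' model. It assumes a simplified setup in which each record has a single latent topic, each attribute value is drawn from a topic-specific categorical distribution with a symmetric Dirichlet prior, and topics are estimated by collapsed Gibbs sampling. The function name gibbs_record_topics, the hyperparameters alpha and beta, and the sample records are all hypothetical.

```python
import random
from collections import defaultdict


def gibbs_record_topics(records, num_topics, iters=200, alpha=1.0, beta=0.1, seed=0):
    """Collapsed Gibbs sampler for a one-topic-per-record mixture (illustrative only).

    records: list of equal-length tuples, one value per attribute.
    Returns a list with one sampled topic index per record.
    """
    rng = random.Random(seed)
    num_attrs = len(records[0])
    # Number of distinct values observed for each attribute.
    vocab_sizes = [len({rec[a] for rec in records}) for a in range(num_attrs)]

    # Random initial assignment and the count tables it implies.
    z = [rng.randrange(num_topics) for _ in records]
    topic_count = [0] * num_topics
    value_count = [[defaultdict(int) for _ in range(num_attrs)]
                   for _ in range(num_topics)]
    for rec, k in zip(records, z):
        topic_count[k] += 1
        for a, v in enumerate(rec):
            value_count[k][a][v] += 1

    for _ in range(iters):
        for i, rec in enumerate(records):
            # Remove record i from the counts before resampling its topic.
            k_old = z[i]
            topic_count[k_old] -= 1
            for a, v in enumerate(rec):
                value_count[k_old][a][v] -= 1

            # Unnormalized conditional: topic prior times the product of
            # per-attribute predictive probabilities. Attributes that
            # co-occur within a topic reinforce each other here.
            weights = []
            for k in range(num_topics):
                w = topic_count[k] + alpha
                for a, v in enumerate(rec):
                    w *= (value_count[k][a][v] + beta) / (
                        topic_count[k] + beta * vocab_sizes[a])
                weights.append(w)

            k_new = rng.choices(range(num_topics), weights=weights)[0]
            z[i] = k_new
            topic_count[k_new] += 1
            for a, v in enumerate(rec):
                value_count[k_new][a][v] += 1
    return z


# Hypothetical bibliographic records: (author, venue, year).
records = [
    ("getoor", "sdm", "2006"),
    ("getoor", "sdm", "2006"),
    ("blei", "jmlr", "2003"),
    ("blei", "jmlr", "2003"),
    ("getoor", "sdm", "2005"),
    ("blei", "jmlr", "2004"),
]
topics = gibbs_record_topics(records, num_topics=2)
```

Records that end up with the same topic index are candidates for matching. Because the sampler is stochastic, the assignment depends on the seed and on the number of iterations, and a single sample is only one draw from the posterior.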
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Takasu, A., Fukagawa, D., Akutsu, T. (2009). Latent Topic Extraction from Relational Table for Record Matching. In: Gama, J., Costa, V.S., Jorge, A.M., Brazdil, P.B. (eds) Discovery Science. DS 2009. Lecture Notes in Computer Science, vol 5808. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04747-3_38
DOI: https://doi.org/10.1007/978-3-642-04747-3_38
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-04746-6
Online ISBN: 978-3-642-04747-3
eBook Packages: Computer Science, Computer Science (R0)