Abstract
Record linkage is the task of identifying records in several databases that refer to the same entity. It aims to uncover relationships between entities that normally lack common identifiers across heterogeneous datasets. When entities contain multiple relational records, linking them across datasets can be made more accurate by treating the records as groups, which leads to group linking methods. Even so, individual record links may still be needed for the final group linking step. This problem can be addressed by multiple instance learning, in which group links are modelled as bags and record links as instances. In this paper, we propose a novel method for instance classification and group record linkage via bag reconstruction from instances. The bag reconstruction is based on modelling the distribution of negative instances in the training bags via kernel density estimation. We evaluate this approach on both synthetic and real-world data. Our results show that the proposed method can outperform several baseline methods.
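To make the multiple instance setting more concrete, the following is a minimal sketch of how a kernel density estimate over negative instances might be used to label instances and bags. It is not the bag reconstruction algorithm of the paper, whose details the abstract does not give. It assumes the standard multiple instance assumption (a bag is positive if at least one instance is positive), and the function names, toy similarity vectors, bandwidth, and threshold are all hypothetical.

```python
import math


def gaussian_kde(points, bandwidth):
    """Return a Gaussian kernel density estimate built from `points`.

    `points` is a list of equal-length lists of floats. The bandwidth is a
    hypothetical fixed value; it would normally be tuned.
    """
    dim = len(points[0])
    norm = len(points) * (bandwidth * math.sqrt(2 * math.pi)) ** dim

    def density(x):
        total = 0.0
        for p in points:
            sq_dist = sum((a - b) ** 2 for a, b in zip(x, p))
            total += math.exp(-sq_dist / (2 * bandwidth ** 2))
        return total / norm

    return density


def label_instances(bag, neg_density, threshold):
    """Flag an instance as positive when it is unlikely under the negative density."""
    return [neg_density(x) < threshold for x in bag]


def label_bag(bag, neg_density, threshold):
    """Standard MIL assumption: a bag is positive if any instance is positive."""
    return any(label_instances(bag, neg_density, threshold))


# Toy data (made up): each instance is a vector of attribute similarities
# for one candidate record pair; each bag is one candidate group pair.
negative_bags = [
    [[0.10, 0.20], [0.20, 0.10]],
    [[0.15, 0.25], [0.05, 0.10], [0.20, 0.20]],
]

# Every instance in a negative bag is negative, so all of them can be pooled.
neg_instances = [x for bag in negative_bags for x in bag]
neg_density = gaussian_kde(neg_instances, bandwidth=0.1)

threshold = 1.0  # hypothetical cut-off on the negative density
candidate_bag = [[0.12, 0.18], [0.90, 0.95]]

print(label_instances(candidate_bag, neg_density, threshold))
print(label_bag(candidate_bag, neg_density, threshold))
```

In this sketch the second instance of `candidate_bag` lies far from every pooled negative instance, so its estimated negative density is very low and it is flagged as a likely record link, which in turn marks the bag as a group link.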
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Fu, Z., Zhou, J., Peng, F., Christen, P. (2012). A Bag Reconstruction Method for Multiple Instance Classification and Group Record Linkage. In: Zhou, S., Zhang, S., Karypis, G. (eds) Advanced Data Mining and Applications. ADMA 2012. Lecture Notes in Computer Science, vol 7713. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35527-1_21
DOI: https://doi.org/10.1007/978-3-642-35527-1_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35526-4
Online ISBN: 978-3-642-35527-1
eBook Packages: Computer Science, Computer Science (R0)