Abstract
Entity resolution is a common data cleaning and data integration problem that involves determining which records in one or more data sets refer to the same real-world entities. It has numerous applications for commercial, academic and government organisations. For most practical entity resolution applications, training data does not exist, which limits the types of classification models that can be applied. This also prevents complex techniques such as Markov logic networks from being used on real-world problems. In this paper we apply an active learning based technique to generate training data for a Markov logic network based entity resolution model and to learn the weights for the formulae in the Markov logic network. We evaluate our technique on real-world data sets and show that we can generate balanced training data and learn approximate weights for the formulae in the Markov logic network.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Fisher, J., Christen, P., Wang, Q. (2016). Active Learning Based Entity Resolution Using Markov Logic. In: Bailey, J., Khan, L., Washio, T., Dobbie, G., Huang, J., Wang, R. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2016. Lecture Notes in Computer Science, vol 9652. Springer, Cham. https://doi.org/10.1007/978-3-319-31750-2_27
DOI: https://doi.org/10.1007/978-3-319-31750-2_27
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-31749-6
Online ISBN: 978-3-319-31750-2
eBook Packages: Computer Science, Computer Science (R0)