Abstract
Entity resolution (ER) is an important problem with widespread applications in many areas, including e-commerce, health care, social science, and crime and fraud detection. One aspect that has largely been neglected, however, is monitoring the quality of entity resolution and repairing erroneous matching decisions over time. In this paper we develop an efficient method for incrementally repairing ER, i.e., fixing detected erroneous matches and non-matches. Our method is based on an efficient clustering algorithm that eliminates inconsistencies among matching decisions, and an efficient provenance indexing data structure that allows us to trace the evidence behind each clustering decision in support of ER repair. We have evaluated our method on real-world databases, and our experimental results show that the quality of entity resolution can be significantly improved through repairing over time.
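To make the idea of evidence-traced repair concrete, the following is a minimal sketch and not the algorithm from the paper. It assumes that clusters are the connected components of pairwise match decisions and that the "provenance index" is simply a map from each cluster to the match pairs supporting it. The names `components`, `RepairableClustering`, `evidence`, and `retract_match` are hypothetical and chosen for illustration only.

```python
from collections import defaultdict


def components(records, edges):
    """Return the connected components of `records` under undirected `edges`."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, result = set(), []
    for r in sorted(records):
        if r in seen:
            continue
        stack, comp = [r], set()
        while stack:
            x = stack.pop()
            if x in comp:
                continue
            comp.add(x)
            stack.extend(adj[x] - comp)
        seen |= comp
        result.append(frozenset(comp))
    return result


class RepairableClustering:
    """Hypothetical illustration: clusters plus the match pairs that justify them."""

    def __init__(self, records, matches):
        # Assumed provenance index: cluster -> set of supporting match pairs.
        self.evidence = {}
        self._build(set(records), {frozenset(m) for m in matches})

    def _build(self, records, edges):
        # Cluster the given records and record which pairs support each cluster.
        for comp in components(records, [tuple(e) for e in edges]):
            self.evidence[comp] = {e for e in edges if e <= comp}

    def cluster_of(self, record):
        for cluster in self.evidence:
            if record in cluster:
                return cluster
        raise KeyError(record)

    def retract_match(self, a, b):
        """Repair an erroneous match by re-clustering only the affected cluster."""
        cluster = self.cluster_of(a)
        edges = self.evidence.pop(cluster)
        edges.discard(frozenset((a, b)))
        self._build(set(cluster), edges)


rc = RepairableClustering(["a", "b", "c", "d"], [("a", "b"), ("b", "c")])
rc.retract_match("b", "c")  # the match between b and c was found to be wrong
print(sorted(sorted(c) for c in rc.evidence))
```

Because the supporting pairs are stored per cluster, retracting one match touches only the records of that cluster rather than the whole database, which is the intuition behind repairing incrementally.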
Notes
- 1. Available from: http://www.cs.umass.edu/~mccallum/.
- 2. Available from: ftp://alt.ncsbe.gov/data/.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Wang, Q., Gao, J., Christen, P. (2016). A Clustering-Based Framework for Incrementally Repairing Entity Resolution. In: Bailey, J., Khan, L., Washio, T., Dobbie, G., Huang, J., Wang, R. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2016. Lecture Notes in Computer Science, vol 9652. Springer, Cham. https://doi.org/10.1007/978-3-319-31750-2_23
DOI: https://doi.org/10.1007/978-3-319-31750-2_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-31749-6
Online ISBN: 978-3-319-31750-2
eBook Packages: Computer Science, Computer Science (R0)