Abstract
This paper proposes a notion of entity enhancing, which unifies entity resolution and conflict resolution, to identify tuples that refer to the same real-world entity and, at the same time, correct semantic inconsistencies. We propose to unify rule-based and machine learning (ML) methods for entity enhancing, by embedding ML classifiers as predicates in logic rules. We model entity enhancing by extending the chase. We show that the chase warrants correctness justification and the Church-Rosser property. Moreover, we settle fundamental problems associated with entity enhancing, including the enhancing, consistency, satisfiability, and implication problems, ranging from NP-complete and coNP-complete to $\Pi_2^p$-complete. Taken together, these provide a new theoretical framework for unifying entity resolution and conflict resolution.
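As a rough illustration of embedding an ML classifier as a predicate in a logic rule, the Python sketch below applies one matching rule and one value-correcting rule repeatedly until nothing changes. Everything in it is hypothetical: `ml_similar` stands in for a trained classifier by thresholding a string-similarity score, and the rule shapes, the field names (`name`, `zip`, `city`), and the fixpoint loop are assumptions made for illustration. It is not the paper's formal definition of the rules or of the extended chase.

```python
from difflib import SequenceMatcher
from itertools import combinations


def ml_similar(a, b, threshold=0.8):
    """Hypothetical stand-in for an ML classifier used as a Boolean predicate."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def find(parent, x):
    """Union-find lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def enhance(tuples):
    """Apply two illustrative rules until a fixpoint is reached.

    Rule 1 (matching):   same zip and ml_similar(name) -> same entity.
    Rule 2 (correcting): same entity and one city missing -> copy the other.
    Conflicting non-missing cities are deliberately left untouched here.
    """
    data = [dict(t) for t in tuples]  # work on a copy of the input
    parent = {i: i for i in range(len(data))}
    changed = True
    while changed:
        changed = False
        for i, j in combinations(range(len(data)), 2):
            s, t = data[i], data[j]
            ri, rj = find(parent, i), find(parent, j)
            # Rule 1: the ML predicate sits next to an ordinary equality test.
            if ri != rj and s["zip"] == t["zip"] and ml_similar(s["name"], t["name"]):
                parent[max(ri, rj)] = min(ri, rj)
                changed = True
            # Rule 2: fill in a missing value once the tuples are matched.
            if find(parent, i) == find(parent, j):
                if s["city"] is None and t["city"] is not None:
                    s["city"] = t["city"]
                    changed = True
                elif t["city"] is None and s["city"] is not None:
                    t["city"] = s["city"]
                    changed = True
    entity = {i: find(parent, i) for i in parent}
    return data, entity


records = [
    {"name": "Jonathan Smith", "zip": "EH8 9AB", "city": "Edinburgh"},
    {"name": "Jonathon Smith", "zip": "EH8 9AB", "city": None},
    {"name": "Mary Jones", "zip": "EH8 9AB", "city": None},
]
enhanced, entity = enhance(records)
```

In this toy run the first two records are matched and the missing city of the second is filled in from the first, while the third record stays a separate entity. The loop terminates because each step either merges two entity classes or fills a missing value, and both can happen only finitely often.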
Acknowledgements
This work was supported in part by the Shenzhen Institute of Computing Sciences, the Beijing Advanced Innovation Center for Big Data and Brain Computing (Beihang University), a Royal Society Wolfson Research Merit Award (Grant No. WRM/R1/180014), the European Research Council (Grant No. 652976), and the Engineering and Physical Sciences Research Council (Grant No. EP/M025268/1).
Author information
Additional information
Professor Wenfei Fan is the chair of web data management at the University of Edinburgh, UK, the chief scientist of Shenzhen Institute of Computing Sciences, and a chief scientist of Beijing Advanced Innovation Center for Big Data and Brain Computing, China. He received his Ph.D. from the University of Pennsylvania (USA), and his M.Sc. and B.Sc. from Peking University (China). He joined the University of Edinburgh in 2004; prior to that, he was a member of technical staff at Bell Laboratories in Murray Hill, NJ, USA.
He is a foreign member of the Chinese Academy of Sciences, a fellow of the Royal Society (FRS), a fellow of the Royal Society of Edinburgh (FRSE), a member of the Academy of Europe (MAE), and an ACM Fellow (FACM). He is a recipient of the Royal Society Wolfson Research Merit Award in 2018, the ERC Advanced Fellowship in 2015, the Roger Needham Award in 2008 (UK), the Yangtze River Scholar award in 2007 (China), the Outstanding Overseas Young Scholar Award in 2003 (China), the Career Award in 2001 (USA), and several Test-of-Time and Best Paper Awards (Alberto O. Mendelzon Test-of-Time Award of ACM PODS 2015 and 2010, Best Paper Awards for SIGMOD 2017, VLDB 2010, ICDE 2007, and Computer Networks 2002).
Prof. Fan “has made fundamental contributions to both theory and practice of data management. He has both formalized the problems of querying big data and has developed radically new techniques that overcome the limits associated with conventional database systems. In addition, he has made seminal contributions to data quality, in which he devised new techniques for data cleaning that have found wide commercial adoption. He has also contributed to our understanding of semi-structured data” (cf. the Royal Society, UK). His current research interests include database theory and systems, in particular big data, data quality, data sharing, distributed computation, query languages, and social media marketing.
Cite this article
Fan, W., Lu, P. & Tian, C. Unifying logic rules and machine learning for entity enhancing. Sci. China Inf. Sci. 63, 172001 (2020). https://doi.org/10.1007/s11432-020-2917-1