Abstract
Entity resolution in settings with rich relational structure often introduces complex dependencies between co-references. Exploiting these dependencies is challenging—it requires seamlessly combining statistical, relational, and logical dependencies. One task of particular interest is entity resolution in familial networks. In this setting, multiple partial representations of a family tree are provided, from the perspective of different family members, and the challenge is to reconstruct a family tree from these multiple, noisy, partial views. This reconstruction is crucial for applications such as understanding genetic inheritance, tracking disease contagion, and performing census surveys. Here, we design a model that incorporates statistical signals (such as name similarity), relational information (such as sibling overlap), logical constraints (such as transitivity and bijective matching), and predictions from other algorithms (such as logistic regression and support vector machines), in a collective model. We show how to integrate these features using probabilistic soft logic, a scalable probabilistic programming framework. In experiments on real-world data, our model significantly outperforms state-of-the-art classifiers that use relational features but are incapable of collective reasoning.
Similar content being viewed by others
Notes
Code and data available at: https://github.com/pkouki/icdm2017.
ancestry.com, genealogy.com, familysearch.org, 23andMe.com.
References
Arasu A, Ré C, Suciu D (2009) Large-scale deduplication with constraints using dedupalog. In: IEEE international conference on data engineering (ICDE)
Bach S, Broecheler M, Huang B, Getoor L (2017) Hinge-loss markov random fields and probabilistic soft logic. J Mach Learn Res (JMLR) 18(109):1–67
Bach S, Huang B, London B, Getoor L (2013) Hinge-loss Markov random fields: convex inference for structured prediction. In: Uncertainty in artificial intelligence (UAI)
Belin T, Rubin D (1995) A method for calibrating false-match rates in record linkage. J Am Stat Assoc 90(430):694–707
Bhattacharya I, Getoor L (2007) Collective entity resolution in relational data. ACM Trans Knowl Discov Data (TKDD) 1(1). https://doi.org/10.1145/1217299.1217304
Cessie S, Houwelingen J (1992) Ridge estimators in logistic regression. Appl Stat 41(1):191–201
Chang C, Lin C (2011) Libsvm: a library for support vector machines. ACM Trans Intell Syst Technol (TIST) 2(3):2:27:1–27:27
Christen P (2012) Data matching: concepts and techniques for record linkage, entity resolution, and duplicate detection. Springer, Berlin
Culotta A, McCallum A (2005) Joint deduplication of multiple record types in relational data. In: ACM international conference on information and knowledge management (CIKM)
Dong X, Halevy A, Madhavan J (2005) Reference reconciliation in complex information spaces. In: ACM special interest group on management of data (SIGMOD)
Driessens K, Reutemann P, Pfahringer B, Leschi C (2006) Using weighted nearest neighbor to benefit from unlabeled data. In: Pacific-Asia conference on knowledge discovery and data mining (PAKDD)
Efremova J, Ranjbar-Sahraei B, Rahmani H, Oliehoek F, Calders T, Tuyls K, Weiss G (2015) Multi-source entity resolution for genealogical data, population reconstruction
Fellegi P, Sunter B (1969) A theory for record linkage. J Am Stat Assoc 64(328):1183–1210
Frank E, Hall M, Witten I (2016) The WEKA Workbench. In: Gray J (ed) Practical machine learning tools and techniques. Morgan Kaufmann, Burlington (Online appendix for data mining)
Goergen A, Ashida S, Skapinsky K, de Heer H, Wilkinson A, Koehly L (2016) Knowledge is power: improving family health history knowledge of diabetes and heart disease among multigenerational mexican origin families. Public Health Genomics 19(2):93–101
Hand D, Christen P (2017) A note on using the f-measure for evaluating record linkage algorithms. Stat Comput 28(3):539–547
Hanneman R, Riddle F (2005) Introduction to social network methods. University of California, Riverside
Harron K, Wade A, Gilbert R, Muller-Pebody B, Goldstein H (2014) Evaluating bias due to data linkage error in electronic healthcare records. BMC Med Res Methodol 14:36
Hsu C, Chang C, Lin C (2003) A practical guide to support vector classification. Technical report, Department of Computer Science, National Taiwan University
Kalashnikov D, Mehrotra S (2006) Domain-independent data cleaning via analysis of entity-relationship graph. ACM Trans Database Syst (TODS) 31(2):716–767
Kouki P, Marcum C, Koehly L, Getoor L (2016) Entity resolution in familial networks. In: SIGKDD conference on knowledge discovery and data mining (KDD), workshop on mining and learning with graphs
Kouki P, Pujara J, Marcum C, Koehly L, Getoor L (2017) Collective entity resolution in familial networks. In: IEEE international conference on data mining (ICDM)
Landwehr N, Hall M, Frank E (2005) Logistic model trees. Mach Learn 95(1–2):161–205
Li X, Shen C (2008) Linkage of patient records from disparate sources. Stat Methods Med Res 22(1):31–8
Lin J, Marcum C, Myers M, Koehly L (2017) Put the family back in family health history: a multiple-informant approach. Am J Prev Med 5(52):640–644
Navarro G (2001) A guided tour to approximate string matching. ACM Comput Surv 33(1):31–88
Newcombe H (1988) Handbook of record linkage: methods for health and statistical studies, administration, and business. Oxford University Press Inc, Oxford
Nowozin S, Gehler P, Jancsary J, Lampert C (2014) Advanced structured prediction. The MIT Press, Cambridge
Platanios E, Poon H, Mitchell T, Horvitz E (2017) Estimating accuracy from unlabeled data: a probabilistic logic approach. In: Conference on neural information processing systems (NIPS)
Pujara J, Getoor L (2016) Generic statistical relational entity resolution in knowledge graphs. In: International joint conference on artificial intelligence (IJCAI), workshop on statistical relational artificial intelligence (StarAI)
Rastogi V, Dalvi N, Garofalakis M (2011) Large-scale collective entity matching. In: International conference on very large databases (VLDB)
Singla P, Domingos P (2006) Entity resolution with Markov logic. In: IEEE international conference on data mining (ICDM)
Suchanek F, Abiteboul S, Senellart P (2011) Paris: probabilistic alignment of relations, instances, and schema. In: Proceedings of the very large data bases endowment (PVLDB), vol 5(3)
Winkler W (2006) Overview of record linkage and current research directions. Technical report, US Census Bureau
Acknowledgements
We would like to thank Peter Christen and Jon Berry for insightful comments on this paper. This work was partially supported by the National Science Foundation Grants IIS-1218488, CCF-1740850, and IIS-1703331 and by the National Human Genome Research Institute Division of Intramural Research at the National Institutes of Health (ZIA HG2000397 and ZIA HG200395, Koehly PI). We would also like to thank the Sandia LDRD (Laboratory-Directed Research and Development) program for support. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation, the National Institutes of Health, or the Sandia Labs.
Author information
Authors and Affiliations
Corresponding author
APPENDIX: PSL model rules
APPENDIX: PSL model rules
Name similarity rules
Personal information similarity rules
Relational similarity rules of 1st degree
Relational similarity rules of 2nd degree
Transitive relational (similarity) rules of 1st degree
Transitive relational (similarity) rules of 2nd degree
Bijection and transitivity rules
Rules to leverage existing classification algorithms
Prior rule
Rights and permissions
About this article
Cite this article
Kouki, P., Pujara, J., Marcum, C. et al. Collective entity resolution in multi-relational familial networks. Knowl Inf Syst 61, 1547–1581 (2019). https://doi.org/10.1007/s10115-018-1246-2
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10115-018-1246-2